A few weeks ago, we had near-simultaneous disk drive failures in two of our Linux PCs: both desktops are over four years old, and they’ve been spinning nearly 100% of that time. Seeing independent hard drives fail at roughly the same time, while perhaps a bit unusual, is not unheard-of… and it did happen. So, we’ve been limping along using and sharing a single Ubuntu laptop for a while, during which I (as the sole source of IT services in our household) have been contemplating the best path to recovering and rebuilding these two desktop PCs. I successfully completed the first of these system recoveries this past weekend. (My dear spouse is put out with me because I did my PC, not hers, first… “Honey, I had to practice on mine in case I blew something up!”)
I am documenting this system rebuild experience here, in a series of posts, for several reasons:
- Documenting system rebuilds will likely help me (and others) with a future system rebuild/recovery, as such things do happen at unexpected times… and hardware will certainly give up the ghost again, when least expected and least convenient.
- Researching and collecting system diagnosis and recovery information on the ‘Net is tedious and frustrating, with lots of misleading, opinionated, irrelevant, incomplete, and otherwise just-plain-wrong information out there — which obscures the rare and precious gems of wisdom which are correct, on-target and relevant to a given situation. Recording what I’ve learned here will capture those nuggets, saving the time and effort of re-sifting through all that dross in the future.
- System failures occur unpredictably and infrequently, and “my memory ain’t what it used to be.” When the next one happens (and it will), I’ll need these breadcrumbs of experience to help me recover my current brilliance for the next set of repairs.
- Based on the wild success of my previous blog posting, “If It Ain’t Broke — Or How I Upgraded my DSL Service in 12 Not-So-Easy Steps” (part 1 and part 2), writing a complete narrative of problem-research-fix-&-recovery can be helpful to other folks. Hopefully, this series of posts will serve a similar purpose and help save time and effort for others…
This series of posts will try to provide a coherent narrative of “how I fixed my PC”… Though specific to my own situation, they hopefully will be of general use and guidance for others faced with comparable disaster recovery problems.
And please note: What I’m addressing herein is a self-help approach for the “sole proprietor”-type of IT support gal or guy… what is appropriate to a small home office or home-based (the so-called SOHO) kind of computing installation, whether just a single PC or a home-network of devices. The larger needs of a business, commercial, industrial or military type of systems installation are, obviously, not directly addressed, as these situations demand a different level of considerations, including fail-safe levels of reliability, availability, security, integrity, legal issues and sheer scale, even if the basic principles herein may generally apply. Caveat emptor.
We’re (nearly) 100% pure Linux at our home/office, for good and many reasons… and no, this is not going to be another anti-Windows rant (my daughter keeps an XP-based laptop around to access a couple of her business-oriented websites which absolutely refuse to play nicely with a Linux/FOSS-based browser of any sort — dumb, IE-centric web designers! Oops… </mini-rant>). Free and Open-Source Software (FOSS) just happens to satisfy all of our computing needs — it’s nice to not have to pay (and pay, and pay… </mini-rant>. Really.) for software and support. And our next cellphone upgrade will no doubt be an Android-based unit… but I digress…
I’ve been professionally involved with technology and IT since I graduated from college in 1974 — over 40 years. Yup, I’ve lived through and participated in a lot of computer history. Up until around 2006, my career centered around what are (sometimes disparagingly) called “proprietary systems” or “big iron” — computer technology provided to business and industry by such heavy-hitter companies as DEC (Digital Equipment Corporation, acquired by Compaq in 1998, which in turn merged with Hewlett-Packard in 2002) and IBM.
In those decades, computer systems were backed by “field service contracts,” usually provided through the vendor companies themselves, which provided trained technical staff to both install systems and fix them when hardware components failed. As a senior software engineer and manager, and later as a business owner, I was responsible for systems administration of engineering and software development computers — and thus for the procurement and management of field service contracts worth several tens-of-thousands of dollars per year.
As a result, I was in a great position to observe and interact with some great field service technicians, and together we got to fix some neat and challenging hardware problems and failures — usually under duress and time-crunches, with multiple system users breathing down our necks to get the system back online: “Is it up yet?”
How is this relevant to fixing a PC? Well, computer hardware architecture hasn’t really changed all that much over the decades: Modern PCs, and even laptops, are internally quite similar to the “big iron” systems that have gone before. Although individual components — CPU, memory, hard disks, etc. — have been miniaturized and commoditized, the principles of design, organization, interconnection and implementation are essentially the same. Thus, what I learned by observing and participating in fixing big iron systems largely translates into useful techniques and approaches for solving PC hardware problems today. Sometimes not so obviously… I might have to think and puzzle over how to translate a past approach to a problem today, but invariably, what worked then is applicable today.
Okay, enough background… Let’s get down to brass tacks.
What’s Important in a System Recovery and Rebuild
Before getting into the narrative of how I recovered a PC system from a hard disk (or other hardware component) failure, let’s consider some basic assumptions, issues and observations:
- Data — it’s what’s most important on your system: The primary… no, the essential… purpose of any computer system (big iron or personal computer) is to store, preserve and manage a (typically very large) set or repository of personal and/or business data. It’s all about the data!… Your data — not the hardware, not the application programs, not the operating system — is the only thing that’s important. Once it’s lost, it cannot be re-conjured out of thin air. Repeat this over and over as a personal mantra. Once you “get it,” figure out how to take all steps necessary to protect and preserve your data at any and all costs.
- Data volume: Increasingly, the size (amount, storage requirements) of your own personal and/or business data vastly overshadows the storage requirements of your operating system and all applications software combined. Whether it’s photographs, music files, databases, documents, your whole multi-megabyte archive of carefully stashed and tagged emails, even your meticulously-tweaked application configuration files (even more megabytes tucked away in “hidden” files and folders), and/or whatever else, it’s not uncommon today to have personal and business data volumes (size) of several tens- or hundreds-of-gigabytes, per system (PC or laptop), compared with a measly few hundred-megabytes of operating system and applications files combined — the ratio is easily over 100-to-1, and it’s only going to grow. Presuming that you want it all back after your hard drive dies spectacularly, this data volume imposes considerations and requirements for data backups that did not exist even a couple of years ago… and just trying to copy your files to CD-R (maximum capacity of about 750MB) or to DVD+R (max of about 4.4GB) is simply unfeasible, unmanageable… impossible.
- Backups: Once you “get it” that your data is the only thing you have to worry about — and that you’ve got a huge amount to cope with — then all you’ve got to do is figure out how to design, procure and implement the best multi-resource backup scheme that you can afford for your household and/or business. Multi-resource?… This just means “Don’t put all your eggs in one basket”… have more than one backup resource available to you, as a disaster situation can benefit from, or even demand, data recovery from more than one resource. What this usually boils down to is a decision to do backups both locally (near-at-hand storage, selectively) and remotely (in-the-cloud backups as a service you can buy; e.g., CrashPlan, Mozy, Carbonite, Dropbox, etc., for all of your gigabytes). I’ll mention and recommend my own personal favorites in both local and remote backup resources and approaches later in this series.
- The OS is not important: What I’m documenting here in this series is not operating system dependent… and it’s certainly not Linux-centric. In fact, the considerations and approach herein are applicable not only to Linux (all flavors, relatives and distros), but to Windows and even to Macs. And although Microsoft’s current OS/software licensing requirements can impose practical difficulties and challenges on system restorations (binding the OS too tightly to a particular hardware configuration, erroneously presuming that nothing will ever change or fail), everything here applies at least in principle to Win and Mac systems — but it’s certainly easier with Linux! Specifically, the easiest part of a PC rebuild is (or should be) the re-installation of the operating system and your preferred applications software.
- Hardware is the cheap commodity: In the “good old days” of big iron, we’d spend thousands, or even tens-of-thousands, of dollars for field service contracts, including replacement components and the technical expertise to diagnose and repair systems. Fortunately, that’s all in the past — today, most failed system/PC components can be replaced or repaired for tens-of-bucks (per instance). In particular, a failed hard disk can be replaced for around $80-to-$120, and usually the replacement disk is much larger than the one it’s replacing… It’s often feasible to upgrade your HD-storage capacity, say from 250GB to 2TB, as a byproduct of the hardware repair, and often for less money than the original disk cost you!
- System Logbook: From decades of managing big-iron systems, I’ve learned — indeed, internalized — the critical importance of maintaining an external, written System Logbook… And yes, it’s important that this be “outside” of your PC. Something like a college composition notebook (cheaply available at Staples, Office Depot, etc.) works great. In it, record all relevant system events, with a date/time stamp: The system’s initial hardware and software configuration, important events (what happened), what you installed (new or replacement hardware; software products/packages including names and versions), what seems to be operating funky or degrading, what actually failed, what you’ve fixed, and the next/current configurations. Your Logbook becomes an invaluable resource especially when you’re doing a system rebuild, and it may be your only source for recalling the fiddly details of your system’s overall state and configuration. The only time that this Logbook becomes a waste of your time is if your house burns down (which has happened to hundreds of homeowners in the Western U.S. this past summer)… In that extreme case, you’ll be rebuilding systems (and your life) from your own personal memory. In all other cases, you’ll be glad you’ve created a written record of your systems’ configurations close at hand — your Logbook.
- You can do it yourself (mostly): At the scale of your SOHO (small business office, home office or home) installation, you… yes, you!… can do a full system recovery, including hardware component replacement repairs, all software reinstallation, and personal/business data restoration, yourself (much to the dismay of certain predatory “home PC repair” outfits), iff (that’s if and only if): a) You are reasonably technically competent and prepared, and possess a decent understanding of basic hardware and software troubleshooting, and you enjoy solving technical problems (still tinker under the hood of your car? fix the lawnmower? install your own stereo hi-fi system? …fixing your own PC is likely as easy, if not easier); b) You plan and prepare ahead of the certain equipment disaster that will occur when you least expect or need it; c) When it does occur, you sit down to plan (in writing) your own sequence of recovery steps in light of your detailed understanding of what’s failed and needs repair… essentially, a checklist; d) You are methodical and careful in executing your recovery plan, and you document your steps as you go (so that you’re even better prepared for the next disaster that looms in the future). If you actually feel under-prepared for all of this (especially regarding point a above), you can always find competent and reasonably priced help — for example, ask your geek-oriented grandchild for help (“geek” is no longer a pejorative adjective). In other words, there are a whole variety of self-help or inexpensive things you can do before hiring $50/hour (and up) “professional services” to restore your PC to working state.
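To put a number on the “Data volume” point above, here is a minimal back-of-the-envelope sketch. The 200GB data set is a purely hypothetical figure chosen for illustration, the disc capacities are just the round numbers quoted above, and the helper name `discs_needed` is made up for this example. It ignores filesystem overhead, so real-world counts would be a little higher.

```python
# Rough disc counts for a full backup, using the round capacities quoted above.
# All sizes are in megabytes (decimal), and filesystem overhead is ignored.
CD_R_MB = 750        # "about 750MB" per CD-R
DVD_R_MB = 4_400     # "about 4.4GB" per DVD+R


def discs_needed(data_mb: int, disc_mb: int) -> int:
    """Whole discs required to hold data_mb (integer ceiling division)."""
    return (data_mb + disc_mb - 1) // disc_mb


if __name__ == "__main__":
    data_mb = 200_000  # hypothetical 200GB of personal data
    print("CD-Rs needed: ", discs_needed(data_mb, CD_R_MB))    # 267
    print("DVD+Rs needed:", discs_needed(data_mb, DVD_R_MB))   # 46
```

Hundreds of CDs, or dozens of DVDs, for a single backup pass: that is what “unmanageable” looks like in practice.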
Okay, the above is a lot of words — let’s summarize:
- You can do this yourself: you can recover, restore and rebuild your own PC.
- It’s all about your data.
- Backups are essential — and must be made (and verified… and practice-restored) before you need them to rebuild and recover from a disaster.
- Hardware is cheap — the rule-of-thumb is to budget to replace a failed component (hard drive, memory, etc.) when (not if) it fails, and to replace it with something bigger (more gigabytes).
- The OS and the application software are the easy part — you can always restore these onto new hardware from distribution media (CDROMs, online distros, etc.). Only rarely should you include OS or application programs in your own backups, and then only as a last resort if original distribution media is no longer available.
- Keep a System Logbook. It’s invaluable in a recovery/rebuild situation. And it must be off-line, manual, external and separate from your systems.
- Remember: When a PC/system component (a disk drive, RAM, CPU, or a functional component on the motherboard) fails, the goal is to simply replace the failed component — this does not mean replacing the entire PC! (Sorry, Dell, HP, Apple, et al… I know that this advice screws up your entire marketing plan.)
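On the “verified” part of the backups point, one possible way to check a local backup copy is to compare checksums, file by file, against the originals. The following is only a sketch under stated assumptions: it presumes the backup is a plain directory tree mirroring the source (not an archive or a cloud service), and the function names `sha256_of` and `verify_backup` are hypothetical, not part of any backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1MB chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source: Path, backup: Path) -> list[str]:
    """Return a list of problems: files missing from, or different in, the backup."""
    problems = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue  # skip directories and other non-regular entries
        rel = src.relative_to(source)
        copy = backup / rel
        if not copy.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(copy):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

Calling `verify_backup(Path("/home/me"), Path("/mnt/backup/me"))` (both paths hypothetical) returns an empty list when every source file has an identical copy. Note that this only checks one direction, and it is no substitute for an actual practice-restore.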
Next post: The blow-by-blow narrative of how I rebuilt my primary Linux PC…