[sf-lug] filesystem for a 3TB external USB drive

Ian Sidle ian at iansidle.com
Mon Jan 2 15:49:13 PST 2012


>      With respect to power supplies, I once encountered a 
>  drive that had suffered a power supply that failed then 
>  recovered then failed then recovered... multiple times in 
>  a few seconds. The lost+found directories for the file 
>  systems were bizarre. I could not recover the data 
>  (someone more expert than I managed to recover with the 
>  help of exotic tools). 

Indeed, I've had similar incidents myself. I've also seen a few cases where the power supply failed and spiked the voltage, which then damaged the controller on the hard disk and prevented it from spinning.

From the forums I have read about people using ZFS and its error-correcting capability, it becomes rather apparent when there are hardware problems (bad disk controller, bad memory, etc.), because ZFS is able to detect the data inconsistency, whereas traditionally those errors were silently processed and saved back to disk.
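
To make the checksumming idea concrete, here's a tiny Python sketch. It is not how ZFS actually works under the hood (as I understand it, ZFS keeps per-block checksums up in the block-pointer tree); it's just the general concept of verifying a stored checksum on every read instead of trusting whatever comes back from the disk:

    import hashlib

    # Toy end-to-end checksumming: keep a checksum next to each block
    # when writing, and verify it on every read.

    def write_block(data):
        return data, hashlib.sha256(data).hexdigest()

    def read_block(stored):
        data, checksum = stored
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("checksum mismatch: block was silently corrupted")
        return data

    block = write_block(b"important accounting data")

    # Simulate one flipped bit somewhere between RAM and the platter.
    corrupted = (bytes([block[0][0] ^ 0x01]) + block[0][1:], block[1])

    read_block(block)        # returns the data
    read_block(corrupted)    # raises IOError instead of handing back bad data

A filesystem without that check would happily return the corrupted bytes, which is exactly the "silently processed" case above.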

Ironically, this in a way increases the hardware requirements to use ZFS "properly", because you want ECC memory, which generally means a server/workstation system rather than a mere "desktop" without ECC support. Otherwise, there is the possibility that an error sneaks into the RAM, gets passed to disk, and then a parity error is detected when the information is pulled back up.

The primary reason for ECC is background radiation (especially alpha particles), which has some probability of hitting a memory chip in just the right way to flip a bit of information from 1 to 0 (or vice versa), which could crash your computer (or screw up your data) *if* it happens to land in just the right spot at the right time.
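
For the curious, the way ECC pulls that off is a Hamming-style code: a few extra check bits per word let the memory controller pinpoint and silently fix a single flipped bit. Here's a toy Python version of the classic Hamming(7,4) scheme; real ECC DIMMs use a wider SECDED code over each 64-bit word, but the principle is the same:

    # Hamming(7,4): 4 data bits protected by 3 parity bits.

    def encode(d):                       # d = [d1, d2, d3, d4]
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]    # positions 1..7

    def decode(c):                       # c = 7-bit codeword, possibly damaged
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = s1 + 2 * s2 + 4 * s3       # 0 means no error detected
        if pos:
            c[pos - 1] ^= 1              # flip the offending bit back
        return [c[2], c[4], c[5], c[6]]

    word = encode([1, 0, 1, 1])
    word[5] ^= 1                         # a stray particle flips one bit
    print(decode(word))                  # -> [1, 0, 1, 1], corrected transparently

Plain parity (one extra bit) can only tell you *that* something flipped; the extra check bits are what make the correction possible.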

Sadly, there isn't much information in published research studies, so one is left with small-scale studies and marketing speak at best, plus a large number of forum postings based on personal experience. The big study most people reference is the Google study [1], which makes some interesting observations and opened some eyes, but the numbers were not collected in a "laboratory" environment, so the academics don't trust them very much.

The "experts" say the probability of a bit error happening in RAM is somewhere around once a month to once an hour, depending on the volume of ram, the environment it is in and the person who is giving the statistic (and how much $$ they are likely to get from sales).  I suspect reality is somewhere between "it happens more often then most people realize" but less then "We can't trust any data on any computer to ever be accurate". 

I've always wondered why ECC never became the default, since I don't think it costs that much more to add (it is VERY common in embedded devices), but there wasn't much demand for it in personal computers, so manufacturers never got much volume out of it.

Amusingly, Microsoft at one point even DEMANDED that all computers bearing the "Vista Compatible" logo use ECC memory, as they insisted memory errors were the leading cause of blue screens. [IMHO, I could believe /some/ were caused by bad RAM, but I don't think it bears all of the blame...] In the end, most of the manufacturers told M$ to screw it by shipping non-Vista-certified boxes running Vista anyway, and Microsoft eventually backed down.

However, now that 8 and 16 GB of RAM are becoming the norm for desktop computers, we might hit the wall and have to add ECC memory, because the memory cells on the RAM chips are so small now that background radiation/alpha particles might flip bits much more frequently than they did in the past. Similar claims have been made a few times before, though, and somehow manufacturers keep finding a way to work around it for a little bit longer.

The bottom line is: if it's a server (or at least holds mission-critical data), then you want to go out of your way to make sure you have good components and ECC memory. Thankfully, most brand-name production servers (and some "workstations") have had ECC for years, though you always ended up paying much more for it; it's worth it IMHO. If you're running a Counter-Strike server, then data corruption isn't a big deal, but if this is where you keep your accounting information, having even one number be off could be a big problem.

thanks,
Ian

[1] http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf


