In preparation for the end of this phase of my life and the start of the next, I've been digging through the files on the computer I mothballed several years ago when I bought my present one - looking at organizing them and archiving so that my old projects won't be lost. I've found that the hard drive on that machine has developed some bad blocks, and that reminded me of an idea I once had for an holographic filesystem. I've also been painfully reminded that I used to do a lot of cool things and I haven't done very many lately, so that it's time to start some interesting new projects, and this might be a good start for one.
The conventional wisdom is that if a drive has user-visible bad blocks, that means that it has run out of its internal spares (with which it would normally remap bad blocks invisibly to the user) and that means it's on its last legs and you should discard it. That's great. Good idea! You didn't care about actually keeping your data, and you can easily afford to drop a few hundred bucks on a moment's notice, right? Even with backups it sucks to throw out a drive you have in current use; but it's worse if that drive (or CD-R or DVD-R or Heaven forbid Jaz disc) was the backup. There's some need for a filesystem that works on devices that not only have bad blocks, but develop bad blocks between when the data goes onto the device and when you try to take it off.
So here's a fantasy: Let's say I have a 4G device. Maybe it's a DVD-RAM disc. I format it and write 1G of data onto it. Behind the scenes the filesystem is making copies of the data in a RAID-like way, so that each block of my data is actually written to four blocks on the disc, in different locations. If a block goes bad, the filesystem will know, and use one of the other copies. If I write another gig of data onto the same disc it'll automatically cannibalize some of the backups, so that then I have two copies of 2G of data.
There's a hard limit when I have 4G of data on the disc and there's no redundancy left. But as I delete files, their space is reclaimed to make more backups of the remaining files so the safety margin retuns. The bottom line is that all the blocks on the disc are used to store all the data, with some amount of redundancy determined by how much data you actually store. If you only run the discs at a fraction of their capacity, then you can be solidly protected from the type of deterioration that causes individual bad blocks; you only have to worry about the type of deterioration that causes the entire disc or drive to go.
The main use I see for something like this would be backup and archive discs. I already do something like it manually, both disc to disc and within a disc. When I make a backup set I normally burn a set of CD-R or DVD-R discs that contain selected fractions of my desktop computer's file tree, and really important files will be included on several different discs so that there's less chance of losing one.
Within a single disc, if as often happens the data I want to put on the disc amounts to much less than the total capacity, I'll often make a couple of subdirectories and repeat the files in each one to fill up the space. If I'm burning a disc anyway I might as well fill it up because the blank space is already paid for, and that way if a few blocks go bad, I have a better chance of still recovering all the files.
Having special filesystem support would make it a lot easier to do that kind of thing - at the cost, of course that anyone reading the discs has to have support for the special filesystem. Lack of support may be an issue worth considering when the discs are going into long-term storage. I'd want to include a few discs in each set in a more popular format containing the source code for the software.
Maybe people would want to use it for general read-write purposes too. I imagine that all the copying and error checking and correction steps would consume a fair bit of CPU power, but that might be worth it for extra reliability. It might be interesting for people who put filesystems on flash (or maybe not - you don't want to be multiplying writes and wearing out the bits) or other weird devices. It might be interesting for embedded systems that have to be exposed to radiation or other adverse conditions. Note that this isn't the same thing as RAID (which I also use). RAID protects you from losing a whole device. I'm interested in protection from losing a few blocks on the device - which is the failure mode I see in practice with burned optical media, and sometimes occurs with old hard drives as well. Again, this doesn't solve the problem for which people tell you you ought to make backups and archives; this is for use on the device that is the backup or archive.
Technical stuff starts here. It seems like the easiest way to play with this idea would be to build a userspace filesystem for Linux. I think the software people use for that these days is the one called FUSE; I know others do or have existed but that one seems to be built into the kernel now. If it works well then it would be appropriate to think about making it into a standalone module for better performance. Like other filesystems it would take requests from the kernel and talk to a block device, or a loopback file in the VFS, to store the actual data. There might also be room for a mkisofs-like program to create an image you can burn onto read-only media like a CD-R. In that case it might be able to omit some data structures (I'm thinking block bitmap) that are only used during writes.
The system would be designed as two layers: one that provides a holographic block device, and one that implements a filesystem on top of it. The "implements a filesystem" layer could be closely modelled on something like ext2 (or ext3, or whatever) but you could't just make the holographic system be a real block device driver instead of a filesystem, because it needs to know when blocks start and stop being used by the upper layer, and standard filesystems don't convey that information to the block device driver. The holographic system essentially replaces the block allocation part of the upper layer filesystem. Another reason to have a custom-written upper layer would be to conveniently provide a /proc-like magic directory (probably optional with a mount option) for poking at the internals.
The holographic layer's main job is to deal with reliable and unreliable blocks. The underlying block device provides some number of unreliable blocks. The upper layer sees some number of reliable blocks - but not all of them necessarily exist at any given moment. And there's a several to several relationship between reliable blocks and unreliable blocks. It could be, for instance, that reliable blocks 1001, 1002, and 1003 are holographically stored across unreliable blocks 2003, 4005, 6007, 8009, and 9753. Three reliable blocks onto five reliable blocks. If you lose an unreliable block then it becomes three onto four, and maybe you go looking for a new unreliable block to replace it and boost your redundancy.
Actually doing the storage is pretty easy. There are algorithms for recovering from a lost block and so on; it's like doing RAID with very small drives. You maybe have to write to all five unreliable blocks every time you update any of the three reliable blocks, or read more than one of the unreliable blocks when you read one of the reliable blocks, but caching can help a lot. The holographic layer also has to keep track of the mapping from reliable to unreliable blocks, including updating it as unreliable blocks are found bad and trashed, and reliable blocks come in and out of existence.
Of course, if the holographic layer loses track of which reliable blocks map onto which unreliable blocks, then the whole thing comes crashing down around our ears, so to be safe we'll be sure to store all the mapping data in reliable blocks. Then whenever you want to look up any reliable block, you first look up the reliable block containing the mapping data for it.
I was once a TA for CS 354, the operating systems course, at the University of Waterloo, and that's the sort of system my students used to design. If you can see what's wrong with it, then you're ahead of most of them.
The problem is that we need a way to bootstrap. In order to read the data structure that stores the mapping, we need to have already read the data structure that stores the mapping. Fortunately, there's a way around this problem: we can do it by induction, which my students also have trouble with, but let's assume you're smarter. We will require that the mapping for any given reliable block must be stored in some lower-numbered reliable block. Then the only problem is determining the mapping for the first block, for which we sacrifice the user to Shub-Internet and examine the entrails. The rest follows naturally; as we read forward through the blocks we always know where the next block is even if we don't know where the last one is yet. That may make it sound like a linked list, but it isn't; since each block can give mapping data for many more, it should instead be a high-degree tree.
Let's talk about what that mapping data looks like. As I envision it, we'd have some reliable blocks that are mapping blocks, and each one specifies the mapping for the entire reliable-block space, which is probably 64 bits. Pretty good for a table that might be 4K in size. The thing is, each entry in the table specifies a range of block numbers (actually the start of a range of block numbers) and either a pointer to a mapping block covering that range, or else the actual mapping data. Actual mapping data would consist of a list of groups of unreliable blocks, the mapping from reliable blocks within the range to slots within the group, and whatever other related data needs to be kept.
So the first reliable block contains actual mapping data for the first few more blocks, and then the rest of the mapping is delegated to mapping blocks within that range. Those mapping blocks each contain actual mapping data for a few more reliable blocks, and then point both down into their own children to break it down further, and back up towards the root for blocks outside their own range. The general idea is that much like DNS servers, by looking at any mapping block you either get the mapping you want or a referral to another mapping block closer (in the sense of fewer links) to the one you want. We can build this kind of structure and have it actually work, while also obeying the "mapping for every block is in a lower-numbered block" rule, because we know in advance how many reliable blocks there can be - just as many as there are unreliable blocks in the underlying device.
Alternative: instead of having pointers, we could use math. Reliable block numbers are at our mercy. So if we know that a reliable block can store mappings for let's say 16 other reliable blocks, then we say, fine, block 0 stores the mappings for blocks 1 through 16, blocks 1 through 16 store the mappings for blocks 17 through 272, which store the mappings for blocks 273 through 4368, and so on. The progression would bottom out when we had enough mappings for all the rest of the blocks on the device, at which point we'd have reliable blocks for storing actual data.
Shub-Internet cometh in the form of a hash function. The thing is that we still need to find the mapping for the first block. We do that by hardcoding it, essentially. There is a fixed mapping for the first reliable block, something like "reliable block 0 maps to unreliable blocks 1234, 5678, and 9876". That's close to what existing filesystems do with "superblock backups." Actually, there is something cleverer we can do that allows for essentially unlimited backups. Instead of hardcoding a few locations for superblock backups, we hardcode a linear congruential hash function that describes how to look at unreliable blocks one at a time in a way that looks random but will visit each one once without repeating until it has visited them all. That gives us a long list of "places to look." For friendliness with other systems we might as well make the first one be physical block zero, so that images of our filesystem will have a consistent magic number, though we expect to still be able to work with partitions on which the first physical block has been trashed. Then we checksum the Hell out of the superblock and write it on say the first seven places to look. When we want to read the superblock, we look at places to look until we find a valid superblock.
Suppose one of the seven places-to-look goes bad. Then we write a new copy in the eighth place to look, or the ninth, or whatever until we find one that is good. Suppose the eighth place-to-look is good, but already occupied. Then, this is the clever bit: we say "Oh, good gracious me, this unreliable block has just failed!" We tell the holographic layer not to use that unreliable block anymore... and then we write our superblock backup there.
I mentioned knowing whether an unreliable block is good or bad. It seems we also need to know whether each one is occupied or not. So there should be a bitmap (two bits per block) recording that information, and that bitmap should be stored in reliable blocks somewhere too. Presumably it would be worked into a contiguous range of block numbers once enough mappings had been defined to accomodate it comfortably. On a read-only filesystem it could be omitted because we'd have no use for the data anyway.
The upper-layer filesystem could be pretty close to an existing ordinary filesystem, and ideally we could choose a good one and use it almost unmodified. The main difference would be that instead of the upper layer's native method of keeping track of which blocks are in use, we would use the one provided by the holographic layer. The upper layer has to tell the holographic layer when it wants a new free reliable block and when it's done with one. Whatever mechanism it uses for providing resiliance against bad blocks can also be removed because the holographic layer handles that now. When a reliable block becomes free the holographic layer marks it as such and that allows its group to switch to a higher-redundancy mode (fewer reliable blocks on the same unreliable blocks). If that means too much redundancy in the group, or if there are no reliable blocks left in the group, then the associated unreliable blocks can be marked free.
The holographic layer can keep track of how many groups exist in each mode: so many one for one, so many two for one, so many five for three, and so on. Losing or gaining blocks changes those statistics, and if they deviate far from the desired redundancy level, the holographic layer can start freeing or allocating blocks to change individual groups to chase the desired overall redundancy level. It might target keeping some constant fraction of unreliable blocks free to allow flexibility and quick replacement of failed blocks, but mostly it would assign unused ones wherever there's least redundancy to keep as many unreliable blocks in use as possible.
Does this already exist? Is it worth building?
Matthew Skala from 216.75.183.126 at Wed, 19 Mar 2008 12:39:20 +0000:
I thought I answered that specifically in the above, but to both questions the answer is that this is designed to protect against failure of individual blocks on a disc whereas RAID is designed to protect against failure of entire drives all at once. In the intended application - which is not all situations - single-block failure is a better model of what the discs actually do. Notwithstanding that my most recent reminder of the idea involved a hard drive, this is primarily intended for things like DVD-R discs, not for hard drives; though I imagine that some people would run it on hard drives too just because of human perversity.
The intended application is archives, backups, and removeable media used for transferring files from one system to another or across long time gaps. In that application it is at best difficult and sometimes impossible to replicate data across more than one disc. Whole discs can fail all at once, but that's not common. Hard drives do it but DVD-R discs generally don't. Much more likely is that a few blocks would go bad while the rest of the disc remains readable. It would really suck to have to throw out the entire disc - or even one file - because of a single failed block. Since it is also the case that we often have less data to put on a given disc than the maximum the disc can hold, and these discs are write-once, ordinary filesystems will leave some of the space blank, and that space is wasted. It would be nice to use the extra space for something worthwhile.
Cernael san from 82.209.147.48 at Wed, 19 Mar 2008 19:14:37 +0000:
Another thing that might be a benefit (though a very minor one, considering the intended application) is the automatic overwriting of deleted stuff; what's gone does not linger.
On a side note, I learnt a new word today; at first, I thought "contiguous" was a typo...
Deekoo from 85.211.69.66 at Tue, 25 Mar 2008 20:09:53 +0000:
I was musing over a similar project a while back, but at this point I can't remember whether I'd seen something similar under construction or just was busy cursing CDs full of bad sectors.
I'm not sure if read/write is going to be worth the hassle - read/write redundancy will increase complexity a lot, slow filesystem writes noticeably, and possibly speed disc failure. (And, if my data's important enough to slow down disc access to preserve it, I won't be eager to put said data on an experimental filesystem that risks munging it). And, for performance reasons, I wouldn't want to try to combine redundancy and journaling or redundancy and atime updates at all. For a write-once approach, a specially crafted ext2 filesystem should be easy enough to do and have the added advantage that the custom filesystem driver wouldn't always be needed to read it - you could use the existing alternate superblocks, and, in most failure conditions, pointing at an alternate superblock would do to read it.
One obvious limitation - while, at the logical level, a CD is simply a linear array of sectors, at the physical level some of the sectors are more important than others - physical damage that at the end or middle of the disc will translate into part of a file being inaccessible can render the entire disc inaccessible if it happens at the very beginning.
As for deciding what to give more redundancy to when (as will almost always happen) the space on the disc is not an integer multiple of filesystem size - Superblocks are obviously the number one priority. Directories are the next, and more important the higher in the hierarchy they are - I'd probably go for a full copy of the root directory per superblock, then try to duplicate the rest of the directory structure (exception: empty directories don't need redundancy much, but I think ext2 may already optimize them out of having their own block.), then worry about files. If it's being done at the driver level, ranking files probably shouldn't take anything into account other than file size and other inode data and (particularly when fewer than two copies are made) favouring completeness over having large numbers of partial backups of individual files; if the filesystem's being made by a utility along the lines of mkisofs, then having it try to deduce a file's importance and resilience could be worthwhile.
Matthew Skala from 129.97.79.144 at Tue, 25 Mar 2008 21:26:19 +0000:
See the followup posting: http://ansuz.sooke.bc.ca/software/holofs-simplified.php
I actually have a first cut of the add-redundancy utility written now. Writing a recovery utility is next; that's going to have to include some kind of error-simulating option so that I can test it in a controllable way. Then, although it'll still be kind of ugly, I plan to post it for people to play with.
One important factor is that a redundant block doesn't have to back up just one data block. You can use parity, like with RAID. If k data blocks get mapped onto one parity block, then you can still recover any one of them if you lose exactly one of them. So you can meaningfully back up the entire disc while adding less than 100% redundancy. You can combine that with mapping each data block onto multiple parity blocks - so that if one parity block is unlucky and happens to be covering two lost data blocks, each of those data blocks may still be recovered from other parity blocks. It's possible to calculate the optimal tradeoff between these two things based on other assumptions, and that's what my current utility does.
The next step after that is to use what are called fountain codes - where different parity blocks back up different combinations of data blocks and you can recover things by combining multiple parity blocks. Those offer excellent performance (for instance, start with 100 data blocks, add 20, recover all 100 from *any* 105 of the resulting blocks) but they're complicated and often patented.
After looking at the coding theory stuff I don't think it's a good idea to get into complicated issues of backing up different parts of the filesystem to different amounts. You will guess wrong; you get better performance from the codes by treating the whole disc as a unit; and I think even in the read-write case you may not win much. Also, note that it's quite reasonable to expect *absolutely no* data loss (from reasonable levels of block-by-block errors) even with just a little bit of redundancy; the codes work that well; so then it's not so useful to get into which files are more important. All my files are important, dammit.
Luke Somers from 165.123.69.39 at Wed, 17 Sep 2008 20:04:32 +0000:
My solution would avoid a lot of this mess simply be to apply to the whole disk the procedure to handle the medium-scale corruption that CDs already use.
You have N blocks of data. Consider the N-degree polynomial that has the first block as its value at x = 0, then the second block as its value at x = 1, and so on. Make sure that somewhere in that data is something that indicates when to stop so that no matter what block follows the actual data it won't be interpreted as a continuation of the intended data.
Then, write in each physical block the value of that polynomial evaluated at the appropriate physical block number. If a block is known to be bad, just skip it.
And there you go. If you can identify bad blocks at read time, then there is no loss of redundancy.
You simply solve for the polynomial, and then read the blocks off by evaluating it in order, until you reach the end of your data (you will know the end of the data, as required above).
If you cannot identify bad blocks at read time just by looking at them, then things get trickier. But as long as the number of bad blocks is less than half of the number of blocks left after the straight up data write, it can all be worked out.
The especially nice part about this system is that it represents all of the data as itself, with the redundancy bit coming afterwards. The only time you need to change old data in response to a change is to compact all data toward the beginning of the drive if you remove something without filling in the space with something else.
Do not enter a fake email address. If you don't want to provide one, just leave it blank. Comments with fake email addresses will be deleted.
This form is for posting public comments to be read by other people who visit this Web site. If you have a software support question, or other material directed to the page author instead of to the general public, please send email instead.
All the data you enter, and your IP address, will be saved and displayed. Don't enter secret information. HTML is not accepted; it will be displayed as plain text. Your comment will only be added if you enter valid data in all required fields; if it isn't, use the back button and try again.
I, and I alone, reserve the right to remove postings for any reason.
Nicholas from 24.69.97.6 at Wed, 19 Mar 2008 06:43:07 +0000:
Two questions:
a) What distinguishes this from RAID?
b) Why is this better than RAID? Holographic filesystems would seem to solve the problem of data failure, but not the problems of data failure over the entire disk.