ZFS is a very interesting filesystem. I never actually used it, just made a few experiments with it so far. Mainly, because there is only sort of experimental support on Linux, no support on Mac OS X at the moment (as far as I read, the project which aimed to do this was frozen, but maybe zfs-fuse can be ported to macfuse), and also absolutely no support on Windows (at least until somebody finally writes a proper winfuse-binding or ports it to Dokan).
Still, the first-choice-possibility of accessing a ZFS-Pool through any other OS than Solaris seems to be just installing OpenSolaris on a VM, and just forwarding the desired blockdevices. I think this is more secure and not really slower than using FUSE, at least it is more portable at the moment.
The reason why I am interested in ZFS is that it serves a few purposes I really like. To understand why, one has to know my general behaviour of managing my files.
Actually, mostly times of having too much free time tend to alternate with times with almost no free time. While having no free time, I am just working with my computers. I dont explicitly manage my files – I just save them somewhere I can find them again. And sometimes I need some old file from some old archive or directory, which is on an external disk, thus, I just copy these files (or extract the archive) to my current harddisk. And leave them there.
When getting more free time again (and less free space on my disk) I tend to “tidy up” my filesystem. But then, often I changed some old files. Or lost the overview over them. Or simply want to set up my system from scratch because there is a lot of crap running around on it. Mostly then, therefore, I just copy the whole home-directory (and maybe others) onto my external disk – thinking “setting up my system is more important, I can tidy up my files later”
… Now guess what happens …
Of course, I have whole system-backups from years ago, even some from the times when I used Windows. And sometimes I have System-Backups of systems which contain copies of System-Backups. Sorting them would take a lot of time. Sometimes I grub through the old files like an old photo album. I dont want to change these files. I dont want to delete these files. And actually, I am much too lazy to sort them.
So of course, I have the need of more and more space. This is no problem. But also, since so many files have duplicates, the need for space increases exponentially. Well, there are tools like fdupes. But fdupes takes a long time to index the files, and when I change a file afterwards (accidentally, etc.), this affects all other files. And fdupes works only on systems with symlinks or hardlinks. And fdupes cannot shorten files which are only partially the same.
On the other hand, there are a lot of well-engineered backup-tools like rsync with a lot of additional features, and in every production-environment, I would recommend a strict backup-policy basing on these, anyway. But at home, I have a lot of old computers running – sometimes just for experiments. I have no production system. I just have a creative chaos – and I actually want to keep this creative chaos. At the time of desktop-virtualisation, when it is no problem to run three operating systems on one single cpu-core at once, at the time of ramdisks, at the time of WebGL, I simply dont want to manage my files manually, when the filesystem just could deduplicate equal blocks, so I could have hundreds of copies of the same system-snapshot without really having to waste a lot of space.
And of course, I want to just be able to add another external disk to my pool, so I can save files on both of the disks without having to copy them, but – if possible – have anything on both disks when attatching them (or at least being able to just remove one of the disks when all the others are attatched). As far as I know, this can be done with ZFS. And a lot of other weird stuff. The only problem is that there appears not to be any good GUI for it, and no good Desktop-Integration either. And – well – it is only really supported by Solaris. The FUSE-Modules will first have to prove that they are really stable.
So – well, ZFS seems to fit my “wishes”, just would have to port it to Mac OS X and Windows (which is sure a lot of work but shouldnt be too hard either). But well, ZFS is a complicated system. And complicated systems can have a lot of complicated bugs. And the filesystem is the thing anything is saved on. That is, the whole computer can be damaged – as long as your filesystem is ok, you wont lose your data. On the other hand, a damaged filesystem can make it slightly impossible (especially when you encrypted it, for example) to restore any of your data, even though the rest of your computer (including your harddrive which just did what the buggy kernel module told it) works totally well.
Modern filesystems duplicate the necessary parts to restore data on the disk and have policies which make it less likely that data gets damaged. This is a good thing, but obviously, this doesnt help against bugs in your filesystem driver. What helps – at least to a certain degree – would be a formal verification of the sourcecode. I can think of several approaches to gain this.
The easiest way should be to ignore the hardware-problems of data integrity, and thus assuming that the disk one writes on is always correct. Then, the main thing one would want to have is some functions read : name → blob → blob and write : name → blob → blob → blob, such that (write a b (write c d e)) = (write c d (write a b e)) ⇐ a≠b, i.e. it doesnt matter in what order different files are written, (read a (write a b c)) = c, i.e. reading a file that was written before returns the same value that was written before, and (write a b (write a d e)) = (write a b e), i.e. a file’s content is the last content written into it. Maybe there should also be some defined behaviour when a file doesnt exist – i.e. instead of a blob, use some object from boole×blob, encoding whether the reading was successfull.
Of course, maybe we also want to verify that if some hardware fails, at least the probability of the data being read correctly is maximal – or at least doesnt fall below some certain probability. This seems a lot more complicated to axiomatize I think. While having an abstract “blob” above, I think you will need to specify what a “blob” is more exactly, that is, define blob as ptr→byte (whatever ptr and byte are exactly). And then, maybe defining an “insecure blob” iblob=ptr→byte×probability. You would have to specify the probability of success of reading a byte correctly for your disk – a task which must be left to the engineers building the disks and busses – and for example in a raid-system, you can prove that your way of reading and checksumming the data maximizes the probability of correct data. On the other hand, I assume its comparably complicated to calculate this probability – i.e. when you save a block of data with a checksum twice on the same disk, the probability that both of them get corrupted by a damaged sector on the disk could be smaller than the probability that one of them gets corrupted (but needs not, since they could be written at a physically near place on the disk such that the source of corruption affects both of them), but the probability that the data gets lost because of a headcrash doesnt decrease, since both of them will be affected by this – while this probability that the same data saved on different disks gets lost should be smaller.
After axiomatizing all of that stuff, one of course has to implement a filesystem that has all those features like deduplicating, snapshotting, adding additional disks, etc. – and verify it.
As far as I see, this would be a lot of work to do. But on the other hand, filesystems are not something new. There should be a lot of experience with data integrity and backups. With all this experience, shouldnt it be possible to produce such a thing? I mean, its the filesystem – sort of the most essential thing. If it fails, anything else will fail.