David Bakin’s programming blog.


Filesystem as Database

Are there circumstances where you can use a filesystem as a database, and if so, what are the implementation considerations, including pros and cons?

Suppose you have a large set of data which is all big blobs.

  • For example, the blocks of a blockchain.

And suppose you have one (or more) unique primary key(s) for these blobs.

  • E.g., block hash. And (if you’re storing the main chain only1) block height.

In that case you can consider using a file system as your database: put the blobs in files, and use directories as your indexes.

(The following is written in outline form for the first draft, and needs filling out as prose with examples and especially pointers to code. Also some performance numbers (e.g., space, time to traverse, etc.). Take it as a draft, please.)

How:

  • Directory structure - full path - supports indexing.

  • Hard links allow multiple indexes (but loses unique key == full path, so you might need to be able to identify all primary keys from content of the blob/file).

  • You might quickly run into a files per directory performance limitation for various tools (or even the file system itself); solve by splitting the key into segments and forming a hierarchy from that. (IOW, the directory structure is a radix tree with the blobs as leaves.)

    • E.g., 202b314072e33f75ab2a8d8e00762204bd3bfa01a436791fa919944a638a55e8 to 2/02b/314/072/e33/f75/ab2/a8d/8e0/076/220/4bd/3bf/a01/a43/679/1fa/919/944/a63/8a5/5e8.tx, with a maximum of 4096 entries per directory (splitting a hex number into more “natural” 4 character segments gives 65536 entries per directory, max, which is pushing it a bit for some tools).
  • For write-once databases make sure to set the read-only attribute on each file/blob.

  • Write simple scripts (perhaps as simple as a tool pipeline) to do various kinds of verification on the database/directory tree.

    • Confirm simple consistency by making sure the number of files in each tree is the same and that each file has exactly as many hard links to it as indexes you’ve created.
    • Confirm each blob is what it says it is by traversing the directory tree along each index, extracting from each blob the key corresponding to that index, and validating that it matches its path.
      • In case of multiple indexes you might have to find out if it is better to walk one tree, opening each blob to get the corresponding key, and then after that do the other indexes (each blob would then be opened and inspected as many times as there are indexes - though maybe not all of the blob needs to be read!) , or, traverse one tree, out of the blob get all the index keys, and then do directory searches on the other indexes.
        • Latter is likely to be much better because the total size of the directories for all indexes is likely to be much smaller than the total size of all the blobs (or even the size of the part of all the blobs that is read to get the keys) and so is much better for the OS filesystem cache.

Pros:

  • Easy to process with ordinary tools - unix utilities (especially in pipelines), shell/python/ruby scripts.
  • Easy to process programmatically as every language has file system traversal, including file path existence check, and read/write of files built-in, so its familiar to everyone. Also (depending on your use case) every language has libraries for doing all kind of typical things with files. And, finally, it is trivial to expose a file system (especially read-only) to remote computers either as simply a file share or as a web service. (Though, in that case, performance suffers quite a bit for doing any kind of bulk operation.)
    • For more extensive/sophisticated work: The filesystem is designed for concurrency (not atomicity of the blobs themselves, nor of their relation to other blobs, but of the indexes) and this works for read-write databases as well as read-only - though of course, better for read-only. Several programs can be writing different rows simultaneously, and several more reading arbitrarily. In addition, it is easy to set up “watchers” so utility programs can get notified when a blob has been added and so can process database row additions immediately. Finally, the blobs are timestamped, and some uses of them may want that.
  • As robust as the underlying file system, plus can gain some amount of additional “protection” by setting read-only attribute on files (especially for write-once databases)
  • Easy backup as a file system (with or without snapshots, depending on volatility of your data, using timestamps if necessary) because you can incrementally back up, smaller chunks to back up, and individual data times treated atomically - use ordinary tools, including archivers (e.g., tar), and easy-to-write scripts, as opposed to needing special software to handle the fact that a database file can be extremely large and the contents are opaque to anything but a specially written backup program.,
  • You can use whatever format you’d like for the blobs, so it is possible to pick something easy to process with simple tools/utilities/scripts.
    • In fact, can have “raw” format (e.g., binary blocks) side-by-side with “interpreted” format (e.g. json).
  • Add additional “columns” of metadata by adding files side-by-side with different extensions.
    • Metadata can be “bunched” e.g., a json file with a bunch of fields, or maybe better (for use with command line tools) ini format or some other line-at-a-time key-value format.

Cons:

  • Need to prevent slowdowns due to implementation limits - typically, due to “too many” files/directory, or even too many files in total, due to the file system itself (less often these days) or due to flaws in common utilities you’d use.
  • Less flexible w.r.t. indexes
  • Very simple single table schemas only - doesn’t work well with “foreign keys” to other “tables” (unless everything is write-once).
  • Doesn’t have consistency support, nor atomic change support (at least not with standard file systems or utilities).
    • Nor atomic backup, if needed (but not likely needed with write-once databases).
  • Need a slight bit of care to handle backup/restore with hard links (for multiple indexes).
  • You may be easily able to support the operational and administrative overhead of a database and prefer that to handling a very full tree of files in a filesystem.
  • Database might be higher performance for some use cases. Not for accessing a single blob (or a few) but for queries on keys (filesystem traversal to get paths, then assemble them back from hierarchical fragments to keys, might very well be slower than a database index scan).
  • If using more than one directory hierarchy, with hard links, for multiple indexes, index management on insert/delete is up-to-you and not managed by database constraints.

Not a concern:

  • Queries that require a full scan of blob contents - you’ll be limited by disk speed in either the filesystem case or the database case.

  • Space overhead due to “directories” not an issue since the individual “rows” (files) are large (and anyway a database index takes space too).

Possible variant implementation:

Let it be directories all the way down to the end of the primary key. Leaf directories (the most specific, representing the entire primary key) then contains an arbitrary collection of files for that key. Presumably several large blobs. Subsequent indexes also go all the way down and their leaf directories also contain hard-links to the set of files for that item. (By definition: the same set.)

  • In this alternative the number of directories explodes, since each blob (set) in the database has its own individual directory. In addition to the space taken in the filesystem for those directories (perhaps no longer trivial compared to the data) the number of directories may start to strain some tools, or even the operating system (as they’ll need to be read during a traversal and at least occupy space in the disk cache for some small time). That still may be worth paying for some use cases, but generally I think that’s where you abandon this for a different scheme.

  1. Instances of two (or more!) blocks being mined simultaneously are rare these days on Bitcoin, and don’t last long if they happen. (In fact, Bitcoin hasn’t seen one since June 2017.) On the other hand, the longest forks of this nature have been due to bugs in released Bitcoin Core software. Which could be discovered and/or exploited at any time. So we can consider this a method for storage of well-confirmed blocks only, where you can decide what well-confirmed means: special-case the top … 10? 50? blocks and you’ll be fine. ↩︎