How to Scan Directories, Fast: the Tricks of aio_scandir
In this article, I want to delve into IO::AIO's aio_scandir function and describe the tricks it uses to scan directories fast (and why it can be used to scan directory hierarchies even faster). These optimisations have nothing to do with Perl and could be used in implementations in any language.
Now, what do I mean by that? First, scanning a directory is easy: you opendir it, readdir all the names, and you are done. This is quite fast and hard to optimise. Things get a lot more interesting if you want to find out which entries in a directory are subdirectories, and which aren't.
The basic solution in Perl looks something like this - read all names, lstat them, and see which ones are directories:
   opendir my $dirfh, $path
      or die "$path: $!";

   local $_;
   my (@dirs, @nondirs);

   while (readdir $dirfh) {
      /^\.\.?$/ and next; # skip . and ..

      lstat "$path/$_"
         or next;

      if (-d _) {
         push @dirs, $_;
      } else {
         push @nondirs, $_;
      }
   }
This works, but for large directories, this is significantly slower than it could be. So let's see what aio_scandir does about this.
Trick 1: use aio_readdirx
aio_scandir is (currently) written in Perl, but to get the list of names in the directory, it uses aio_readdirx, which returns all names in the directory, and possibly more. aio_readdirx (which is implemented in the lower-level C library libeio) supports a number of flags:
- READDIR_DENTS: Instead of returning just names, return the name, the type, and the inode number for each entry.
- READDIR_DIRS_FIRST: Sort the list so the likely directories come first.
- READDIR_STAT_ORDER: Sort the list so that calling lstat on each entry in order will likely be optimal.
- READDIR_FOUND_UNKNOWN: This is a result flag, and is set when there was at least one entry of unknown type.
The description of aio_readdirx in the IO::AIO manpage gives a much more verbose explanation of these flags.
So how are these implemented, and how are they useful?
Trick 1a: inode numbers
Basically all readdir implementations return an inode number (which is not returned by perl's readdir). In most UNIX filesystems, each file (and directory) has a unique inode number, and often, this inode number is a block number where the stat information can be found, or is a good proxy for the position of this information. So if you sort by inode number and stat in that order, you are likely to get a faster response than by stat'ing in other orders.
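To illustrate, here is the effect of READDIR_STAT_ORDER hand-rolled in Perl - a sketch only, as the real sorting happens inside libeio:

   use IO::AIO;

   my $path = "/etc";

   aio_readdirx $path, IO::AIO::READDIR_DENTS, sub {
      my ($entries) = @_;

      # issue the lstat calls in ascending inode order, which is
      # likely close to the on-disk order of the stat information
      for my $entry (sort { $a->[2] <=> $b->[2] } @$entries) {
         my $name = $entry->[0];

         aio_lstat "$path/$name", sub {
            $_[0] # entry vanished or is inaccessible - skip it
               or printf "%-24s %s\n", $name, -d _ ? "dir" : "non-dir";
         };
      }
   };

   IO::AIO::flush;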
Trick 1b: type fields
Many (but not all) readdir implementations return an additional field, the d_type field, which basically returns the type directly. If it is available, its usefulness depends on the filesystem: many Linux filesystems fully support it. My current favourite, XFS, basically doesn't - it sets DT_DIR on . and .., and gives everything else DT_UNKNOWN, which basically means "go figure!".
Detecting whether type fields are available is very hard - it's probably easiest to write an autoconf test. libeio itself relies on _XOPEN_SOURCE, which, of course, is very unreliable on BSDs, so it uses some additional guessing.
A minor optimisation involving type fields is the READDIR_FOUND_UNKNOWN flag - if it isn't returned by aio_readdirx, then aio_scandir knows that every entry has a known type, so the subdirectories can be found with certainty by a quick grep, without any stat calls.
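As a sketch (this is not how aio_scandir is actually written), that fast path could look like this:

   use IO::AIO;

   aio_readdirx "/etc", IO::AIO::READDIR_DENTS, sub {
      my ($entries, $flags) = @_;

      if ($flags & IO::AIO::READDIR_FOUND_UNKNOWN) {
         print "at least one DT_UNKNOWN entry - stat calls needed\n";
      } else {
         # every entry has a known type - a simple grep finds all subdirectories
         my @dirs = map $_->[0], grep $_->[1] == IO::AIO::DT_DIR, @$entries;
         print "subdirectories: @dirs\n";
      }
   };

   IO::AIO::flush;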
Trick 1c: sorting and guessing
From the flags and the directory information, aio_readdirx then compiles a sort order, which is used to sort the names into a hopefully good ordering. The sort algorithm used is a variant of MSD radix sort, which is stable, easy to implement, and works very well on the often very non-random data returned by readdir.
Type and inode sorting are straightforward, but in the absence of type information, which names are likely directories? Here we need a heuristic, which, in the case of libeio, works as follows:

- If a name starts with a dot, it's a likely directory (that might not be true, but dot entries are usually rare, and usually have directories among them).
- Otherwise, if the name does not contain a dot, it is a likely directory, and shorter names are more likely to be directories.
- Otherwise, it's unlikely to be a directory.

This works in common cases, as directories often do not have suffixes, while files often have one (blabla.txt), and other naming patterns tend to be rare.
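Re-implemented in Perl, the heuristic could look roughly like this (a simplification - libeio does this in C, and dir_likelihood is my name, not libeio's):

   # score a name: higher means "more likely a directory"
   sub dir_likelihood {
      my ($name) = @_;

      return 2 if $name =~ /^\./; # dot entry: likely
      return 1 if $name !~ /\./;  # no suffix: likely
      return 0;                   # has a suffix: unlikely
   }

   my @names  = qw(.git src lib blabla.txt index.html README);
   my @sorted = sort {
      dir_likelihood ($b) <=> dir_likelihood ($a) # likely directories first
         or length $a <=> length $b               # then shorter names first
   } @names;

   print "@sorted\n"; # .git src lib README blabla.txt index.html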
Trick 2: directory link count
This is a common trick (also used in many find implementations), and consists of looking at the link count of the directory. On UNIX filesystems, every directory (usually with the exception of /) has one hardlink for its . entry, one hardlink for its name entry in the parent directory, and one hardlink for the parent directory (..) entry in each of its subdirectories.

Or in other words, the number of subdirectories in a directory is usually its number of hardlinks minus 2. For example, for the /etc directory on the box where I am writing this article:
   /etc# stat .
     Device: fc0ch/64524d   Inode: 24117249   Links: 351
   /etc# ls -la | grep ^d | wc -l
   351
The ls -la lists 351 directories (including . and ..), so there are 349 subdirectories. And 349 + 2 equals the link count of 351.
That means if you have a full directory with a link count of 2...
   # stat .
     Device: 2eh/46d   Inode: 10379009716   Links: 2
   # ls | wc -l
   19457
... then we already know there are no subdirectories here, and can potentially skip 19457 stat calls.
This trick explains why we want to sort likely directories first - if aio_scandir has found n subdirectories and the link count is n+2, it can immediately stop searching for more subdirs. The savings can be substantial if you have a directory with a million JPG files and a single .thumbnail subdirectory that gets stat'ed first.
What about non-POSIX filesystems? These tend to have link counts of 0 or 1, which can be detected rather portably. And since the trick is commonly used, it is bound to be supported by most kernels.
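Here is the link count trick on its own, as a plain-Perl sketch without any of the other optimisations:

   my $path = shift // ".";

   # link count minus 2 is (usually) the number of subdirectories;
   # values below 0 mean a non-POSIX filesystem, disabling the trick
   my $expected = (lstat $path)[3] - 2;

   opendir my $dirfh, $path or die "$path: $!";

   my (@dirs, @nondirs);

   for my $name (readdir $dirfh) {
      next if $name =~ /^\.\.?$/;

      if ($expected == 0) {
         push @nondirs, $name; # no subdirectories left - skip the lstat
      } elsif (lstat "$path/$name" and -d _) {
         push @dirs, $name;
         $expected-- if $expected > 0;
      } else {
         push @nondirs, $name;
      }
   }

   printf "%d dirs, %d nondirs\n", scalar @dirs, scalar @nondirs;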
Trick 3: asynchronous lstat
With the sorted list and the number of expected subdirectories, aio_scandir now needs to lstat the entries. Doing these calls concurrently has drastic advantages: doing at least one lstat call concurrently with the scanning function means that any necessary processing can overlap with the lstat call, and thus often hides in the disk I/O latency.
Doing more in parallel often allows the operating system to find more efficient paths for the read/write head (see http://en.wikipedia.org/wiki/Elevator_algorithm). And while modern SSDs laugh at this, command queueing (e.g. NCQ) still benefits them (and conventional harddisks as well).
The concurrency is a configurable parameter to aio_scandir; the default is to do up to four asynchronous lstat calls in parallel.
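IO::AIO makes this kind of bounded concurrency easy with request groups - a sketch (the limit of four mirrors the default, but this is not aio_scandir's actual code):

   use IO::AIO;

   my $path = "/etc";

   opendir my $dirfh, $path or die "$path: $!";
   my @names = grep !/^\.\.?$/, readdir $dirfh;

   my (@dirs, @nondirs);

   # a request group that keeps up to four lstat requests in flight
   my $grp = aio_group sub {
      printf "%d dirs, %d nondirs\n", scalar @dirs, scalar @nondirs;
   };

   $grp->limit (4);

   $grp->feed (sub {
      my $name = shift @names;
      defined $name or return; # nothing left to feed

      $grp->add (aio_lstat "$path/$name", sub {
         return if $_[0]; # entry vanished - ignore it
         push @{ -d _ ? \@dirs : \@nondirs }, $name;
      });
   });

   IO::AIO::flush;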
Trick 4: Don't lstat the path/entry itself, stat path/entry/. instead!
This trick is rather more obscure, but easy to understand: assuming the directory entry is a directory, stat'ing the . entry inside it gives us either success, in which case the entry at least points to a directory, or an error if it doesn't.
The "trick" becomes a trick because most filesystems, including ones that do not directly store type information in directories (or refuse to return this information) can take shortcuts when trying to resolve a path. That means by trying to see "inside" the entry, the filesystem can error out earlier, often without actually retrieving the stat
information.
Naturally, there are trade-offs and complications: for directories with a lot of symlinks to other directories (which are not subdirectories), this has to do a lot more work, namely resolving the symlink and stat'ing the resulting path. So for each directory we find this way, we still need to lstat the entry itself to see whether it really is a directory.
Fortunately, directories are usually a lot less common than files - even in hierarchies that contain relatively many directories, it's usually just hundreds of them, compared to the thousands or millions of files in big directories, which are where speed counts most.
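A sketch of the probe (hypothetical code; the path and entry name are just examples):

   use IO::AIO;

   my ($path, $name) = ("/etc", "ssl");

   # probe inside the entry first - this can fail early on non-directories
   aio_stat "$path/$name/.", sub {
      if ($_[0]) {
         print "$name: not a directory\n";
      } else {
         # it points to a directory - but is it one itself, or a symlink?
         aio_lstat "$path/$name", sub {
            print -d _ ? "$name: real subdirectory\n"
                       : "$name: symlink to a directory\n";
         };
      }
   };

   IO::AIO::flush;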
Trick 5: fstatat vs. lstat
A minor trick, also not accessible from pure Perl, is to use fstatat instead of lstat: unless you have the luxury of being able to chdir into the target directory, you normally have to call lstat with the full path. This doesn't usually hit the disk, but for thousands or millions of calls, the overhead of resolving the path each time is visible.
Both libeio and IO::AIO abstract the xxxat() interface away using their working directory abstraction, so this is easy to do.
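In IO::AIO this looks as follows - resolve the directory once with aio_wd, then pass [$wd, $name] pairs instead of full paths:

   use IO::AIO;

   aio_wd "/etc", sub {
      my ($wd) = @_;
      $wd or die "/etc: $!";

      # this lstat can now use fstatat & co. relative to $wd,
      # instead of resolving the full path again for every call
      aio_lstat [$wd, "passwd"], sub {
         $_[0] and die "passwd: $!";
         print -d _ ? "directory\n" : "not a directory\n";
      };
   };

   IO::AIO::flush;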
Correctness hits. Correctness hits. Correctness hits. You die.
There are, naturally, some things to watch out for - the whole scanning operation (starting with readdir) is not atomic, so entries can appear, vanish and change at any time during the process.
While nothing can be done about this (the result can always be outdated), aio_scandir can at least try to make some basic guarantees in the presence of activity that changes the directory.
One such guarantee is that, in the list of files returned, there really are no directories, unless a file was replaced by a directory during scanning. As basic as this sounds, applying the above tricks alone will not guarantee it, as the appearance of a new directory might cause an existing one to be overlooked when the link count heuristic is used.
To protect against this, aio_scandir stat's the directory before it starts reading it, and again afterwards. If the link count, size, mtime or a few other values have changed, it falls back to the full scanning algorithm. In practice, you also need to consider an mtime that lies in the current second as changed, as many filesystems only store timestamps at full-second resolution.
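A simplified sketch of that check in plain Perl (aio_scandir does this with asynchronous stat requests, and the exact set of fields compared may differ):

   # snapshot the fields that signal change: dev, inode, nlink, size, mtime, ctime
   sub dir_snapshot {
      my @st = lstat $_[0]
         or return;
      @st[0, 1, 3, 7, 9, 10];
   }

   my $path   = "/etc";
   my @before = dir_snapshot $path;

   # ... readdir, sort and lstat the entries here ...

   my @after = dir_snapshot $path;

   if ("@before" ne "@after" or $after[4] >= int time) {
      # the directory changed under us (or its mtime lies in the
      # current second and cannot be trusted)
      warn "directory changed during scan - using the slow path\n";
   }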
Other optimisations
The above is everything I can remember that aio_scandir takes advantage of (if you know of another technique to speed up scanning, tell me). However, aio_scandir is usually used in the context of directory hierarchy scanning, that is, recursively scanning all directories inside a directory.
In that case, more optimisations could be done - the same sort-by-inode trick used for stat'ing inside a directory can be applied globally, that is, it can be used to scan many directories concurrently, deciding by inode number which directory is scanned next, and also ordering the stat calls of all these directory scans in an optimal way. If you are interested in these extra optimisations, consider them your homework assignment :)
Unscientific Marks on Benches
How much improvement are we actually talking about, compared to a good find implementation? Well, the savings can be substantial - let's look at two cases. First, a 20TB XFS filesystem (itself a hard case, as it gives no useful d_type information) on an 8-harddisk RAID5 with about 20 million inodes.
On this filesystem, a full scan with an empty cache using GNU find takes 1h19m (find /wd -printf "" >/dev/null). The treescan utility that comes with IO::AIO, which naively uses aio_scandir, scans the same filesystem in 50 minutes - that's half an hour less, which is extremely helpful if you sit in front of it and wait for results.
Second case: let's consider SSDs. While scanning a typical SSD is hardly worth any optimisation, what about using an SSD as dm-cache for the same RAID... In the same setup as above, but with a 40GB dm-cache for the filesystem that happens to store all the stat information, aio_scandir still wins over GNU find in terms of absolute speed:
A full scan with an empty cache using GNU find takes 64.5s, while the same scan with treescan needs only 15.8s - over four times faster. This is almost completely due to saved stat calls.
Where treescan loses out is CPU usage - in the dm-cache case, GNU find uses 3.5s of CPU time, a tiny fraction of the absolute time, while treescan uses 17.3s, which is more than the absolute time due to the extra threads involved.