Neat! I'm a bit confused as to why the author is using tokio though if they're doing blocking I/O; unless I'm missing something surely a simple threadpool would be better?
tokio::tasks effectively gives him a threadpool. I'm not sure he could have gotten any faster by writing his own multiplexing thread pool implementation or using a different crate that implemented one.
I agree with the sibling comment that rayon (and work stealing/fork-join) might be a more efficient choice.
As it stands, each thread must operate on a semaphore resource to limit concurrency. If concurrency were limited directly via the number of threads, each thread could perform operations without trampling on the others. (I'm not sure if this is the intent, but the program in the post might create an unbounded number of tasks/threads, since the semaphore is only scoped to a single invocation of calculate_size and isn't used in recursive calls.) In addition, each call to spawn_blocking appears to involve inter-thread communication. Work stealing would allow each thread to create and consume work without communication.
I like the sharded hash-set approach, but I think the implementation may have some inefficiencies. First, the shard is chosen by inode % SHARD_COUNT. Inodes tend to be sequential, so while this choice of shard will distribute requests across multiple hash sets, it will also increase contention, since a single directory will have its inodes distributed across every hash set. I wonder if (inode >> K) % SHARD_COUNT might therefore lead to better performance for some values of K, especially if each thread batched its requests to each hash set.
In addition, tokio::sync::Mutex is quite a large structure! parking_lot::Mutex is, by comparison, only a single byte. Avoiding tokio and using threads would allow the use of the smaller Mutex, which would keep each shard to a single cache line. (You might still lose that space to padding on machines that require 128-byte cache padding to prevent false sharing, but at least those bytes wouldn't be accessed.)
Regardless, I thought the article was neat! And TIL about getattrlistbulk
Wow! That's really neat! I'm glad the feedback was helpful.
Looking through the most recent source of the program a few thoughts:
Each access of SEEN_INODES requires an atomic operation (separate from the mutex operation), because OnceLock uses an atomic internally to coordinate between threads. As a result, it's going to be more efficient to say let seen_inodes = &*SEEN_INODES; in your code, and then access the state through that reference.
Now that a single thread is less likely to be accessing multiple HashSets, I think it should be possible to avoid having one lock/unlock operation per directory. Instead of invoking check_and_add_inode(), consider something like:
let mut all_inodes: Vec<(u64, u64)> = ...; // (inode, file size)
// Sorting the inodes will keep shards together, so it'll reduce the number of
// locking operations we need.
// TODO: If we sort by shard, then _every_ thread will always try to access the
// lowest shard first. It might increase performance if we use a different sort
// order per thread. For example, using shard_for_inode(inode) ^ PER_THREAD_VALUE
// or something like that as the sort key will keep each thread accessing the
// shards in a random order. While a _better_ hash function might shuffle the
// access order more evenly, I'm not sure that the perf gain would be high
// enough to justify something computationally intensive.
//
// One option (in part, because the Apple Silicon cache line size is 128 bytes)
// would be to have a per-thread [u8; SHARD_COUNT] (though, as with the note
// about SEEN_INODES, you might want to avoid going through a thread_local
// variable for each access). On initialization, it could be filled with 0..128
// and then shuffled. Then the sort key could be:
// thread_local_permutation[shard_for_inode(inode)] as usize
all_inodes.sort_unstable_by_key(|&(inode, _)| shard_for_inode(inode));
let mut lock = None; // stores Option<(shard index, mutex guard for shard)>
let mut size_acc = 0;
for (inode, size) in all_inodes {
    let inode_shard = shard_for_inode(inode);
    // If the lock from the previous iteration was for a different shard, drop it.
    if lock.as_ref().is_some_and(|&(lock_shard, _)| lock_shard != inode_shard) {
        lock = None;
    }
    // If the lock from the previous iteration was for the same shard, reuse it!
    let (_, shard_lock) =
        lock.get_or_insert_with(|| (inode_shard, SEEN_INODES[inode_shard].lock()));
    if shard_lock.insert(inode) {
        size_acc += size;
    }
}
// Drop this after the loop is over to avoid holding the lock longer than needed
std::mem::drop(lock);
Instead of passing strings around, it should be more efficient to use byte-slices. Every conversion from a byte-slice to a String involves checking that the byte-slice is valid UTF-8, and I don't think that should be necessary for most of the filesystem manipulations.
Path::join needs to allocate memory for the full joined path for each subdirectory. It should be possible to eliminate that by using openat() instead of open(). (This may be faster for the kernel to process as well.) openat() takes a file descriptor of a base directory to reference. So, rather than doing open("/foo"); open("/foo/bar"); with openat(), it could be let foo = openat(AT_FDCWD, "/foo"); let bar = openat(foo, "bar").
You might want to experiment with using different memory allocators instead of the system allocator. I frequently use mimalloc.
Each call to get_dir_info() ends up allocating memory several times. Consider having get_dir_info() instead take a &mut DirInfo which get_dir_info() can .clear() before using it. The caller can then keep reusing the allocations from a single DirInfo, instead of the vectors being freed and re-allocated for each iteration.
get_dir_info() is also allocating Strings for each child entry that it encounters. It's possible to avoid this allocation, too. Instead of copying the contents of the names into a fresh String, DirInfo's subdirs field could be a Vec<&CStr> where it references C-strings that live directly in attrbuf. Actually implementing this will likely cause some gnarly lifetime issues, but removing this allocation as well should, at least in principle, be possible.
Lastly, instead of invoking libc::__error(), there's a handy Rust function: std::io::Error::last_os_error() that should be easier to use.
(I hope this isn't too much feedback, and that you find this useful :) )
Nice article, I love a nice performance optimization!
Since this post prominently features Go profiling on macOS, I want to warn that ITIMER_PROF, the timer that backs Go's CPU profiles, is unfortunately broken on macOS in a way that can overcount time spent in system calls.
I'm not sure if this application is affected; we expect most of the time to be spent in system calls anyway. Plus, the final results simply use end-to-end wall time, so those are valid. This is just something that unfortunately has to be kept in mind when profiling on macOS.
Thanks @eta and @YogurtGuy for the feedback here! I was able to increase performance by ~28% by implementing your suggestions. I've written a follow-up post based on this feedback: https://healeycodes.com/optimizing-my-disk-usage-program
Awesome! Thanks for more ideas here :)
rayon maybe. But tokio is a great choice.
Dumac and Jalad at Tenagra.