Neat! I'm a bit confused as to why the author is using tokio though if they're doing blocking I/O; unless I'm missing something surely a simple threadpool would be better?
tokio::tasks effectively gives him a threadpool. I'm not sure he could have gotten any faster by writing his own multiplexing thread pool implementation or using a different crate that implemented one.
I agree with the sibling comment that rayon (and work stealing/fork-join) might be a more efficient choice.
As it stands, each thread must operate on a semaphore resource to limit concurrency. If concurrency were limited directly via the number of threads, each thread could perform operations without trampling on the others. (I'm not sure if this is the intent, but the program in the post might create an unbounded number of tasks/threads, since the semaphore is only scoped to a single invocation of calculate_size and isn't used in recursive calls.) In addition, each call to spawn_blocking appears to involve inter-thread communication. Work stealing would allow each thread to create and consume work without communication.
I like the sharded hash-set approach, but I think the implementation may have some inefficiencies. First, the shard is chosen by inode % SHARD_COUNT. Inodes tend to be sequential, so while this choice of shard will distribute requests across multiple hash sets, it will also increase contention, since a single directory will have its inodes distributed across every hash set. I wonder if (inode >> K) % SHARD_COUNT might therefore lead to better performance for some values of K, especially if each thread batched its requests to each hash set.
In addition, tokio::sync::Mutex is quite a large structure! parking_lot::Mutex is, by comparison, only a single byte. Avoiding tokio and using threads would allow the use of the smaller Mutex, which would keep each shard to a single cache line. (You might still lose that space to padding on machines that require 128-byte cache padding to prevent false sharing, but at least those bytes wouldn't be accessed.)
Regardless, I thought the article was neat! And TIL about getattrlistbulk
Wow! That's really neat! I'm glad the feedback was helpful.
Looking through the most recent source of the program a few thoughts:
Each access of SEEN_INODES requires an atomic operation (separate from the mutex operation), because OnceLock uses an atomic internally to coordinate between threads. As a result, it's going to be more efficient to say let seen_inodes = &*SEEN_INODES; in your code, and then access the state through that reference.
Now that a single thread is less likely to be accessing multiple HashSets, I think it should be possible to avoid having one lock/unlock operation per directory. Instead of invoking check_and_add_inode(), consider something like:
let mut all_inodes: Vec<(u64, u64)> = ...; // (inode, file size)
// Sorting the inodes will keep shards together, so it'll reduce the number of
// locking operations we need.
// TODO: If we sort by shard, then _every_ thread will always try to access the
// lowest shard first. It might increase performance if we use a different sort
// order per thread. For example, using shard_for_inode(inode) ^ PER_THREAD_VALUE
// or something like that as the sort key will keep each thread accessing the
// shards in a random order. While a _better_ hash function might shuffle the
// access order more evenly, I'm not sure that the perf gain would be high
// enough to justify something computationally intensive.
//
// One option (in part, because the Apple Silicon cache line size is 128 bytes)
// would be to have a per-thread [u8; SHARD_COUNT] (though, as with the note
// about SEEN_INODES, you might want to avoid going through a thread_local
// variable for each access). On initialization, it could be filled with 0..128
// and then shuffled. Then the sort key could be:
// thread_local_permutation[shard_for_inode(inode)] as usize
all_inodes.sort_unstable_by_key(|&(inode, _)| shard_for_inode(inode));
let mut lock = None; // stores Option<(shard index, mutex guard for shard)>
let mut size_acc = 0;
for (inode, size) in all_inodes {
    let inode_shard = shard_for_inode(inode);
    // If the lock from the previous iteration was for a different shard, drop it.
    if lock.as_ref().is_some_and(|&(lock_shard, _)| lock_shard != inode_shard) {
        lock = None;
    }
    // If the lock from the previous iteration was for the same shard, reuse it!
    let (_, shard_lock) =
        lock.get_or_insert_with(|| (inode_shard, SEEN_INODES[inode_shard].lock()));
    if shard_lock.insert(inode) {
        size_acc += size;
    }
}
// Drop this after the loop is over to avoid holding the lock longer than needed
std::mem::drop(lock);
Instead of passing strings around, it should be more efficient to use byte-slices. Every conversion from a byte-slice to a String involves checking that the byte-slice is valid UTF-8, and I don't think that should be necessary for most of the filesystem manipulations.
Path::join needs to allocate memory for the full joined path for each subdirectory. It should be possible to eliminate that by using openat() instead of open(). (This may be faster for the kernel to process as well.) openat() takes a file descriptor of a base directory to reference. So, rather than doing open("/foo"); open("/foo/bar"); with openat(), it could be let foo = openat(AT_FDCWD, "/foo"); let bar = openat(foo, "bar").
You might want to experiment with using different memory allocators instead of the system allocator. I frequently use mimalloc.
Each call to get_dir_info() ends up allocating memory several times. Consider having get_dir_info() instead take a &mut DirInfo which get_dir_info() can .clear() before using it. The caller can then keep reusing the allocations from a single DirInfo, instead of the vectors being freed and re-allocated for each iteration.
get_dir_info() is also allocating Strings for each child entry that it encounters. It's possible to avoid this allocation, too. Instead of copying the contents of the names into a fresh String, DirInfo's subdirs field could be a Vec<&CStr> where it references C-strings that live directly in attrbuf. Actually implementing this will likely cause some gnarly lifetime issues, but removing this allocation as well should, at least in principle, be possible.
Lastly, instead of invoking libc::__error(), there's a handy Rust function: std::io::Error::last_os_error() that should be easier to use.
(I hope this isn't too much feedback, and that you find this useful :) )
Nice article, I love a nice performance optimization!
Since this post prominently features Go profiling on macOS, I want to warn that ITIMER_PROF, the timer that backs Go's CPU profiles, is unfortunately broken on macOS in a way that can overcount time spent in system calls.
I'm not sure if this application is affected; we expect most of the time to be spent in system calls anyway. Plus, the final results simply use end-to-end wall time, so those are valid. This is just something that unfortunately has to be kept in mind when profiling on macOS.
Thanks @eta and @YogurtGuy for the feedback here! I was able to increase performance by ~28% by implementing your suggestions. I've written a follow-up post based on this feedback: https://healeycodes.com/optimizing-my-disk-usage-program
Awesome! Thanks for more ideas here :)
rayon maybe. But tokio is a great choice.
Dumac and Jalad at Tenagra.