GitHub - gizemdal/CUDA-Flocking: Optimized flocking simulation based on the Reynolds Boids algorithm with uniform grid with semi-coherent memory access

CUDA Flocking Simulation

Boids simulation running in Coherent Uniform Grid mode - 50000 particles (FPS capped to 35)

Name: Gizem Dal
- LinkedIn, personal website

Project Summary

This is an optimized flocking simulation based on the Reynolds Boids algorithm, accelerated with a uniform grid and semi-coherent memory access. The simulation also includes toggleable naive and simple uniform grid modes for performance comparison.

Implementation Thoughts

Increasing the number of boids slows down all three modes (naive, uniform grid, coherent uniform grid), while decreasing it improves performance. In every simulation step, we find the neighbors of every single boid and update velocities and positions. In naive mode the cost grows quickly, since the naive method iterates over all the boids to find the neighbors of each boid, so the work per step is O(N²). Increasing the boid count hurts the uniform grid and coherent uniform grid modes far less, because they only iterate over "likely neighboring" boids within a grid range.
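To make the difference concrete, here is a minimal CPU-side sketch in C++ (not the project's CUDA kernels). The names `Vec3`, `countNeighborsNaive`, and `countNeighborsGrid` are hypothetical, and the grid variant assumes the cell width is at least the search radius and always visits the 3x3x3 block of cells around a boid.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical position type

// Squared distance between two points (avoids a sqrt per comparison).
static float distSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Naive mode: every boid is tested against every other boid, O(N^2) checks.
std::vector<int> countNeighborsNaive(const std::vector<Vec3>& pos, float radius) {
    std::vector<int> counts(pos.size(), 0);
    const float r2 = radius * radius;
    for (std::size_t i = 0; i < pos.size(); ++i)
        for (std::size_t j = 0; j < pos.size(); ++j)
            if (i != j && distSquared(pos[i], pos[j]) < r2) ++counts[i];
    return counts;
}

using Cell = std::array<int, 3>;

// Integer grid coordinates of the cell containing p.
static Cell cellOf(const Vec3& p, float cellWidth) {
    return { static_cast<int>(std::floor(p.x / cellWidth)),
             static_cast<int>(std::floor(p.y / cellWidth)),
             static_cast<int>(std::floor(p.z / cellWidth)) };
}

// Grid mode: bin boids by cell, then test only boids in the surrounding cells.
std::vector<int> countNeighborsGrid(const std::vector<Vec3>& pos,
                                    float radius, float cellWidth) {
    assert(cellWidth >= radius);  // otherwise a neighbor could be two cells away
    std::map<Cell, std::vector<std::size_t>> grid;
    for (std::size_t i = 0; i < pos.size(); ++i)
        grid[cellOf(pos[i], cellWidth)].push_back(i);

    std::vector<int> counts(pos.size(), 0);
    const float r2 = radius * radius;
    for (std::size_t i = 0; i < pos.size(); ++i) {
        const Cell c = cellOf(pos[i], cellWidth);
        for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dz = -1; dz <= 1; ++dz) {
            const auto it = grid.find(Cell{c[0] + dx, c[1] + dy, c[2] + dz});
            if (it == grid.end()) continue;  // empty cell, nothing to test
            for (std::size_t j : it->second)
                if (i != j && distSquared(pos[i], pos[j]) < r2) ++counts[i];
        }
    }
    return counts;
}
```

Both functions return the same counts; the grid version simply skips boids that are too far away to matter, which is where the savings come from as N grows.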

I observed an improvement of about 20-30 FPS with the coherent uniform grid. I believe this is because we reshuffle the position and velocity buffers so that the boid data is stored in sorted order, which lets the device read it sequentially instead of fetching data scattered across memory.
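The reshuffling idea can be sketched on the CPU as follows. This is an illustration only: it uses `std::stable_sort` rather than whatever GPU sort the project uses, and `sortBoidsByCell`, `gatherByOrder`, and the `cellOfBoid` input are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns boid indices ordered by the grid cell index of each boid.
std::vector<std::size_t> sortBoidsByCell(const std::vector<int>& cellOfBoid) {
    std::vector<std::size_t> order(cellOfBoid.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return cellOfBoid[a] < cellOfBoid[b];
                     });
    return order;
}

// Copies src into a new buffer so that element i is the data of boid order[i].
// After this, boids that share a cell sit next to each other in memory.
template <typename T>
std::vector<T> gatherByOrder(const std::vector<T>& src,
                             const std::vector<std::size_t>& order) {
    assert(src.size() == order.size());
    std::vector<T> dst(src.size());
    for (std::size_t i = 0; i < order.size(); ++i) dst[i] = src[order[i]];
    return dst;
}
```

Applying `gatherByOrder` to both the position and the velocity buffer with the same `order` keeps the two buffers aligned, so a neighbor search can walk a contiguous range per cell instead of following an index indirection for every boid.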

Decreasing the cell width, so that more than 8 neighboring cells have to be checked, slows down the simulation. With the search radius held constant, a smaller cellWidth means we have to account for more neighboring cells, which increases the possible number of boids to test against our search radius. If we also increase the number of boids while keeping scene_scale the same, every cell has a higher chance of enclosing more boids than before. All of these conditions can affect performance when we check 27 versus 8 neighboring cells.
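The 8 versus 27 figures follow from how many cells the search sphere can overlap along each axis. A small sketch of that arithmetic, assuming axis-aligned cubic cells (the function names `cellsPerAxis` and `cellsToCheck` are hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Upper bound on the number of cells a search sphere can overlap along one axis.
// A span of length 2 * searchRadius overlaps at most ceil(span / cellWidth) + 1
// cells, depending on where it starts within a cell.
int cellsPerAxis(double searchRadius, double cellWidth) {
    assert(searchRadius > 0.0 && cellWidth > 0.0);
    return static_cast<int>(std::ceil(2.0 * searchRadius / cellWidth)) + 1;
}

// Total number of cells to visit in 3D: the per-axis bound cubed.
int cellsToCheck(double searchRadius, double cellWidth) {
    const int n = cellsPerAxis(searchRadius, cellWidth);
    return n * n * n;
}
```

With a cell width of twice the search radius this gives 2 cells per axis, so 8 cells in total; with a cell width equal to the search radius it gives 3 per axis, so 27.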

Performance Analysis

The performance of the simulation was tested on a Predator G3-571 machine with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz (2.81 GHz).

I placed CUDA event timers inside the step simulation functions for all three modes and recorded the time elapsed between the start and the end of the kernels that compute the new velocities and update the positions of all the particles. Here are the results I observed with N = 10000, blockSize = 128, VISUALIZATION = 1:

Mode	Naive	Naive Uniform	Semi-Coherent Uniform
Time (in ms)	15-16	2.7-3.4	2.6-3.2

