University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking
- Xuanyi Zhou
- Tested on: Windows 10, i7-9750H @ 2.60GHz 32GB, RTX 2060 6GB (Personal Laptop)
-
Q: For each implementation, how does changing the number of boids affect performance? Why do you think this is?
A: The more boids there are, the longer one time step takes to simulate. The naive implementation is O(n²), where n is the number of boids, since every boid checks every other boid (see the sketch below). For the grid-based implementations, when there are relatively few boids, the expected number of boids in any one boid's neighborhood is constant, so the running time is Θ(n) in this regime. However, as the number of boids increases, the running time converges towards O(n²) as well, because the simulation takes place in a finite volume and the density of each neighborhood grows with n. Although the computation happens on the GPU, parallelism can only reduce computation time by a constant factor once the number of boids exceeds the amount of available computation hardware.
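  To make the complexity argument concrete, here is a minimal sketch of the naive per-boid update, assuming one thread per boid; the kernel and buffer names are illustrative, not necessarily those used in this codebase, and the rule accumulation is elided:

```cuda
#include <glm/glm.hpp>

// Hedged sketch of the naive O(n^2) velocity update: each thread handles
// one boid and loops over all n boids, so total work is quadratic in n.
__global__ void kernUpdateVelocityBruteForce(int n, const glm::vec3 *pos,
                                             const glm::vec3 *vel1,
                                             glm::vec3 *vel2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  glm::vec3 dv(0.0f);
  for (int j = 0; j < n; ++j) {  // O(n) work per thread -> O(n^2) overall
    if (j == i) continue;
    // accumulate the cohesion / separation / alignment contributions
    // of boid j on boid i into dv (omitted for brevity)
  }
  vel2[i] = vel1[i] + dv;  // write to the second buffer (ping-pong scheme)
}
```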
-
Q: For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
A: From my profiling, increasing the block size reduces computation time. My hypothesis is that larger blocks reduce the amount of scheduling work the GPU has to do; since the block count is derived from the boid count (see the sketch below), larger blocks also mean fewer blocks. Another possible explanation would be that a block size smaller than the warp size (32 threads) leaves part of each warp idle, preventing full utilization of the computation hardware; however, this alone does not explain the profiling results, since there is still some improvement when increasing the block size from 32, which already fills a full warp, to 64.
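  For reference, in this kind of setup the block count is usually computed from the boid count and the chosen block size, so the two vary together. A minimal sketch, where `numObjects` and the `dev_*` device buffers are assumed names rather than this project's exact ones:

```cuda
// Hedged sketch of the launch configuration: the grid is sized so that
// ceil(numObjects / blockSize) blocks cover all boids. A larger block
// size therefore also means fewer blocks for the scheduler to manage.
const int numObjects = 100000;  // e.g., the boid count used in the tests
const int blockSize = 128;      // the tunable parameter varied above
dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(
    numObjects, dev_pos, dev_vel1, dev_vel2);
```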
-
Q: For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
A: Yes. This is slightly surprising considering the extra copying that must be done every step. Evidently, the cost of gathering the data from one set of buffers into cell-ordered ones (and scattering the results back) is far less than the cost of the large number of incoherent memory accesses it eliminates, since neighboring threads then read neighbor data from mostly contiguous locations.
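  A minimal sketch of the gather step that makes the accesses coherent; buffer names such as `particleArrayIndices` follow the typical uniform-grid setup and are assumptions, not necessarily this project's exact code:

```cuda
#include <glm/glm.hpp>

// Hedged sketch of the gather step: after sorting boid indices by grid
// cell, copy each boid's data into cell-ordered buffers so that boids in
// the same cell are contiguous in memory for the neighbor-search kernel.
__global__ void kernReshuffleBoidData(int n, const int *particleArrayIndices,
                                      const glm::vec3 *pos, const glm::vec3 *vel,
                                      glm::vec3 *coherentPos, glm::vec3 *coherentVel) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int src = particleArrayIndices[i];  // original index of the i-th sorted boid
  coherentPos[i] = pos[src];
  coherentVel[i] = vel[src];
}
```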
-
Q: Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
A: Yes. Using 27 cells is actually faster in my profiling. With a neighborhood search radius of d, the 27-cell scheme scans 27 cells of width d, while the 8-cell scheme scans 8 cells of width 2d, so the searched volume shrinks by more than half (see the comparison below). Each thread therefore has to check fewer candidate boids, which is especially beneficial when the total number of boids is large.
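  A quick check of the volume claim, with neighborhood search radius $d$ (cells of width $d$ for the 27-cell scheme versus $2d$ for the 8-cell scheme):

$$\frac{27 \cdot d^3}{8 \cdot (2d)^3} = \frac{27}{64} \approx 0.42,$$

  so the 27-cell search covers less than half the volume of the 8-cell search.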
CUDA events are used to record the execution time of each timestep; see data.xlsx for the raw data and graphs. The measurements were taken with 100,000 boids. A minimal sketch of the timing code follows.
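```cuda
// Hedged sketch of the event-based timing described above: two CUDA
// events bracket one simulation step, and cudaEventElapsedTime reports
// the elapsed time in milliseconds.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... launch the simulation kernels for one timestep here ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait until the step has actually finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```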