
University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking

  • Xuanyi Zhou
  • Tested on: Windows 10, i7-9750H @ 2.60GHz, 32GB, RTX 2060 6GB (Personal Laptop)

Results

(Screenshots of the running flocking simulation.)

Performance Analysis

Questions

  • Q: For each implementation, how does changing the number of boids affect performance? Why do you think this is?

    A: The more boids there are, the longer one time step takes to simulate. The naive implementation is O(n²), where n is the number of boids, since every boid checks every other boid (a sketch of this brute-force loop appears after this list). For the grid-based implementations, when there are relatively few boids, the expected number of boids in one boid's neighborhood is constant, so the time complexity is Ω(n) in that regime. However, as the number of boids increases, the grid-based versions also converge towards O(n²), since the simulation happens in a finite volume and each neighborhood becomes increasingly crowded. Although the computation happens on the GPU, the GPU can only reduce computation time by a constant factor once the number of boids grows beyond the amount of available computation hardware.

  • Q: For each implementation, how does changing the block count and block size affect performance? Why do you think this is?

    A: From my profiling, increasing the block size reduces computation time. My hypothesis is that larger blocks reduce the amount of block scheduling the GPU has to do. Another possible explanation would be a block size smaller than the warp size, which would leave part of each warp idle and prevent full utilization of the computation hardware; however, the profiling data does not support this, since performance still improves when the block size grows from 32, which already fills a warp, to 64.

  • Q: For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?

    A: Yes. This is slightly surprising considering the extra copying that needs to be done: positions and velocities are gathered into cell-ordered buffers before the update and scattered back afterwards (see the gather sketch after this list). Apparently the cost of this gather/scatter is far smaller than the cost of the large number of incoherent memory accesses it eliminates.

  • Q: Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!

    A: Yes. Using 27 cells is actually faster in my profiling. Halving the cell width to the search radius d means the 27 cells cover a volume of 27d³, while the 8 full-sized cells cover 8·(2d)³ = 64d³, so the searched volume shrinks by more than half. Each thread therefore checks a smaller number of candidate boids, which is especially beneficial when the total number of boids is large (see the cell-loop sketch after this list).
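
As referenced above, here is a minimal sketch of the brute-force O(n²) update; the kernel name, the glm vector types, and the ping-pong velocity buffers are assumptions in the style of the assignment's starter code, not necessarily the exact code in this repository:

```cuda
#include <glm/glm.hpp>

__global__ void kernUpdateVelocityBruteForce(int N, const glm::vec3 *pos,
                                             const glm::vec3 *vel1,
                                             glm::vec3 *vel2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) {
    return;
  }
  glm::vec3 change(0.0f);
  // Every boid scans every other boid -- this inner loop is what makes the
  // naive version O(n^2) in the number of boids.
  for (int j = 0; j < N; ++j) {
    if (j == i) {
      continue;
    }
    // Accumulate cohesion, separation, and alignment contributions from
    // pos[j] and vel1[j] into `change` (rule details elided).
  }
  vel2[i] = vel1[i] + change;
}
```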
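
The gather step mentioned in the coherent-grid answer looks roughly like the following; a minimal sketch assuming `particleArrayIndices` has already been sorted by grid cell index, with buffer and kernel names chosen for illustration rather than taken from this repository (glm types as above):

```cuda
__global__ void kernGatherCoherentData(int N, const int *particleArrayIndices,
                                       const glm::vec3 *pos, const glm::vec3 *vel,
                                       glm::vec3 *posCoherent, glm::vec3 *velCoherent) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) {
    return;
  }
  // After this copy, boids that share a grid cell sit next to each other in
  // memory, so the neighbor search reads posCoherent/velCoherent coherently.
  int src = particleArrayIndices[i];
  posCoherent[i] = pos[src];
  velCoherent[i] = vel[src];
}
```

A launch such as `kernGatherCoherentData<<<(N + blockSize - 1) / blockSize, blockSize>>>(...)` is also where the block-size experiments above plug in: the block size only changes how the same amount of work is partitioned, not the work itself.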
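
For the cell-width question, the cell loop with half-sized cells can be sketched as follows; the grid layout (`gridMin`, `gridResolution`, per-cell start/end index arrays) follows the usual uniform-grid scheme and is an assumption rather than a copy of this repository's code:

```cuda
__device__ void visitNeighborCells(glm::vec3 selfPos, float cellWidth,
                                   glm::vec3 gridMin, int gridResolution,
                                   const int *gridCellStartIndices,
                                   const int *gridCellEndIndices) {
  // Cell containing this boid; with cellWidth equal to the search radius,
  // neighbors can only live in the 3x3x3 block of cells around it.
  glm::ivec3 c = glm::ivec3((selfPos - gridMin) / cellWidth);
  for (int z = -1; z <= 1; ++z) {
    for (int y = -1; y <= 1; ++y) {
      for (int x = -1; x <= 1; ++x) {
        glm::ivec3 n = c + glm::ivec3(x, y, z);
        if (n.x < 0 || n.y < 0 || n.z < 0 ||
            n.x >= gridResolution || n.y >= gridResolution ||
            n.z >= gridResolution) {
          continue;  // cell lies outside the simulation volume
        }
        int cell = n.x + n.y * gridResolution +
                   n.z * gridResolution * gridResolution;
        int start = gridCellStartIndices[cell];
        int end = gridCellEndIndices[cell];
        // Apply the flocking rules to the boids at indices [start, end).
      }
    }
  }
}
```

Even though 27 cells are visited instead of 8, the total volume scanned shrinks from 64d³ to 27d³, so fewer candidate boids are examined overall.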

Profiling

CUDA events are used to record the execution time of each timestep. See data.xlsx for raw data and graphs.
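
The measurement pattern is roughly the following; a minimal sketch in which the timed call (`Boids::stepSimulationCoherentGrid`) and `dt` stand in for whichever simulation version and timestep are being profiled:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
Boids::stepSimulationCoherentGrid(dt);   // one simulation timestep
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until the step has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```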

Average Execution Time of All Versions With Respect to the Number of Boids


Average Execution Time of All Versions With Respect to the Number of Boids, Logarithmic Scale


Execution Time of the Coherent Grid Version With Respect to the Number of Boids


Execution Time of the Coherent Grid Version With Respect to Block Size

This is tested using 100000 boids.


Execution Time of the Coherent Grid Version With Respect to Cell Size

This is tested using 100000 boids.

