You could try using cudaMallocHost or cudaMallocManaged instead of cudaMalloc so that parts of the data are stored in CPU RAM while still being accessible from the GPU. I don't know, however, whether this works on Windows.
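If you want to try it, a minimal sketch of the managed-memory variant might look like the following (the Particle struct and count here are hypothetical stand-ins for whatever the engine actually uses):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical particle layout -- a stand-in for the engine's real struct.
struct Particle { float x, y, z, w; float value; };

int main() {
    const size_t n = 1000ULL * 1000ULL * 1000ULL;  // 1000^3 cells
    Particle* particles = nullptr;

    // One pointer, visible from both host and device; the driver migrates
    // pages between CPU RAM and VRAM on demand.
    cudaError_t err = cudaMallocManaged(&particles, n * sizeof(Particle));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... launch kernels that read/write particles[] as usual ...
    cudaFree(particles);
    return 0;
}
```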
How can users tell whether it is running successfully? When I compile and run the posted code, it writes the GPUs present in my system to the console and opens a window called “Circle Window”. It also appeared to chew through all 32 GB of my system memory, at which point I terminated the application.
You might want to add a command-line argument for the cube size to the application; then it will be easy to test with successively larger cube sizes until it fails. BTW, I don't see status checking on the CUDA API calls; you might want to add that as well and terminate the application if one of them fails.
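For the status checking, one common pattern is a wrapper macro around every runtime call. This is a sketch of my own, not something from the posted code (the 32 bytes/particle figure is taken from the discussion below):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print a useful message and terminate if any CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",             \
                    cudaGetErrorName(err_), __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

int main(int argc, char** argv) {
    // Cube side length from the command line, defaulting to 100.
    int spaceDim = (argc > 1) ? atoi(argv[1]) : 100;
    size_t bytes = (size_t)spaceDim * spaceDim * spaceDim * 32;  // 32 B/particle

    void* buf = nullptr;
    CUDA_CHECK(cudaMalloc(&buf, bytes));  // terminates with a message on failure
    printf("Allocated %zu bytes for a cube of side %d\n", bytes, spaceDim);
    CUDA_CHECK(cudaFree(buf));
    return 0;
}
```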
Here’s the report of the test:
cudaMallocHost() allocates in RAM and shared VRAM, so the cube cannot be displayed, since my shared VRAM is limited to 8 GB.
cudaMallocManaged() allocates in RAM and dedicated VRAM; same issue as with cudaMallocHost() and cudaMalloc().
Note that both of these allocators cut CUDA performance by about 70%.
Thanks anyway.
I have a full CPU version of my engine, and the 1000^3 cube displays even though I only have 16 GB of RAM.
I guess you are just testing the largest size you can cudaMalloc, but you should be able to estimate how much memory you need without having to crowdsource it. sizeof your particle is 32 bytes, so you need (x * x * x * 32 + some buffer) < GPU memory size. For x = 1000 you're going to need a GPU with more than 32 GB. There aren't too many of those; the Quadro RTX 8000 comes to mind. Perhaps you should see if you can reduce the size of your particle.
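If you'd rather have the program check this than guess, cudaMemGetInfo reports free and total device memory; a small sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t x = 1000;          // cube side length
    const size_t perParticle = 32;  // bytes per particle, per the struct size above
    const size_t needed = x * x * x * perParticle;

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("Need %.1f GB; device has %.1f GB free of %.1f GB total\n",
           needed / 1e9, freeBytes / 1e9, totalBytes / 1e9);
    if (needed > freeBytes)
        printf("A %zu^3 cube will not fit on this GPU.\n", x);
    return 0;
}
```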
Because your particle struct uses a __m128 datatype, which must be aligned on a 16-byte boundary, the struct is forced to occupy 32 bytes even though it only “needs” 16 + 4 bytes. You could cut your memory consumption almost in half just by rearranging from AoS to SoA, as sketched below. (And do you really need 16 bytes for coordinates? It doesn't look like it; if you drop the W element, you could cut your memory consumption in half.)
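To make the layout point concrete, here is a sketch of the two arrangements (the field names are guesses at what the real struct might contain):

```cpp
#include <immintrin.h>

// AoS: the __m128 member forces 16-byte alignment, so the struct pads
// out to 32 bytes even though only 20 bytes carry data.
struct ParticleAoS {
    __m128 pos;    // x, y, z, w -- 16 bytes, 16-byte aligned
    float  value;  // 4 bytes of data, then 12 bytes of padding
};
static_assert(sizeof(ParticleAoS) == 32, "padded up to the alignment");

// SoA: separate arrays, no per-element padding -- 20 bytes per particle,
// or 16 if the unused W component is dropped.
struct ParticlesSoA {
    float* x;
    float* y;
    float* z;
    // float* w;  // drop W entirely if it is never used
    float* value;
};
```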
I can get to at least a SpaceDim of 600 on my RTX 2070, and get about 40 FPS.