I’ve been working on porting a small (2 functions), well-used library to CUDA. The code should benefit hugely from parallelization, since much of the computation consists of several reductions.
I have examined the reduction example quite thoroughly, and after running some tests with the --shmoo option, I have a few questions:
- Why do all of the reduction kernels (with the exception of #6, the most optimized one) suddenly drop in computation time when they hit the 16M- and 32M-element tests? Is there some kind of overflow going on here?
- When running the test with 64, 128 (the default), and 256 threads, the 128-thread version seems to run the fastest. Is there any particular reason for this? From my (admittedly limited) understanding, the code should run faster with more threads per block (assuming they can all fit within the block’s memory limits). Can someone explain why this is not the case?
- Could someone explain the SharedMemory struct? I understand that it has to do with templating (so the function can be used with various types), but I’m a bit confused about what it returns: does it return a pointer to the entire shared memory block? Or is the block divided among the threads, so that the returned pointer points to the memory allotted to the calling thread?
P.S. – To the CUDA developers: I also noticed that in the header comments (where the command line options are defined), you forgot to mention the ‘type’ argument…it’s in the code, but not in the comments. Perhaps you can add this in the next release so that others can test with the other data types? It’s a shame to have templated code and not be able to use it ;)