Block size and optimization questions regarding the reduction example project

I’ve been working on porting a small (2 functions), well-used library to CUDA. The code should benefit hugely from parallelization, since much of the computation involves several reductions.

I have examined the reduction example quite thoroughly, and after running some tests with the --shmoo option, I have a few questions:

  1. Why do all of the reduction kernels (with the exception of #6 – the most optimized one) suddenly drop in computation time when they hit the 16M and 32M element test? Is there some kind of overflow going on here?

  2. When running the test with 64, 128 (default), and 256 threads per block, the 128-thread version runs the fastest. Is there any particular reason for this? From my (admittedly limited) understanding, the code should run faster with more threads per block (assuming they all fit within the block’s shared-memory limits). Can someone explain why this is not the case?

  3. Could someone explain the SharedMemory struct? I understand that it has to do with templating (so that the function can be used with various types), but I’m a bit confused about what it returns: does it return a pointer to the entire shared memory block? Or is the block divided by the number of threads, such that the returned pointer points to the memory allotted to the calling thread?

P.S. – To the CUDA developers: I also noticed that in the header comments (where the command line options are defined), you forgot to mention the ‘type’ argument. It’s in the code, but not in the comments. Perhaps you can add this in the next release so that others can test with the other data types? It’s a shame to have templated code and not be able to use it ;)

Bump for an answer?

Compile in debug mode; you might get a kernel launch failure reported. Shorter runtimes for more elements are suspicious ;)
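To see such a failure explicitly rather than relying on debug mode, you can check the error state right after each launch. Below is a minimal sketch (the `CUDA_CHECK` macro and `dummyKernel` are hypothetical names, not from the SDK sample); note that a launch exceeding a hardware limit fails silently unless you ask for the error, which is why a "failed" run can look impossibly fast:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *out) { out[threadIdx.x] = threadIdx.x; }

// Hypothetical helper: report any pending error right after a kernel call.
#define CUDA_CHECK(msg)                                                   \
    do {                                                                  \
        cudaError_t err = cudaGetLastError();                             \
        if (err != cudaSuccess)                                           \
            fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));    \
    } while (0)

int main() {
    int *d_out = nullptr;
    cudaMalloc(&d_out, 64 * sizeof(int));

    // Intentionally bad launch: more threads per block than the hardware allows.
    // Without the check below, this failure would pass unnoticed.
    dummyKernel<<<1, 5000>>>(d_out);
    CUDA_CHECK("bad launch");    // reports "invalid configuration argument"

    // Valid launch for comparison.
    dummyKernel<<<1, 64>>>(d_out);
    cudaDeviceSynchronize();
    CUDA_CHECK("good launch");   // prints nothing

    cudaFree(d_out);
    return 0;
}
```

(Requires an NVIDIA GPU to run; compile with nvcc.)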

Hmm, I would need to look at the code to say for sure. If you run the example in the Visual Profiler you might see differences (it may be that 128 threads per block gives better occupancy because of shared-memory constraints).
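With a recent toolkit (6.5 or later) you can also query this directly from the runtime API instead of the profiler. A sketch, assuming a reduction kernel that allocates one float of dynamic shared memory per thread (the kernel body here is a placeholder, not the SDK code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel with the same resource shape as the reduction sample:
// dynamic shared memory sized at launch time.
__global__ void reduceKernel(const float *g_in, float *g_out) {
    extern __shared__ float sdata[];
    sdata[threadIdx.x] = g_in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    if (threadIdx.x == 0) g_out[blockIdx.x] = sdata[0];
}

int main() {
    const int sizes[] = {64, 128, 256, 512};
    for (int threads : sizes) {
        int numBlocks = 0;
        size_t smem = threads * sizeof(float);  // one float per thread
        // How many blocks of this size can be resident per SM, given the
        // kernel's register and shared-memory usage?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, reduceKernel, threads, smem);
        printf("%4d threads/block -> %d resident blocks/SM (%d threads/SM)\n",
               threads, numBlocks, numBlocks * threads);
    }
    return 0;
}
```

If 128 threads/block yields more resident threads per SM than 256, that alone can explain the timing difference you saw.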

You can see it in sharedmem.cuh:

    template <>
    struct SharedMemory <float>
    {
        __device__ float* getPointer() { extern __shared__ float s_float[]; return s_float; }
    };


It returns a pointer to the entire shared memory block, so afterward each thread needs to offset into this array itself, just as it would without the SharedMemory struct.
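To make that concrete, here is a self-contained sketch of how the sample uses it: a condensed stand-in for SharedMemory (the real sharedmem.cuh leaves the generic template unspecialized and provides per-type specializations) plus a simple tree reduction where every thread indexes the shared block by its own threadIdx:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Condensed stand-in for the SDK's SharedMemory<T>: returns the BASE pointer
// of the dynamic shared-memory block, not a per-thread slice.
template <typename T>
struct SharedMemory {
    __device__ T *getPointer() {
        extern __shared__ int s[];
        return (T *)s;
    }
};

template <typename T>
__global__ void reduceSum(const T *g_in, T *g_out) {
    SharedMemory<T> smem;
    T *sdata = smem.getPointer();  // same base pointer in every thread

    unsigned int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];  // thread offsets itself
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

int main() {
    const int n = 128;
    float h_in[n], h_out = 0.0f;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter sizes the dynamic shared block: one T per thread.
    reduceSum<float><<<1, n, n * sizeof(float)>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    assert(h_out == 128.0f);  // 128 ones summed

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The key point: getPointer() gives every thread the same address, and the division of the block among threads is entirely up to your indexing code.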