Lots of Threads vs. Shared Memory

Hey everyone,

Hopefully this is a quick question. Say I have a 1024 x 1024 float1 array, and the same operation needs to be applied to every element. Is it better to run one thread per element (Max Threads = 768, # of blocks = 1024^2/768), or to load one row into shared memory per thread and run only 1024 threads in 2 blocks? Or are both approaches horrendous, and is there a better way?
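To be concrete, by "one thread per element" I mean a kernel along these lines (just a sketch; the kernel name and the per-element operation are placeholders, not my actual code):

```cuda
#define N (1024 * 1024)   // total element count for the 1024 x 1024 array

// One thread per element; the multiply-by-2 is a stand-in for the real operation.
__global__ void perElementKernel(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                    // guard the last, possibly partial, block
        data[idx] = data[idx] * 2.0f;
}
```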

Thanks!

That depends on your program; you really have to benchmark. (Unless you, e.g., only add a value to each element; in that case you should not put it in shared memory.)
Note that you can have only 512 threads per block. To get 768 threads per multiprocessor, use a max of 256 threads per block.

OK, that helps a lot, actually. Each element has very few operations performed on it, so the speedup from shared memory would probably be minimal.

So when I call the GPU function with Whatever<<<blocks, threads>>>(...), threads can only be 256? I am using 512 right now and everything is working out OK.
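In other words, a launch like this (hypothetical host-side snippet; Whatever and d_data stand in for my actual kernel and device pointer):

```cuda
// Hypothetical launch configuration for the 1024 x 1024 array.
int threads = 512;                        // currently using 512 threads per block
int blocks  = (1024 * 1024) / threads;    // 2048 blocks; N divides evenly here
Whatever<<<blocks, threads>>>(d_data);
```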

You can use 512, but if you use block synchronization and want the ability to run more than one block at a time per multiprocessor, you can run a max of 384 threads per block (since you can only have 768 on a multiprocessor).

Brian

Thanks to you both for bringing the thread-count issue to my attention. I just modified the code so that MAX_THREADS = 256, and it has improved the speed nicely at large input data loads. Thanks!

Try running your code with the Visual Profiler so you can see the exact load on your GPU and check whether 256 threads per block is a better choice than 384.

It's worth trying.

Unfortunately, I am running the code in a DLL. I downloaded the profiler earlier today, and it looks like it needs an .exe file. Is this correct? I can try 384 threads and see if that works.

Another quick question, how much shared memory is available per block? I can’t seem to find the answer in the CUDA Programming Guide.

16 kB (See Appendix A.1, pg. 74 in the CUDA 1.1 Programming Guide)

If you use it all, then only one block will run per multiprocessor.

For all hardware-related info, consult the deviceQuery program from the SDK :D