Lots of Threads vs. Shared Memory

Hey everyone,

Hopefully this is a quick question. Say I have a 1024 x 1024 float1 array, and the same operation needs to be applied to every element. Is it better to run one thread per element (Max Threads = 768, # of blocks = 1024^2/768), or to load one row into shared memory per thread and run only 1024 threads in 2 blocks? Or are both approaches horrendous, and is there a better way?
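To be concrete, by "one thread per element" I mean a kernel along these lines (just a sketch; the kernel name and the per-element operation are placeholders, not my actual code):

```cuda
#define N (1024 * 1024)   // total element count for the 1024 x 1024 array

// One thread per element; the multiply-by-2 is a stand-in for the real operation.
__global__ void perElementKernel(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                    // guard the last, possibly partial, block
        data[idx] = data[idx] * 2.0f;
}
```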

Thanks!

That depends on your program; you really have to benchmark. (Unless you, e.g., only add a value to each element; in that case you should not put it in shared memory.)
Note that you can have only 512 threads per block. To get 768 threads per multiprocessor, use a max of 256 threads per block.

OK, that helps a lot, actually. Each element has very few operations performed on it, so the speedup from shared memory would probably be minimal.

So when I call the GPU function with Whatever<<<blocks, threads>>>(...), threads can only be 256? I am using 512 right now and everything is working out OK.
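In other words, a launch like this (hypothetical host-side snippet; Whatever and d_data stand in for my actual kernel and device pointer):

```cuda
// Hypothetical launch configuration for the 1024 x 1024 array.
int threads = 512;                        // currently using 512 threads per block
int blocks  = (1024 * 1024) / threads;    // 2048 blocks; N divides evenly here
Whatever<<<blocks, threads>>>(d_data);
```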

You can use 512, but if you use block synchronization and want the ability to run more than one block at a time per multiprocessor, you can run a max of 384 threads per block (since you can only have 768 on a multiprocessor).

Brian

Thanks to you both for bringing the thread-count issue to my attention. I just modified the code so that MAX_THREADS = 256, and it has improved the speed nicely at large input data loads. Thanks!

Try running your code with the Visual Profiler so you can see the exact load on your GPU and check whether 256 threads per block is a better choice than 384.

It's worth trying.

Unfortunately, I am running the code in a DLL. I downloaded the profiler earlier today, and it looks like it needs an .exe file. Is this correct? I can try 384 threads and see if that works.

Another quick question, how much shared memory is available per block? I can’t seem to find the answer in the CUDA Programming Guide.

16 kB (See Appendix A.1, pg. 74 in the CUDA 1.1 Programming Guide)

If you use it all, then only one block will run per multiprocessor.

For all hardware-related info, consult the deviceQuery program from the SDK :D