I am porting an algorithm to the CUDA GPU and I plan to limit accesses to global memory by leveraging the per-block shared memory. What are the implications of using one thread per block so that the one thread can use the shared memory vs multiple threads per block that use global memory? Is it a common strategy to use multiple blocks, each with a single thread, in order to have more shared memory per thread than usual?
Threads are executed in SIMD fashion in groups of 32 (a “warp” in CUDA nomenclature). Using 1 thread per block wastes roughly 97% of the computational capacity of the device (31 of every 32 SIMD lanes sit idle), so to answer your question: no, it is not a common strategy to use 1 thread per block.
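To make that concrete, here is a minimal sketch of a conventional launch configuration (the kernel, names, and sizes are illustrative, not from your code): block sizes are chosen as a multiple of the warp size so every SIMD lane does useful work.

__global__ void scale(float *data, int n, float s)
{
    // One element per thread; the guard handles a partial last block.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        data[gid] *= s;
}

// 256 threads per block = 8 full warps; a <<<numBlocks, 1>>> launch
// would leave 31 of every 32 SIMD lanes idle.
// scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);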
In my design, each thread needs to manage a buffer. I could have this buffer draw from global memory, but that would incur a memory access cost. Alternatively, I could have a single thread hog all the shared memory of a block and then run multiple blocks in parallel, each with a single thread, but it sounds like those blocks would not really run in parallel, would they? It is still possible to partition the shared memory among the threads of a block, but I get the sense that shared memory is meant for threads to cooperate, which is not my intent.
How are you intending to get data into the shared memory in the first place?
Shared memory is a per-multiprocessor resource. Simplifying things a bit, the number of blocks per multiprocessor is the total shared memory per MP divided by the shared memory requirement per block (registers and a few other things also have an impact). If you take the maximum amount of shared memory per block, only 1 block will run per MP. With only 1 thread per block, that yields very low occupancy and efficiency: the overwhelming majority of the GPU's resources will sit idle.
That is what it is for - think of it as programmer-controllable, per-block cache memory if you like.
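If it helps, here is a minimal sketch of how to check that arithmetic with the runtime's occupancy API (the kernel name and placeholder body are my own illustration):

#include <cstdio>

__global__ void bufferKernel(float *out)
{
    extern __shared__ unsigned char smem[];  // dynamically sized shared memory
    smem[threadIdx.x] = 0;                   // placeholder work
    if (threadIdx.x == 0)
        out[blockIdx.x] = smem[0];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime how many blocks can be resident on one multiprocessor
    // when each block requests the device's full shared memory allowance.
    int activeBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocks, bufferKernel, /*blockSize=*/1,
        prop.sharedMemPerBlock);
    printf("Resident blocks per MP: %d\n", activeBlocks);  // expect 1
    return 0;
}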
Just out of interest, how big is the “buffer” each thread would manage?
And how many buffers do you need?
There would be several buffers and arrays ranging in size. I think the total memory required by a thread would be about 512 bytes. My current approach is to dimension the buffers according to a MAX number of threads per block, partitioning the shared memory within a block so that each thread gets its own portion. In that sense it is not really shared, but it would be faster than using global memory.
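For reference, a minimal sketch of that partitioning scheme (THREADS_PER_BLOCK, the kernel, and the placeholder work are illustrative assumptions; the 512 bytes is the estimate from above):

#define BYTES_PER_THREAD 512
#define THREADS_PER_BLOCK 64   // 64 * 512 B = 32 KB of shared memory per block

__global__ void perThreadBufferKernel(const float *in, float *out, int n)
{
    __shared__ unsigned char smem[THREADS_PER_BLOCK * BYTES_PER_THREAD];

    // This thread's private slice; no __syncthreads() is needed because
    // no thread ever touches another thread's slice.
    float *myBuf = reinterpret_cast<float *>(
        &smem[threadIdx.x * BYTES_PER_THREAD]);   // room for 128 floats

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        myBuf[0] = in[gid];   // stand-in for the real per-thread buffer work
        out[gid] = myBuf[0];
    }
}

One thing to watch: a 512-byte stride means every thread's slice starts in the same shared memory bank, so a warp accessing the same offset in all slices will serialize into bank conflicts; interleaving or padding the layout avoids that.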