I am porting an algorithm to the CUDA GPU and I plan to limit accesses to global memory by leveraging the per-block shared memory. What are the implications of using one thread per block so that the one thread can use the shared memory vs multiple threads per block that use global memory? Is it a common strategy to use multiple blocks, each with a single thread, in order to have more shared memory per thread than usual?
Threads are executed in SIMD fashion in groups of 32 (a “warp” in CUDA nomenclature). Using 1 thread per block wastes roughly 97% of the computational capacity of the device (31 of every 32 SIMD lanes sit idle), so to answer your question: no, it is not a common strategy to use 1 thread per block.
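To make that concrete, here is a minimal sketch of a conventional launch configuration (the kernel, names, and sizes are illustrative, not from your code): block sizes are chosen as a multiple of the warp size so every SIMD lane does useful work.

__global__ void scale(float *data, int n, float s)
{
    // One element per thread; the guard handles a partial last block.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        data[gid] *= s;
}

// 256 threads per block = 8 full warps; a <<<numBlocks, 1>>> launch
// would leave 31 of every 32 SIMD lanes idle.
// scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);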
In my design, each thread needs to manage a buffer. I could have this buffer draw from global memory, but that would incur a memory access cost. Alternatively, I could have a single thread hog all the shared memory of a block and then run multiple blocks in parallel, each with a single thread, but it sounds like those blocks would not really run in parallel, would they? It is still possible to partition the shared memory among the threads of a block, but I get the sense that shared memory is meant for threads to cooperate, which is not my intent.
How are you intending to get data into the shared memory in the first place?
Shared memory is a per-multiprocessor resource. Simplifying things a bit, the number of blocks per multiprocessor is the total shared memory per MP divided by the shared memory requirement per block (registers and a few other things also have an impact). If you take the maximum amount of shared memory per block, only 1 block will run per MP. With only 1 thread per block, that yields very low occupancy and efficiency: the overwhelming majority of the GPU's resources will sit idle.
That is what it is for - think of it as programmer-controllable, per-block cache memory if you like.
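If it helps, here is a minimal sketch of how to check that arithmetic with the runtime's occupancy API (the kernel name and placeholder body are my own illustration):

#include <cstdio>

__global__ void bufferKernel(float *out)
{
    extern __shared__ unsigned char smem[];  // dynamically sized shared memory
    smem[threadIdx.x] = 0;                   // placeholder work
    if (threadIdx.x == 0)
        out[blockIdx.x] = smem[0];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime how many blocks can be resident on one multiprocessor
    // when each block requests the device's full shared memory allowance.
    int activeBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocks, bufferKernel, /*blockSize=*/1,
        prop.sharedMemPerBlock);
    printf("Resident blocks per MP: %d\n", activeBlocks);  // expect 1
    return 0;
}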
Just out of interest, how big is the “buffer” each thread would manage?
And how many buffers do you need?
There would be several buffers and arrays ranging in size. I think the total memory required by a thread would be about 512 bytes. My current approach is to dimension the buffers according to a MAX number of threads per block, partitioning the shared memory within a block so that each thread gets its own portion. In that sense it is not really shared, but it would be faster than using global memory.
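For reference, a minimal sketch of that partitioning scheme (THREADS_PER_BLOCK, the kernel, and the placeholder work are illustrative assumptions; the 512 bytes is the estimate from above):

#define BYTES_PER_THREAD 512
#define THREADS_PER_BLOCK 64   // 64 * 512 B = 32 KB of shared memory per block

__global__ void perThreadBufferKernel(const float *in, float *out, int n)
{
    __shared__ unsigned char smem[THREADS_PER_BLOCK * BYTES_PER_THREAD];

    // This thread's private slice; no __syncthreads() is needed because
    // no thread ever touches another thread's slice.
    float *myBuf = reinterpret_cast<float *>(
        &smem[threadIdx.x * BYTES_PER_THREAD]);   // room for 128 floats

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        myBuf[0] = in[gid];   // stand-in for the real per-thread buffer work
        out[gid] = myBuf[0];
    }
}

One thing to watch: a 512-byte stride means every thread's slice starts in the same shared memory bank, so a warp accessing the same offset in all slices will serialize into bank conflicts; interleaving or padding the layout avoids that.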