I was just wondering about the significance of the thread block size when running a kernel on the graphics card using CUDA.
If we have a bigger block size, more threads run concurrently in a block than with a smaller block size. What difference does it make, since the threads of a block execute in warps of a fixed size anyway?
While we are at it, how does the grid dimension influence the kernel execution?
I tried playing with different values of block size and grid size and measured the execution time in each case. But I only tried this on computationally light programs, so I could not find much variation between the different cases.
I’m also not sure how the shared memory gets affected by the number of threads in a block since all of them have the same shared memory.
As for your question, I’m not really sure which option is better, but I guess it must have something to do with memory considerations… :wacko:
To get maximum performance, you have to run multiple threads so that latencies are hidden.
Running multiple threads simultaneously can be achieved in two ways:
Run one single block with as many threads as possible.
Run multiple blocks, each with fewer threads.
When I say single, multiple – I am talking about CUDA occupancy here i.e. the number of blocks that can simultaneously run on a single multiprocessor and NOT about the number of blocks you spawn for your kernel.
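To make "blocks that can simultaneously run on a single multiprocessor" concrete, here is a minimal sketch that asks the CUDA runtime for that number at several block sizes. `scaleKernel` and `residentWarps` are hypothetical names made up for this illustration, and the occupancy query was added to the runtime API in a later CUDA release than the one discussed in this thread, so treat it as an assumption about your toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute your own __global__ function.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Warps resident on one multiprocessor for a given number of active blocks.
int residentWarps(int activeBlocks, int blockSize, int warpSize)
{
    int warpsPerBlock = (blockSize + warpSize - 1) / warpSize;
    return activeBlocks * warpsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device available\n");
        return 1;
    }
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    for (int blockSize = 32; blockSize <= 512; blockSize *= 2) {
        int activeBlocks = 0;
        // Runtime query: how many blocks of this size fit on one multiprocessor
        // for this kernel, with no dynamic shared memory.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, scaleKernel,
                                                      blockSize, 0);
        int warps = residentWarps(activeBlocks, blockSize, prop.warpSize);
        printf("blockSize %3d: %2d active blocks/SM, %2d of %2d warps resident\n",
               blockSize, activeBlocks, warps, maxWarps);
    }
    return 0;
}
```

The printed numbers depend on the GPU and on the kernel's register and shared memory usage, so none are shown here.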
This has been discussed many times in this forum. Kindly perform a search.
As far as I am concerned, I have seen good performance with just 32-threads per block and by using multiple blocks per multiprocessor (occupancy)
I have posted about this before. This is what I do:
32 threads per block
I never use __syncthreads()
I never use double-buffering (so my shared memory footprint is smaller)
I usually try to run at least 6 active blocks per multiprocessor (192 threads are needed to hide register latencies anyway)
The only disadvantage is that the number of registers used per block is 64 * NUM_REGISTERS_PER_THREAD instead of 32 * NUM_REGISTERS_PER_THREAD. So, if your register count is big, this can affect your occupancy, and any occupancy below 6 blocks is bound to be slow for this configuration.
The best thing you can do is to write your kernel to work for any blockDim and then benchmark it for all block dimensions that are multiples of 32. Pick the fastest size for your production application. Different kernels will have different optimal block sizes. For example: Sarnath’s code is optimal at 32 threads per block, while many of my kernels have a 100% performance penalty at that size compared to the optimal at 448. As the block size sweeps from 64 to 448, performance swings up and down by ~50%.
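The benchmarking advice above could be sketched roughly as follows. The kernel `saxpyKernel`, the helper `blocksFor`, the problem size, and the repeat count are all assumptions chosen for illustration. The kernel uses a grid-stride loop so that it stays correct for any blockDim.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel; the grid-stride loop keeps it valid for any blockDim.
__global__ void saxpyKernel(int n, float a, const float *x, float *y)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Number of blocks needed to cover n elements with the given block size.
int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

int main()
{
    const int n = 1 << 22;   // assumed problem size
    const int reps = 20;     // assumed repeat count per measurement
    float *x = NULL, *y = NULL;
    if (cudaMalloc(&x, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&y, n * sizeof(float)) != cudaSuccess) {
        printf("Device allocation failed\n");
        return 1;
    }
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int bestSize = 0;
    float bestMs = 1e30f;
    for (int blockSize = 32; blockSize <= 512; blockSize += 32) {
        int grid = blocksFor(n, blockSize);
        saxpyKernel<<<grid, blockSize>>>(n, 2.0f, x, y);   // warm-up launch
        cudaEventRecord(start);
        for (int r = 0; r < reps; r++)
            saxpyKernel<<<grid, blockSize>>>(n, 2.0f, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("blockSize %3d: %.3f ms per launch\n", blockSize, ms / reps);
        if (ms < bestMs) {
            bestMs = ms;
            bestSize = blockSize;
        }
    }
    printf("fastest block size in this run: %d\n", bestSize);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

A memory-bound kernel like this one may show little variation; the sweep is more informative on the kernel you actually intend to ship.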
If you want to achieve the maximum throughput of the device, you’ll need to launch more than ~14,000 threads as the device is capable of keeping that many in flight.
Good strategy, but it is not great since you have to impose many constraints on yourself. Sometimes I have found that __syncthreads() costs almost nothing, and it does decrease register usage and increase occupancy. I think shared memory is a great idea, especially for achieving highly coalesced access; it is what makes the difference between CUDA and graphics APIs, so I exploit it whenever I can.
What is the double-buffering technique that you mention?
That's news to me. My experiments showed that __syncthreads() does eat up performance, especially when you use it in a FOR loop.
I don't understand why you are talking about highly coalesced shared memory access here.
Anyway, to talk about “double-buffering”, let me quote a small example
consider this FOR loop:
for(i=0; i<N; i++)
a[i] = a[i] + a[i+1];
Consider parallelizing this FOR loop. As a first attempt, you would assign N threads, one per element. Each thread loads a[threadId] and a[threadId+1], adds them, and stores the result in a[threadId].
When more than one WARP performs this operation inside a block, there will be a RACE condition between the last thread of one warp and the first thread of the next warp, because WARP scheduling order is NOT deterministic. One way to fix this is to load a[i] and a[i+1] into local variables, call __syncthreads(), and then add them and store the result. But that would prove costly because of the __syncthreads(). The usual way to do this is by double buffering.
Have 2 arrays: compute b[i] = a[i]+a[i+1] and then return “b” instead of “a”. That is what is called double-buffering: a read buffer plus a write buffer. It can help you parallelize algorithms that look inherently un-parallelizable.
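A minimal sketch of the double-buffered version, assuming the input array holds N+1 elements so that a[i+1] is valid for the last i. The names `addNeighbour` and `addNeighbourHost` and the sizes are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads only from a and writes only to b, so no thread can overwrite
// a value that another thread still needs to read.
__global__ void addNeighbour(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        b[i] = a[i] + a[i + 1];   // a is assumed to hold n + 1 elements
}

// Sequential reference, used to check the GPU result.
void addNeighbourHost(const float *a, float *b, int n)
{
    for (int i = 0; i < n; i++)
        b[i] = a[i] + a[i + 1];
}

int main()
{
    const int n = 1024;
    static float hA[n + 1], hB[n], expected[n];
    for (int i = 0; i <= n; i++)
        hA[i] = (float)i;
    addNeighbourHost(hA, expected, n);

    float *dA = NULL, *dB = NULL;
    if (cudaMalloc(&dA, (n + 1) * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dB, n * sizeof(float)) != cudaSuccess) {
        printf("Device allocation failed\n");
        return 1;
    }
    cudaMemcpy(dA, hA, (n + 1) * sizeof(float), cudaMemcpyHostToDevice);

    // Several warps per block are fine here: the read and write buffers differ.
    addNeighbour<<<(n + 255) / 256, 256>>>(dA, dB, n);
    cudaMemcpy(hB, dB, n * sizeof(float), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (hB[i] != expected[i])
            mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dA);
    cudaFree(dB);
    return mismatches != 0;
}
```

The cost of this approach is the second buffer, which is the extra memory footprint mentioned earlier in the thread.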
Now consider the 32-thread block case:
Within a WARP, the loads all happen synchronously thereby eliminating race conditions. So, if I use only 32 threads inside a block – then I can totally eliminate race conditions INSIDE the block.
As long as each block is interested in a separate buffer, I would totally eliminate double buffering and syncthreads() and live happily with a better performance.
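The single-warp, in-place variant might look like the sketch below. The layout (one separate 33-element segment per block) and all names are assumptions made for illustration. One caveat that postdates this thread: on GPUs with independent thread scheduling (Volta and later), the threads of a warp are not guaranteed to run in lock-step, so the sketch adds an explicit __syncwarp() between the loads and the stores. That call is an addition, not part of the approach as described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARP 32
#define SEG (WARP + 1)   // assumed layout: each block owns 32 elements plus one extra

// Launch with exactly 32 threads per block; each block updates its own segment.
__global__ void addNeighbourInPlace(float *a)
{
    float *seg = a + blockIdx.x * SEG;
    int t = threadIdx.x;
    float sum = seg[t] + seg[t + 1];   // every thread reads before anyone writes
    __syncwarp();                      // added: orders the reads before the writes
    seg[t] = sum;
}

// Sequential reference; ascending order reads seg[t + 1] before it is overwritten.
void addNeighbourInPlaceHost(float *a, int numSegments)
{
    for (int s = 0; s < numSegments; s++) {
        float *seg = a + s * SEG;
        for (int t = 0; t < WARP; t++)
            seg[t] = seg[t] + seg[t + 1];
    }
}

int main()
{
    const int blocks = 8;
    static float h[blocks * SEG], expected[blocks * SEG];
    for (int i = 0; i < blocks * SEG; i++)
        h[i] = expected[i] = (float)i;
    addNeighbourInPlaceHost(expected, blocks);

    float *d = NULL;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) {
        printf("Device allocation failed\n");
        return 1;
    }
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    addNeighbourInPlace<<<blocks, WARP>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < blocks * SEG; i++)
        if (h[i] != expected[i])
            mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d);
    return mismatches != 0;
}
```

No second buffer and no __syncthreads() are needed, which is the trade described above, at the price of restricting every block to a single warp.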