I have a program. I tried two thread block sizes, 192 and 256. With a block size of 192, two thread blocks can run simultaneously on an SM, while with a block size of 256 only one block fits, due to shared memory capacity. The 192 version outperforms the 256 version by over 20%. The result seems reasonable, because there are more threads to hide memory latency. However, adding threads only helps when there were not enough threads to hide the latency in the first place. My program consists of a memory access part and a computation part. When I comment out either part and run the program, the 192 and 256 versions have the same performance (commenting out code does not change the number of thread blocks running simultaneously on an SM). Thus I assume that for both the memory access part and the computation part, there are enough threads and independent instructions to hide the latency. So why is there a performance difference when the two parts run together?
You need to be a bit circumspect about drawing any serious conclusions from your commenting experiment. Unless you are very careful, commenting out code sections can result in a lot of code (sometimes the whole kernel) getting optimized away by the nvopencc dead code removal optimization.
Judging from the assembly generated by decuda, I don't think the compiler optimized anything away.
P.S. A very important thing I forgot to mention: when I set the thread block size to 384, the performance is similar to the 256 version. I think this rules out the possibility that there are idle cycles in the 256 version. But then why are two 192-thread blocks faster than one 384-thread block? Does synchronization make a difference?
Have you had a look at each execution configuration using the profiler? That might help shed some light on the differences between the three cases. From what we have been told, instruction pipeline latency shouldn’t be an issue in any of your three cases, which makes me think it is memory access related. Does the kernel use shared memory?
I find no useful information in the profiler output. All global memory reads go through texture loads, and all writes are coalesced. The kernels use shared memory a lot, but there are no bank conflicts. Instruction counts are all normal; only the execution time differs. I also tried direct global loads instead of texture loads, and the performance gap is similar.
Yes. While one block is waiting for all its threads to catch up at the barrier, the other block can continue running.
A barrier only causes threads to wait when there are long-latency instructions in flight, such as global memory loads. That is the only thing I can think of in my code. I don't use the SFU, and there are enough threads to hide register read-after-write latency. However, when I comment out the computation part, the 192 and 384 versions have the same performance.
If you comment out the computation part, your program becomes bandwidth bound. In that case 256 threads on an SM are enough to get maximum performance. However, your original program has both a memory access part and a computation part, which obviously have different bottlenecks. Because of the computation part, your threads cannot issue memory access instructions as fast as the no-computation kernel can, so you need more threads to reach maximum performance. That would explain why 384 threads perform better than 256.
Does this explanation make sense?
I think your explanation makes some sense. Most programs have both a memory access part and a computation part, so your explanation suggests that, within a kernel, we should try to cluster all memory loads together and all stores together in order to utilize memory bandwidth, am I right? But why is the speed of issuing loads so important if there are enough threads and instructions to hide latency? And if we scatter loads throughout the kernel instead of clustering them, does that increase memory latency?
However, your answer does not explain why, on each SM, two blocks of 192 threads outperform one block of 384 threads; one block of 384 threads performs similarly to one block of 256 threads.