Should I expect a speedup by increasing the number of threads per block?

I have a Titan Black and a 192x192 parallelizable job. My kernel uses two 2D char matrices, each of dimension BLKSZ x BLKSZ.

When BLKSZ=16, I have 12x12 blocks of 256 threads each. Since I have 15 SMXs, I launch the kernel to execute 15 blocks per iteration. It took 10 iterations and 330 seconds to finish.

My program was developed when I had a GTX 470. I heard that the Titan Black supports 32x32 threads per block, so I thought increasing BLKSZ to 32 might speed up my program.

It now finishes in three iterations, but each iteration is 20x slower, so the whole run took 30 minutes, roughly 6x slower overall. What is causing the slowdown? Shouldn't each iteration run at roughly the same speed as before, making the 32x32 implementation about 3x faster?

Thanks in advance for your help.

it is difficult to follow your numbers - how exactly does 192x192 relate to BLKSZ being 16 or 32, and to 12x12 blocks of 256 threads per block?
this makes it difficult to follow your changes

in hypothesizing about the slow-down, check whether you are not now spilling registers when moving from 256 to 1024 threads per block
and note whether your memory accesses are still coalesced after altering BLKSZ from 16 to 32

Hi little_jimmy

Thanks for your reply. The reason I need to split my job into multiple iterations is the memory requirement.

When I run in BLKSZ=16 mode, I need to allocate 1 GB of heap memory for 14 blocks of 256 threads. Since the program was developed back when I was using the GTX 470, I have to run the 192x192 job in ten iterations.

I am porting the code to the Titan Black, and it seems to work as expected when BLKSZ=16. I learned that sm_35 allows 32x32 threads per block, so I tried doubling BLKSZ to see if performance improves. This time I should only need 4.6 GB of heap memory to do the job in three iterations, which should be within the Titan Black's limit. I think my memory estimate is OK, because the program still ran to completion with correct results despite the significant slowdown.
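To clarify what I mean by heap memory, here is a simplified sketch of how the heap is sized before the first launch (this assumes the device-side malloc heap is what is being resized; the helper name and the size are illustrative, not my exact code):

#include <cuda_runtime.h>

// hypothetical helper: reserve ~1 GB of device-side malloc heap (the BLKSZ=16 case);
// this has to happen before the first kernel launch that calls malloc() on the device
cudaError_t init_device_heap()
{
    size_t heapBytes = (size_t)1 << 30;   // ~1 GB, illustrative only
    return cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
}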

Can you elaborate on the terms "spilling" and "coalesced memory accesses"?

Thanks a lot!

i understand that you are porting the code
i follow that you are adjusting the thread block size, and BLKSZ
however, what i do not follow is:

the 192x192 job, and how it relates to 10 iterations of 14x256
the only way i can vaguely make sense of it is if i take 10 (11) x 14 x 256, which is close to 192x192

spilling is when the SM's registers are exhausted and per-thread (local) variables can no longer be held in registers, so they spill to local memory, which resides in device memory and behaves more like global memory
you have increased your block size by a factor of 4 (256 -> 1024); hence, the block's register/ local-memory footprint likely increased by a factor of 4, increasing the chance of spilling
[compile with the ptxas -v flag to note register usage/ spilling statistics]
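a minimal sketch of the kind of code that ends up in local memory (a generic illustration, not your kernel):

__global__ void spill_sketch(const int *idx, float *out)
{
    // a per-thread array that is too large for registers, or that is indexed
    // with values only known at run time, is placed in local (off-chip) memory;
    // ptxas -v then reports a non-zero stack frame and/ or spill loads/ stores
    float scratch[256];
    for (int i = 0; i < 256; ++i)
        scratch[i] = (float)i + threadIdx.x;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = scratch[idx[tid] & 255];   // runtime index forces addressable storage
}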

also, if the BLKSZ-dependent 2D arrays you mentioned are stored in global memory, then altering BLKSZ alters the indexing/ offsets into those arrays, which in turn can affect whether global memory reads coalesce
[profiling should reveal to what degree your memory reads are coalesced]
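a generic illustration of the difference (not your code; it assumes a square width x width char matrix in global memory):

__global__ void coalescing_sketch(const char *in, char *out, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // coalesced: threads with consecutive threadIdx.x read consecutive bytes
    char a = in[row * width + col];

    // uncoalesced: threads with consecutive threadIdx.x read bytes 'width' apart,
    // so a single warp's reads are scattered over many memory segments
    char b = in[col * width + row];

    out[row * width + col] = (char)(a + b);
}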

i almost want to put my money on spilling

Thanks for your reply. I tried --ptxas-options=-v with nvcc and got the following output. Does this mean there is spilling?

ptxas info : Compiling entry function 'kinship2' for 'sm_35'
ptxas info : Function properties for kinship2
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 2048 bytes smem, 380 bytes cmem[0], 12 bytes cmem[2]

How do I run profiling to check whether memory reads are coalesced?

There is no spilling, given the 0 bytes stack frame and 0 bytes of spill loads/stores.

"How do I run profiling to check whether memory reads are coalesced? "

the visual profiler - run your application in the profiler with BLKSZ = 16 and BLKSZ = 32 and note the differences

I ran nvprof with --print-gpu-trace and --benchmark and obtained the two outputs below. As you can see, each iteration was 20x slower in the BLKSZ=32 case. Can you see anything wrong here?

BLKSZ=16

==30238== NVPROF is profiling process 30238, command: ./kinship --benchmark
==30238== Profiling application: ./kinship --benchmark
==30238== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
278.79ms 960ns - - - - - 121B 126.04MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
278.93ms 546.60us - - - - - 4.5929MB 8.4027GB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
279.49ms 245.79us (1631 1 1) (16 1 1) 22 0B 0B - - GeForce GTX TIT 1 7 alfreq [21]
279.74ms 768ns - - - - - 60B 78.125MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
281.93ms 36.4823s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [24]
36.7643s 2.2720us - - - - - 121B 53.257MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
36.7643s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
36.7643s 35.6427s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [28]
72.4070s 2.0160us - - - - - 121B 60.020MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
72.4071s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
72.4071s 37.2189s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [32]
109.626s 10.560us - - - - - 121B 11.458MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
109.626s 896ns - - - - - 60B 66.964MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
109.626s 37.7143s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [36]
147.340s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
147.340s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
147.341s 38.8855s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [40]
186.226s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
186.226s 736ns - - - - - 60B 81.522MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
186.226s 38.8493s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [44]
225.075s 1.7600us - - - - - 121B 68.750MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
225.075s 800ns - - - - - 60B 75.000MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
225.075s 38.6450s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [48]
263.721s 1.7600us - - - - - 121B 68.750MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
263.721s 736ns - - - - - 60B 81.522MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
263.721s 35.9260s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [52]
299.647s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
299.647s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
299.647s 12.2829s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [56]
311.930s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
311.930s 11.009us - - - - - 123.90KB 11.255GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
311.930s 11.232us - - - - - 123.90KB 11.031GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

BLKSZ=32

==29653== NVPROF is profiling process 29653, command: ./kinship --benchmark
==29653== Profiling application: ./kinship --benchmark
==29653== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
217.85ms 1.0880us - - - - - 36B 33.088MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
217.99ms 652.17us - - - - - 5.0135MB 7.6875GB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
218.65ms 142.95us (816 1 1) (32 1 1) 22 0B 0B - - GeForce GTX TIT 1 7 alfreq [21]
218.81ms 800ns - - - - - 60B 75.000MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
220.94ms 790.944s (6 6 1) (32 32 1) 32 2.0480KB 0B - - GeForce GTX TIT 1 7 kinship2 [24]
791.165s 2.0160us - - - - - 36B 17.857MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
791.165s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
791.165s 818.127s (6 6 1) (32 32 1) 32 2.0480KB 0B - - GeForce GTX TIT 1 7 kinship2 [28]
2e+03s 2.0480us - - - - - 36B 17.578MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
2e+03s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
2e+03s 213.155s (6 6 1) (32 32 1) 32 2.0480KB 0B - - GeForce GTX TIT 1 7 kinship2 [32]
2e+03s 1.7280us - - - - - 36B 20.833MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
2e+03s 13.152us - - - - - 147.46KB 11.212GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
2e+03s 12.480us - - - - - 147.46KB 11.815GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

you are using the command line profiler
perhaps use the visual (gui) profiler - the analysis should be richer, and you should be able to delve deeper into your kernel’s memory accesses, etc

refer to the cuda profiler user guide

I am trying the Visual Profiler, but I am getting many “The data needed to calculate xxxxx could not be collected” errors. Does that mean I need to recompile my code with some sort of debug flag?

can you post a detailed error message?

Looks like these

“The data needed to calculate multiprocessor occupancy could not be collected”
“The data needed to calculate global memory load efficiency could not be collected”
“The data needed to calculate global memory store efficiency could not be collected”
“The data needed to calculate shared memory efficiency could not be collected”
“The data needed to calculate warp execution efficiency could not be collected”