Should I expect a speedup by increasing the number of threads per block?

I have a Titan Black and a 192x192 parallelizable job. My kernel uses two 2D char matrices, each of dimension BLKSZ x BLKSZ.

When BLKSZ=16, I have 12x12 blocks of 256 threads each. Since I have 15 SMXs, I launch the kernel to execute 15 blocks per iteration. It took 10 iterations and 330 seconds to finish.

My program was developed when I had a GTX 470. I heard that the Titan Black supports 32x32 threads per block, so I thought increasing BLKSZ to 32 might speed up my program.

It now finishes in three iterations, but each iteration is 20x slower, so the whole run took 30 minutes, roughly 6x slower overall. What is causing the slowdown? Shouldn't each iteration run at roughly the same speed as before, making the 32x32 implementation about 3x faster?

Thanks in advance for your help.

it is difficult to follow your numbers - how exactly does 192x192 relate to BLKSZ being 16 or 32, and to 12x12 blocks of 256 threads per block?
this makes it difficult to follow your changes

in hypothesizing about the slow-down, check whether you are not now spilling registers when moving from 256 to 1024 threads per block
and note whether your memory accesses are still coalesced after altering BLKSZ from 16 to 32

Hi little_jimmy

Thanks for your reply. The reason I need to split my job into multiple iterations is the memory requirement.

When I run in BLKSZ=16 mode, I need to allocate 1 GB of heap memory for 14 blocks of 256 threads. Since the program was developed back when I was using the GTX 470, I have to run the 192x192 job in ten iterations.

I am porting the code to the Titan Black, and it seems to work as expected when BLKSZ=16. I learned that sm_35 allows 32x32 threads per block, so I tried doubling BLKSZ to see if performance improves. This time I should only need 4.6 GB of heap memory to do the job in three iterations, which should be within the Titan Black's limit. I think my memory estimate is OK, because the program still ran to completion with correct results despite the significant slowdown.
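To clarify what I mean by heap memory, here is a simplified sketch of how the heap is sized before the first launch (this assumes the device-side malloc heap is what is being resized; the helper name and the size are illustrative, not my exact code):

#include <cuda_runtime.h>

// hypothetical helper: reserve ~1 GB of device-side malloc heap (the BLKSZ=16 case);
// this has to happen before the first kernel launch that calls malloc() on the device
cudaError_t init_device_heap()
{
    size_t heapBytes = (size_t)1 << 30;   // ~1 GB, illustrative only
    return cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
}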

Can you elaborate on the terms "spilling" and "coalesced memory accesses"?

Thanks a lot!

i understand that you are porting the code
i follow that you are adjusting the thread block size, and BLKSZ
however, what i do not follow is:

the 192x192 job, and how it relates to 10 iterations of 14x256
the only way i can vaguely make sense of it is if i take 10 (11) x 14 x 256, which is close to 192x192

spilling is when the SM's registers are exhausted and per-thread (local) variables can no longer be held in registers, so they spill to local memory, which resides in device memory and behaves more like global memory
you have increased your block size by a factor of 4 (256 -> 1024); hence, the block's register/ local-memory footprint likely increased by a factor of 4, increasing the chance of spilling
[compile with the ptxas -v flag to note register usage/ spilling statistics]
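a minimal sketch of the kind of code that ends up in local memory (a generic illustration, not your kernel):

__global__ void spill_sketch(const int *idx, float *out)
{
    // a per-thread array that is too large for registers, or that is indexed
    // with values only known at run time, is placed in local (off-chip) memory;
    // ptxas -v then reports a non-zero stack frame and/ or spill loads/ stores
    float scratch[256];
    for (int i = 0; i < 256; ++i)
        scratch[i] = (float)i + threadIdx.x;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = scratch[idx[tid] & 255];   // runtime index forces addressable storage
}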

also, if the BLKSZ-dependent 2D arrays you mentioned are stored in global memory, then altering BLKSZ alters the indexing/ offsets into those arrays, which in turn can affect whether global memory reads coalesce
[profiling should reveal to what degree your memory reads are coalesced]
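a generic illustration of the difference (not your code; it assumes a square width x width char matrix in global memory):

__global__ void coalescing_sketch(const char *in, char *out, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // coalesced: threads with consecutive threadIdx.x read consecutive bytes
    char a = in[row * width + col];

    // uncoalesced: threads with consecutive threadIdx.x read bytes 'width' apart,
    // so a single warp's reads are scattered over many memory segments
    char b = in[col * width + row];

    out[row * width + col] = (char)(a + b);
}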

i almost want to put my money on spilling

Thanks for your reply. I tried --ptxas-options=-v with nvcc and got the following output. Does this mean there is spilling?

ptxas info : Compiling entry function 'kinship2' for 'sm_35'
ptxas info : Function properties for kinship2
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 2048 bytes smem, 380 bytes cmem[0], 12 bytes cmem[2]

How do I run profiling to check whether memory reads are coalesced?

There is no spilling, given the 0 bytes stack frame and 0 bytes of spill loads/stores.

"How do I run profiling to check whether memory reads are coalesced? "

the visual profiler - run your application in the profiler with BLKSZ = 16 and BLKSZ = 32 and note the differences

I ran nvprof with --print-gpu-trace and --benchmark and obtained the two outputs below. As you can see, each iteration was 20x slower in the BLKSZ=32 case. Can you see anything wrong here?

BLKSZ=16

==30238== NVPROF is profiling process 30238, command: ./kinship --benchmark
==30238== Profiling application: ./kinship --benchmark
==30238== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
278.79ms 960ns - - - - - 121B 126.04MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
278.93ms 546.60us - - - - - 4.5929MB 8.4027GB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
279.49ms 245.79us (1631 1 1) (16 1 1) 22 0B 0B - - GeForce GTX TIT 1 7 alfreq [21]
279.74ms 768ns - - - - - 60B 78.125MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
281.93ms 36.4823s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [24]
36.7643s 2.2720us - - - - - 121B 53.257MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
36.7643s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
36.7643s 35.6427s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [28]
72.4070s 2.0160us - - - - - 121B 60.020MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
72.4071s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
72.4071s 37.2189s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [32]
109.626s 10.560us - - - - - 121B 11.458MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
109.626s 896ns - - - - - 60B 66.964MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
109.626s 37.7143s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [36]
147.340s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
147.340s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
147.341s 38.8855s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [40]
186.226s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
186.226s 736ns - - - - - 60B 81.522MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
186.226s 38.8493s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [44]
225.075s 1.7600us - - - - - 121B 68.750MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
225.075s 800ns - - - - - 60B 75.000MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
225.075s 38.6450s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [48]
263.721s 1.7600us - - - - - 121B 68.750MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
263.721s 736ns - - - - - 60B 81.522MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
263.721s 35.9260s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [52]
299.647s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
299.647s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
299.647s 12.2829s (11 11 1) (16 16 1) 32 512B 0B - - GeForce GTX TIT 1 7 kinship2 [56]
311.930s 1.7280us - - - - - 121B 70.023MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
311.930s 11.009us - - - - - 123.90KB 11.255GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
311.930s 11.232us - - - - - 123.90KB 11.031GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

BLKSZ=32

==29653== NVPROF is profiling process 29653, command: ./kinship --benchmark
==29653== Profiling application: ./kinship --benchmark
==29653== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
217.85ms 1.0880us - - - - - 36B 33.088MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
217.99ms 652.17us - - - - - 5.0135MB 7.6875GB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
218.65ms 142.95us (816 1 1) (32 1 1) 22 0B 0B - - GeForce GTX TIT 1 7 alfreq [21]
218.81ms 800ns - - - - - 60B 75.000MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
220.94ms 790.944s (6 6 1) (32 32 1) 32 2.0480KB 0B - - GeForce GTX TIT 1 7 kinship2 [24]
791.165s 2.0160us - - - - - 36B 17.857MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
791.165s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
791.165s 818.127s (6 6 1) (32 32 1) 32 2.0480KB 0B - - GeForce GTX TIT 1 7 kinship2 [28]
2e+03s 2.0480us - - - - - 36B 17.578MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
2e+03s 704ns - - - - - 60B 85.227MB/s GeForce GTX TIT 1 7 [CUDA memcpy HtoD]
2e+03s 213.155s (6 6 1) (32 32 1) 32 2.0480KB 0B - - GeForce GTX TIT 1 7 kinship2 [32]
2e+03s 1.7280us - - - - - 36B 20.833MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
2e+03s 13.152us - - - - - 147.46KB 11.212GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
2e+03s 12.480us - - - - - 147.46KB 11.815GB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

you are using the command line profiler
perhaps use the visual (gui) profiler - the analysis should be richer, and you should be able to delve deeper into your kernel’s memory accesses, etc

refer to the cuda profiler user guide

I am trying the Visual Profiler, but I am getting many “The data needed to calculate xxxxx could not be collected” errors. Does that mean I need to recompile my code with some sort of debug flag?

can you post a detailed error message?

Looks like these

“The data needed to calculate multiprocessor occupancy could not be collected”
“The data needed to calculate global memory load efficiency could not be collected”
“The data needed to calculate global memory store efficiency could not be collected”
“The data needed to calculate shared memory efficiency could not be collected”
“The data needed to calculate warp execution efficiency could not be collected”