I have a Titan Black and a 192x192 parallelizable job. My kernel has two 2D char matrices, each of dimension BLKSZ*BLKSZ.
When BLKSZ=16, I have 12x12 blocks of 256 threads per block. Since I have 15 SMXes, I launched the kernel to execute 15 blocks per iteration. It took 10 iterations and 330 seconds to finish.
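In pseudo-form, the launch looks roughly like this (kernel name, arguments and the per-iteration offset handling are simplified placeholders):

    #define BLKSZ 16                               // each thread block is BLKSZ x BLKSZ threads
    dim3 block(BLKSZ, BLKSZ);                      // 16x16 = 256 threads per block
    for (int iter = 0; iter < 10; ++iter) {        // 12x12 = 144 blocks total, 15 at a time
        myKernel<<<15, block>>>(d_data, iter);     // placeholder kernel name and arguments
        cudaDeviceSynchronize();
    }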
My program was developed when I had a GTX 470. I heard that the Titan Black supports 32x32 threads per block, so I thought increasing BLKSZ to 32 might speed up my program.
It does now finish in three iterations. However, each iteration is 20x slower, so the whole run took 30 minutes, roughly 6x slower overall. What is causing the slowdown? Shouldn't each iteration run at roughly the same speed as before, making the 32x32 implementation about 3x faster?
it is difficult to follow your numbers - how exactly does 192x192 relate to BLKSZ being 16 or 32, and to 12x12 blocks of 256 threads per block?
this makes it difficult to follow your changes
in hypothesizing about the slow-down, check whether you are not now spilling registers after altering the block dimension from 256 to 1024 threads
and note whether your memory accesses are still coalesced after altering BLKSZ from 16 to 32
Thanks for your reply. The reason I need to split my job into multiple iterations is the memory requirement.
When I run in BLKSZ=16 mode, I need to allocate 1GB of heap memory for 14 blocks of 256 threads. Since the program was developed back when I was using the GTX 470, I had to run the 192x192 job in ten iterations.
Now that I am porting the code to the Titan Black, it works as expected with BLKSZ=16. I learned that sm_35 allows 32x32 threads per block, so I tried doubling BLKSZ to see whether performance improves. This time I should only need 4.6GB of heap memory to do the job in three iterations, which should be within the Titan Black's limit. I think my memory estimate is OK because the program still runs to completion with correct results despite the significant slowdown.
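For reference, the heap sizing looks roughly like this (this assumes the 1GB / 4.6GB figures refer to the device-side malloc heap; the exact values and error handling are simplified):

    // assumption: "heap" here means the in-kernel malloc heap
    size_t heapBytes = (size_t)(4.6 * 1024) * 1024 * 1024;    // ~4.6 GB for BLKSZ=32
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);   // set before any kernel that calls malloc()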
Can you elaborate on the terms "spilling" and "coalesced memory accesses"?
Thanks a lot!
i understand that you are porting the code
i follow that you are adjusting the thread block size, and BLKSZ
however, what i do not follow is:
the 192x192 job, and how it relates to 10 iterations of 14x256
the only way i can vaguely make sense of it is if i take 10 (11) x 14 x 256, which comes close to 192x192 = 36864
spilling is when per-thread (local) variables can no longer be held in SM registers, because the registers are exhausted; they then "spill" to local memory, which physically resides in global memory and behaves correspondingly slower
you have increased your block size by a factor of 4 (256 -> 1024 threads), and if those BLKSZ*BLKSZ matrices are per-thread arrays, they have likewise grown by a factor of 4 (16x16 -> 32x32 chars); hence local memory use has likely increased by a factor of 4, increasing the chance of spilling
[compile with the ptxas -v flag (nvcc -Xptxas -v) to note register usage/ spilling statistics]
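for example (the source file name is a placeholder):

    nvcc -arch=sm_35 -Xptxas -v -c mykernel.cu

ptxas should then report, per kernel, the register count and any spill store/ spill load bytes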
also, when you alter BLKSZ: if the 2D arrays you mentioned, whose size depends on BLKSZ, are stored in global memory, then altering BLKSZ alters the indexing/ offsets into those arrays in global memory, which in turn can affect how well global memory reads coalesce
[profiling should reveal to what degree your memory reads are coalesced]
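as a generic illustration (not your code), the difference between a coalesced and a strided read of a row-major BLKSZ x BLKSZ char tile in global memory:

    __global__ void readTile(const char *g, char *out, int blksz)
    {
        int tx = threadIdx.x;                  // 0 .. blksz-1
        int ty = threadIdx.y;                  // 0 .. blksz-1

        // coalesced: threads with consecutive tx read consecutive bytes of a row
        char a = g[ty * blksz + tx];

        // strided: threads with consecutive tx read bytes blksz apart (column access);
        // the stride grows with blksz, so going 16 -> 32 spreads a warp's reads over
        // more memory segments
        char b = g[tx * blksz + ty];

        out[ty * blksz + tx] = a + b;          // keep both reads live
    }

(launched with a blksz x blksz thread block)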
I ran nvprof with --print-gpu-trace and --benchmark and obtained two outputs. As you can see, each iteration was 20x slower in the BLKSZ=32 case. Can you see anything wrong here?
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
you are using the command line profiler
perhaps use the visual (gui) profiler - the analysis should be richer, and you should be able to delve deeper into your kernel’s memory accesses, etc
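[alternatively, from the command line, nvprof can report the relevant metrics directly, e.g. something like the following, where the application name is a placeholder:

    nvprof --metrics gld_efficiency,gst_efficiency ./yourapp

which gives the global memory load/ store efficiency, i.e. how well your accesses coalesce]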
I am trying the Visual Profiler, but I am getting many "The data needed to calculate xxxxx could not be collected" errors. Does that mean I need to recompile my code with some sort of debug flag?
“The data needed to calculate multiprocessor occupancy could not be collected”
“The data needed to calculate global memory load efficiency could not be collected”
“The data needed to calculate global memory store efficiency could not be collected”
“The data needed to calculate shared memory efficiency could not be collected”
“The data needed to calculate warp execution efficiency could not be collected”