Is modifying ptc code and compiling it possible?

[EDIT] It is about PTX (not PTC) code [/EDIT]

Here is my problem:

I have a fairly complex kernel that uses too many registers (38), so the parallelism is not very good (the occupancy calculator gives 1 thread block per MP and 17% occupancy).

I reduced the register use in my code as far as I could, but it isn’t sufficient. Looking at the PTX, it seems that too many distinct registers are used where a few could be reused for operations that happen sequentially.

So I need to modify the PTX file and recompile from it.

Is it possible?

If so, could you tell me how to recompile starting from this compilation phase, and with which options?

Thanks a lot.

Run nvcc -v on your original code and you will see the compilers that are called, along with their command-line options.

Peter

Do note that PTX files always seem to use far too many registers. If you count them, there are likely many more than 38. PTX registers are virtual, and later stages of the compilation perform the actual register allocation and reduce the count.

To check the true register use, look at the contents of the .cubin file.

What is the size of your threadblock? Even with 38 registers, you should be able to get 25% occupancy. So, your problem may be the shared memory usage or threadblock size.

Run your program with the CUDA profiler enabled, then check the log file for occupancy. Let us know what it reports.

Paulius
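To see where figures such as 17% and 25% can come from, here is a minimal sketch of register-limited occupancy for a few block sizes. It assumes G80-class limits per multiprocessor (8192 registers, 768 resident threads, at most 8 resident blocks), which are not stated in the thread but are consistent with the numbers quoted in it. It also ignores the register allocation granularity that the real occupancy calculator applies, and the function name is hypothetical.

```python
# Assumed per-multiprocessor limits for a G80-class GPU (not given in the thread).
REGS_PER_MP = 8192
MAX_THREADS_PER_MP = 768
MAX_BLOCKS_PER_MP = 8


def register_limited_occupancy(threads_per_block, regs_per_thread):
    """Occupancy when only registers, threads and the block cap are limiting.

    Simplified model: no register allocation granularity, no shared memory.
    """
    by_regs = REGS_PER_MP // (threads_per_block * regs_per_thread)
    by_threads = MAX_THREADS_PER_MP // threads_per_block
    blocks = min(by_regs, by_threads, MAX_BLOCKS_PER_MP)
    return blocks * threads_per_block / MAX_THREADS_PER_MP


# Sweep a few block sizes for a kernel that needs 38 registers per thread.
for threads in (64, 96, 128, 192, 256):
    occ = register_limited_occupancy(threads, regs_per_thread=38)
    print(f"{threads:3d} threads/block -> {occ:.1%} occupancy")
```

Under this simplified model, 128 threads per block with 38 registers gives about 17%, 192 threads per block gives 25%, and 256 threads per block does not fit on a multiprocessor at all.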

I have 128 threads per block and 24 blocks per grid. I know 24 is not a multiple of 16, but with my indexing scheme I can’t go up to 32 blocks, and the final result with 24 blocks isn’t bad.

I use about 1636 bytes of shared memory and 24 registers (according to the .cubin) per kernel.

With these parameters, the occupancy calculator gives 33% occupancy (as does the profiler).

In the occupancy calculator, I’m still limited by the number of registers:

Maximum thread blocks per multiprocessor:

  • Limited by Max Warps / Multiprocessor: 6
  • Limited by Registers / Multiprocessor: 2
  • Limited by Shared Memory / Multiprocessor: 8

Here is the profiler result for the kernel:

method=[ Reconstruct_Kernel ] gputime=[ 49660.609 ] cputime=[ 49915.000 ] occupancy=[ 0.333 ]
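The three limits and the 0.333 occupancy can be reproduced with a few integer divisions. The sketch below is a simplified model, not the occupancy calculator itself: the per-multiprocessor limits (8192 registers, 16 KB of shared memory, 24 warps, 8 blocks) are assumed G80-class values that the thread does not state, allocation granularity is ignored, and the function names are hypothetical.

```python
# Assumed G80-class limits per multiprocessor (not stated in the thread).
REGS_PER_MP = 8192
SMEM_PER_MP = 16 * 1024  # bytes
MAX_WARPS_PER_MP = 24
MAX_BLOCKS_PER_MP = 8
WARP_SIZE = 32


def warps_in_block(threads_per_block):
    """Number of warps a block occupies (ceiling division)."""
    return -(-threads_per_block // WARP_SIZE)


def block_limits(threads_per_block, regs_per_thread, smem_per_block):
    """Resident-block limits by warps, by registers and by shared memory."""
    by_warps = MAX_WARPS_PER_MP // warps_in_block(threads_per_block)
    by_regs = REGS_PER_MP // (threads_per_block * regs_per_thread)
    if smem_per_block > 0:
        by_smem = min(SMEM_PER_MP // smem_per_block, MAX_BLOCKS_PER_MP)
    else:
        by_smem = MAX_BLOCKS_PER_MP  # no shared memory: only the block cap applies
    return by_warps, by_regs, by_smem


def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Resident warps divided by the maximum number of warps."""
    limits = block_limits(threads_per_block, regs_per_thread, smem_per_block)
    blocks = min(min(limits), MAX_BLOCKS_PER_MP)
    return blocks * warps_in_block(threads_per_block) / MAX_WARPS_PER_MP


# Values quoted in the post: 128 threads, 24 registers, 1636 bytes of shared memory.
print(block_limits(128, 24, 1636))
print(round(occupancy(128, 24, 1636), 3))
```

With the values from the post, the smallest of the three limits is the register limit, which is why the calculator reports registers as the bottleneck.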

I’ll try to tune the PTX code to optimise the register count; I don’t think the later compilation phases can tell in every case whether I need a new register or not.

I tried different simulations with the occupancy calculator, and according to it I must use no more than 10 registers to get 100% occupancy. At that point the results are:

  • Limited by Max Warps / Multiprocessor: 6
  • Limited by Registers / Multiprocessor: 6
  • Limited by Shared Memory / Multiprocessor: 8
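The figure of 10 registers follows from dividing the register file evenly among the threads needed for full occupancy. Here is a minimal sketch of that arithmetic, assuming 8192 registers and 768 resident threads per multiprocessor (G80-class values not stated in the thread), a block size that divides 768 evenly, and no allocation granularity. The function name is hypothetical.

```python
# Assumed G80-class limits per multiprocessor (not stated in the thread).
REGS_PER_MP = 8192
MAX_THREADS_PER_MP = 768


def max_regs_for_full_occupancy(threads_per_block):
    """Largest per-thread register count that still allows full occupancy.

    Assumes threads_per_block divides MAX_THREADS_PER_MP evenly.
    """
    blocks_needed = MAX_THREADS_PER_MP // threads_per_block
    regs_per_block = REGS_PER_MP // blocks_needed
    return regs_per_block // threads_per_block


print(max_regs_for_full_occupancy(128))  # 128 threads per block, as in the post
```

For 128 threads per block, six blocks must be resident at once, so each block may use at most 8192 // 6 = 1365 registers, which leaves 10 per thread.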

Does that mean I will really need 6 * 16 (number of MPs) = 96 blocks to reach full occupancy?

Based on register count alone, you would be able to achieve a little over 40% occupancy. You’re right, you don’t have enough threadblocks in the grid to get there.

Yes, with 128 threads/block, you need 96 blocks (or a multiple of 96) to achieve 100% occupancy. Here are the configurations that give you 100%:

  • 2 blocks x 384 threads
  • 3 blocks x 256 threads
  • 4 blocks x 192 threads
  • 6 blocks x 128 threads
  • 8 blocks x 96 threads

Paulius
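The list of configurations above can be enumerated mechanically. This is a minimal sketch under assumed G80-class limits (768 resident threads and 8 resident blocks per multiprocessor, at most 512 threads per block), none of which are stated in the thread. It only considers block sizes that are a multiple of the warp size, and it takes the 16 multiprocessors mentioned in the question. The function name is hypothetical.

```python
# Assumed G80-class limits (not stated in the thread).
MAX_THREADS_PER_MP = 768
MAX_BLOCKS_PER_MP = 8
MAX_THREADS_PER_BLOCK = 512
WARP_SIZE = 32
NUM_MPS = 16  # number of multiprocessors, as mentioned in the question


def full_occupancy_configs():
    """Return (blocks per MP, threads per block) pairs that fill an MP exactly."""
    configs = []
    for threads in range(WARP_SIZE, MAX_THREADS_PER_BLOCK + 1, WARP_SIZE):
        blocks, remainder = divmod(MAX_THREADS_PER_MP, threads)
        if remainder == 0 and blocks <= MAX_BLOCKS_PER_MP:
            configs.append((blocks, threads))
    return sorted(configs)


for blocks, threads in full_occupancy_configs():
    grid = blocks * NUM_MPS  # smallest grid that fills every multiprocessor
    print(f"{blocks} blocks x {threads} threads (grid of at least {grid} blocks)")
```

Each configuration still has to fit within the register and shared memory budgets, which this sketch does not check.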