CUDA profiler results of the same program on GTX 295 and GTX 470: registers per thread, local load, local store

Since my application runs slower on the GTX 470, I decided to run it in the profiler. The binary was compiled months ago with CUDA 2.1 (or 2.2). However, the counters shown in the profiler differ between the two cards, and the values changed as well.

GTX 470:
method gputime cputime occupancy gridSizeX gridSizeY blockSizeX blockSizeY blockSizeZ dynSmemPerBlock staSmemPerBlock registerPerThread streamID memTransferSize memtransferhostmemtype local_load local_store gld_request gst_request shared_load shared_store branch warps_launched active_cycles sm_cta_launched l1_global_load_hit divergent_branch l1_global_load_miss inst_issued inst_executed threads_launched active_warps

Kernel 3821.92 3834.11 0.667 127 1 128 1 1 0 0 21 0 0 55476 55476 55476 55476 55476 55476 19512 55476 55476 9 16584 0 92442 1786040 1707588 1152 46693123

GTX 295:
method gputime cputime occupancy gridSizeX gridSizeY blockSizeX blockSizeY blockSizeZ dynSmemPerBlock staSmemPerBlock registerPerThread streamID memTransferSize memtransferhostmemtype branch divergent_branch instructions warp_serialize cta_launched local_load local_store gld_32b gld_64b gld_128b gst_32b gst_64b gst_128b tex_cache_hit tex_cache_miss

Kernel 2798.69 2813.74 1 127 1 128 1 1 0 44 15 0 0 33168 0 784324 0 13 624 1248 72192 86016 82944 24064 1207584 27648 0 0

So the GTX 295 is about 30% faster. It uses only 15 registers per thread, but it also uses 44 bytes of static shared memory per block, whereas the GTX 470 uses 21 registers and 0 bytes of static shared memory per block. The GTX 470 profiler report doesn't have those gld and gst counters, which I believe is due to the L1 cache (but what about data loaded from device memory into the L2 cache? Is there anywhere that shows how much data gets loaded into the cache?). Our code has good coalesced access (though not necessarily aligned), as you can see from the gld and gst fields for the GTX 295. However, that data is used only once and the access pattern is scattered, so the L1 cache doesn't help us, as shown in the GTX 470 result where we have 92442 L1 misses but only 16584 L1 hits (probably at the sacrifice of loading extra data and underusing the bandwidth).
What concerns me is the 55476 local loads and local stores on the GTX 470 vs. only 624 local loads and 1248 local stores on the GTX 295. I suspect this is what makes our program slower. So I am wondering: what makes the local memory loads and stores increase? Note that the registers per thread also increased; what causes that? Remember we didn't recompile the code; it is still the binary we built with CUDA 2.1. Why do we get different register usage per thread when running on different cards?
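(For reference, one way to see what the compiler assigns per architecture, instead of inferring it from the profiler, is to rebuild with verbose ptxas output for each target. A rough sketch, assuming the kernel source sits in a hypothetical kernel.cu and a toolkit recent enough to know sm_20:

    nvcc -c kernel.cu -arch=sm_13 --ptxas-options=-v
    nvcc -c kernel.cu -arch=sm_20 --ptxas-options=-v

ptxas then prints the registers, shared memory, and local memory (lmem) used by each kernel for that target; any non-zero lmem means the compiler is spilling to local memory on that architecture.)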

Have you considered using a more recent version of the SDK? Could you rewrite the report with one counter per line? Btw, do you use the special function units? And what is a local store?

Sorry for the messy format of the profiler output.

GTX 470:

method: Kernel
gputime: 3821.92
cputime: 3834.11
occupancy: 0.667
gridSizeX: 127
gridSizeY: 1
blockSizeX: 128
blockSizeY: 1
blockSizeZ: 1
dynSmemPerBlock: 0
staSmemPerBlock: 0
registerPerThread: 21
streamID: 0
memTransferSize: (blank)
memtransferhostmemtype: 0
local_load: 55476
local_store: 55476
gld_request: 55476
gst_request: 55476
shared_load: 55476
shared_store: 55476
branch: 19512
warps_launched: 55476
active_cycles: 55476
sm_cta_launched: 9
l1_global_load_hit: 16584
divergent_branch: 0
l1_global_load_miss: 92442
inst_issued: 1786040
inst_executed: 1707588
threads_launched: 1152
active_warps: 46693123

GTX 295:

method: Kernel
gputime: 2798.69
cputime: 2813.74
occupancy: 1
gridSizeX: 127
gridSizeY: 1
blockSizeX: 128
blockSizeY: 1
blockSizeZ: 1
dynSmemPerBlock: 0
staSmemPerBlock: 44
registerPerThread: 15
streamID: 0
memTransferSize: (blank)
memtransferhostmemtype: 0
branch: 33168
divergent_branch: 0
instructions: 784324
warp_serialize: 0
cta_launched: 13
local_load: 624
local_store: 1248
gld_32b: 72192
gld_64b: 86016
gld_128b: 82944
gst_32b: 24064
gst_64b: 1207584
gst_128b: 27648
tex_cache_hit: 0
tex_cache_miss: 0

Hope this is better.

We do use a few exponential calculations. I have no idea what Local_load and Local_store in the CUDA profiler mean. Do they mean a local variable is stored in memory instead of being kept in a register? We will compile with toolkit 3.0; we need to get Visual Studio updated first.
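(If it helps: local memory is per-thread storage that physically lives in device memory, cached in L1 on Fermi. The compiler falls back to it when it runs out of registers (spilling) or when a per-thread array is indexed with a value it cannot resolve at compile time. A minimal illustrative kernel, not taken from the actual code:

    __global__ void local_memory_example(const float *in, float *out, int n)
    {
        // A per-thread array that ends up being indexed dynamically is
        // typically placed in local memory, which shows up as the
        // local_load / local_store counters in the profiler.
        float scratch[32];

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        for (int i = 0; i < 32; ++i)
            scratch[i] = in[tid] * i;      // local stores

        int j = ((int)in[tid]) & 31;       // index only known at run time
        out[tid] = scratch[j];             // local load
    }
)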

It will be interesting to compare. You could compile with the fast-math switch to get rid of the exp functions and check whether they are the bottleneck and are using local memory.
Looks like you have the same number of threads on the GTX 295 and the GTX 470. Do you use a lot of divisions and square roots?
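For reference, the switch and the corresponding intrinsics look roughly like this (a sketch; __expf and __fdividef are the standard CUDA fast-math intrinsics, the function body is made up):

    nvcc -c kernel.cu --use_fast_math

    // Or, selectively, only where the reduced accuracy is acceptable:
    __device__ float fast_math_example(float x, float a, float b)
    {
        float e = __expf(x);          // fast exponential, evaluated on the SFUs
        float d = __fdividef(a, b);   // fast single-precision division
        return e + d;
    }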

No, that doesn't quite work. I only do a few exp() calls and divisions. After using the fast-math switch, I can see the instruction count decreased and the kernel runs a little faster, but only by 1 or 2 percent at most.

After removing portions of code inside the kernel to diagnose it, we finally discovered that it is because some calculations were performed in double precision. We declared every variable as float, but there are some constants that we hard-coded, like 1.0, which seem to be compiled as double-precision calculations. Interestingly, this doesn't impact performance on the GTX 295. After we changed all numeric literals to the 1.0f form, it calculates much faster. Now the GTX 470 is about 15% faster than a single GTX 295 GPU.
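For anyone who hits the same thing: an unsuffixed constant like 1.0 is a double in C, so on hardware and toolchains that support double precision the whole expression gets promoted, while targets without double support simply demote it back to float (which is why the GTX 295 build was unaffected). A tiny illustration with made-up names:

    __device__ float literal_example(float x)
    {
        float a = x * 1.0;    // 1.0 is a double: x is promoted, the multiply is
                              // done in double precision, then converted back to float
        float b = x * 1.0f;   // 1.0f is a float: stays single precision throughout
        return a + b;
    }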

Now I am wondering: is there any command-line switch to force all floating-point calculations to be single precision? I know that -arch=sm_11 will do this, but I worry it will also turn off other capabilities that are desirable.

Btw, this float-constant issue is mentioned in the programming guide. Ah, you may also try turning on flush-to-zero mode, since it was the default behavior on GT200; maybe that helps too.
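If I remember correctly, the Fermi-era toolkits expose flush-to-zero and the precision of single-precision division and square root as separate nvcc switches (all three are implied by --use_fast_math, and none of them demote doubles to floats). Roughly, for a hypothetical kernel.cu:

    nvcc -c kernel.cu -ftz=true           # flush denormals to zero, as GT200 always did
    nvcc -c kernel.cu -prec-div=false     # faster, less accurate single-precision division
    nvcc -c kernel.cu -prec-sqrt=false    # faster, less accurate single-precision square root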