Since my application runs slower on the GTX 470, I thought about running it in the profiler. The binary was compiled months ago with CUDA 2.1 (or 2.2). However, the items shown in the profiler have changed, and the values have changed as well.
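(For reference, the counters below come from the command-line profiler, which is driven by environment variables; the log path, binary name, and counter selection here are just an illustration of how we ran it:)

```shell
# Enable the legacy CUDA command-line profiler and record to CSV.
export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_LOG=profile_%d.csv
export CUDA_PROFILE_CONFIG=profile.cfg

# profile.cfg lists the signals to collect, one per line, e.g.:
#   local_load
#   local_store
#   gld_request
./myapp   # placeholder for our actual binary
```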
GTX 470:
method: Kernel
gputime: 3821.92
cputime: 3834.11
occupancy: 0.667
gridSizeX: 127
gridSizeY: 1
blockSizeX: 128
blockSizeY: 1
blockSizeZ: 1
dynSmemPerBlock: 0
staSmemPerBlock: 0
registerPerThread: 21
streamID: 0
memTransferSize: 0
memtransferhostmemtype: (not reported)
local_load: 55476
local_store: 55476
gld_request: 55476
gst_request: 55476
shared_load: 55476
shared_store: 55476
branch: 19512
warps_launched: 55476
active_cycles: 55476
sm_cta_launched: 9
l1_global_load_hit: 16584
divergent_branch: 0
l1_global_load_miss: 92442
inst_issued: 1786040
inst_executed: 1707588
threads_launched: 1152
active_warps: 46693123
GTX 295:
method: Kernel
gputime: 2798.69
cputime: 2813.74
occupancy: 1
gridSizeX: 127
gridSizeY: 1
blockSizeX: 128
blockSizeY: 1
blockSizeZ: 1
dynSmemPerBlock: 0
staSmemPerBlock: 44
registerPerThread: 15
streamID: 0
memTransferSize: 0
memtransferhostmemtype: (not reported)
branch: 33168
divergent_branch: 0
instructions: 784324
warp_serialize: 0
cta_launched: 13
local_load: 624
local_store: 1248
gld_32b: 72192
gld_64b: 86016
gld_128b: 82944
gst_32b: 24064
gst_64b: 1207584
gst_128b: 27648
tex_cache_hit: 0
tex_cache_miss: 0
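The headline numbers from the two runs can be sanity-checked with a trivial calculation, using the gputime and L1 figures reported above:

```python
# Compare the two profiler runs using figures from the tables above.
gputime_470 = 3821.92   # gputime on GTX 470 (microseconds)
gputime_295 = 2798.69   # gputime on GTX 295 (microseconds)

# The GTX 470 takes roughly 1.37x as long, i.e. the GTX 295 is
# loosely "about 30%" faster.
ratio = gputime_470 / gputime_295
print(f"GTX 470 / GTX 295 gputime ratio: {ratio:.2f}")

# L1 behaviour on the GTX 470: global-load hits vs. misses.
l1_hit, l1_miss = 16584, 92442
hit_rate = l1_hit / (l1_hit + l1_miss)
print(f"GTX 470 L1 global-load hit rate: {hit_rate:.1%}")
```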
So the GTX 295 is about 30% faster. It uses only 15 registers per thread, but also 44 bytes of static shared memory per block, whereas the GTX 470 uses 21 registers and 0 bytes of static shared memory per block. The GTX 470 profiler report doesn't have those gld and gst counters, which I believe is due to the L1 cache. (But what about loading data from device memory into the L2 cache? Is there anywhere to see how much data gets loaded into the cache?) Our code has well-coalesced access (though not necessarily aligned), as you can see from the gld and gst fields for the GTX 295. However, that data is used only once and the access pattern is scattered, so the L1 cache doesn't help us, as shown in the GTX 470 result: 92442 L1 misses but only 16584 L1 hits (probably at the cost of fetching extra data and underusing the bandwidth).
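If the L1 really is just wasting bandwidth on single-use, scattered loads, one experiment (assuming we can rebuild with a Fermi-aware toolkit; the file names are placeholders) is to bypass L1 for global loads with the ptxas -dlcm option:

```shell
# Cache global loads in L2 only (-dlcm=cg): on Fermi, L1 misses then
# fetch 32-byte segments instead of full 128-byte L1 lines, which can
# help scattered, use-once access patterns.
nvcc -arch=sm_20 -Xptxas -dlcm=cg -o myapp myapp.cu
```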
What concerns me is the 55476 local loads and 55476 local stores on the GTX 470 vs. only 624 local loads and 1248 local stores on the GTX 295. I suspect this is what makes our program slower. So I am wondering: what makes the local memory loads and stores increase? Note that the registers per thread also increased; what causes that? Remember that we didn't recompile the code; it is still the binary we built months ago with CUDA 2.1. Why do we get different register usage per thread on different cards?
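To see where the extra registers and the local-memory traffic (register spills) come from, one thing we could try, again assuming a recompile (kernel.cu is a placeholder for our source file), is asking ptxas to report per-kernel resource usage, and optionally capping registers to mimic the GTX 295 build:

```shell
# Print per-kernel register, shared, constant, and local (spill)
# memory usage for the Fermi target.
nvcc -arch=sm_20 --ptxas-options=-v -c kernel.cu

# Cap the register budget at the 15 the GTX 295 run reported, to see
# whether spilling (local loads/stores) shrinks or grows.
nvcc -arch=sm_20 -maxrregcount=15 --ptxas-options=-v -c kernel.cu
```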