Hi,
When compiled with CUDA 2.2, my code runs at half the speed it does when compiled with CUDA 2.1. Here are excerpts from the profiler logs.
CUDA 2.1:
timestamp method gputime cputime occupancy gridSizeX gridSizeY blockSizeX blockSizeY blockSizeZ dynSmemPerBlock staSmemPerBlock registerPerThread streamID memTransferSize memTransferDir gld_incoherent gld_coherent gst_incoherent gst_coherent local_load local_store branch divergent_branch instructions warp_serialize cta_launched
433463 _Z10cuda_4pipejjjjjjjjPhS_ 74317.1 74348 0.25 65535 1 64 1 1 0 64 37 4 0 0 0 0 0 0 163802 0 24717049 0 10922
582794 memcopy 4.9 242 1 1
583662 _Z10cuda_4pipejjjjjjjjPhS_ 74302.2 74316 0.25 65535 1 64 1 1 0 64 37 4 0 0 0 0 0 0 163858 0 24718589 0 10922
732528 memcopy 3.81 12 1 1
CUDA 2.2:
timestamp method gputime cputime occupancy gridSizeX gridSizeY blockSizeX blockSizeY blockSizeZ dynSmemPerBlock staSmemPerBlock registerPerThread streamID memTransferSize memTransferDir gld_incoherent gld_coherent gst_incoherent gst_coherent local_load local_store branch divergent_branch instructions warp_serialize cta_launched
455903 _Z10cuda_4pipejjjjjjjjPhS_ 147923 148007 0.25 65535 1 64 1 1 0 64 37 0 0 0 0 0 0 0 316738 0 49356931 0 10922
623450 _Z10cuda_4pipejjjjjjjjPhS_ 147917 147995 0.25 65535 1 64 1 1 0 64 37 0 0 0 0 0 0 0 316738 0 49353107 0 10922
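In particular, note that the number of instructions executed has doubled (about 24.7 million with CUDA 2.1 versus about 49.4 million with CUDA 2.2), even though the grid size, block size, and registers per thread are identical. Any suggestions on how I can go about determining why?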
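To put numbers on the difference, here is a quick Python sketch that computes the CUDA 2.2 / CUDA 2.1 ratio for a few counters. The values are copied by hand from the first kernel row of each log above; the dictionary and variable names are only for illustration.

```python
# Counter values copied by hand from the first kernel row of each log above.
cuda_2_1 = {"gputime": 74317.1, "branch": 163802, "instructions": 24717049}
cuda_2_2 = {"gputime": 147923.0, "branch": 316738, "instructions": 49356931}

# Ratio of CUDA 2.2 to CUDA 2.1 for each counter.
ratios = {name: cuda_2_2[name] / cuda_2_1[name] for name in cuda_2_1}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

GPU time and instruction count both come out at roughly 2x, and the branch count at a little over 1.9x, so the slowdown seems to track the extra instructions rather than anything like occupancy or launch configuration.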
Thanks!