I am attempting to use the CUDA profiler tool to optimize my application. I have copied the output below. It appears that my value of warp_serialize is high, but I’m not quite sure what is causing it. I have two specific questions regarding this output, but of course any other advice is welcome.
- Why am I getting all these warning messages? I would like to profile my local_load, divergent_branch, etc.
- Besides divergent branches and shared memory bank conflicts, is there anything else that can cause the value of warp_serialize to be high?
-------------- output -----------------
NV_Warning: Signal gst_coherent can not be profiled in this run.
NV_Warning: Signal gst_incoherent can not be profiled in this run.
NV_Warning: Signal gld_32b can not be profiled in this run.
NV_Warning: Signal gld_64b can not be profiled in this run.
NV_Warning: Signal gld_128b can not be profiled in this run.
NV_Warning: Signal gld_request can not be profiled in this run.
NV_Warning: Signal local_load can not be profiled in this run.
NV_Warning: Signal local_store can not be profiled in this run.
NV_Warning: Signal branch can not be profiled in this run.
NV_Warning: Signal divergent_branch can not be profiled in this run.
NV_Warning: Signal instructions can not be profiled in this run.
NV_Warning: Signal warp_serialize can not be profiled in this run.
NV_Warning: Signal cta_launched can not be profiled in this run.
CUDA_PROFILE_LOG_VERSION 1.6
CUDA_DEVICE 1 Tesla T10 Processor
TIMESTAMPFACTOR fffff72c43210a40
timestamp,method,gputime,cputime,regperthread,occupancy,cta_launched,warp_serialize,gld_coherent,gld_incoherent
timestamp=[ 2966.000 ] method=[ memcpyHtoD ] gputime=[ 6.176 ] cputime=[ 5.000 ]
timestamp=[ 1255162.000 ] method=[ memcpyHtoD ] gputime=[ 257933.250 ] cputime=[ 258374.031 ]
timestamp=[ 1513556.000 ] method=[ memcpyHtoD ] gputime=[ 266559.781 ] cputime=[ 266961.000 ]
timestamp=[ 1783791.000 ] method=[ memcpyHtoD ] gputime=[ 32.832 ] cputime=[ 70.000 ]
timestamp=[ 1783866.000 ] method=[ memcpyHtoD ] gputime=[ 32.800 ] cputime=[ 63.000 ]
timestamp=[ 1783929.875 ] method=[ memcpyHtoD ] gputime=[ 32.864 ] cputime=[ 63.000 ]
timestamp=[ 1783994.000 ] method=[ memcpyHtoD ] gputime=[ 32.864 ] cputime=[ 62.000 ]
timestamp=[ 1784057.000 ] method=[ memcpyHtoD ] gputime=[ 33.024 ] cputime=[ 62.000 ]
timestamp=[ 1786586.000 ] method=[ memcpyHtoD ] gputime=[ 182.144 ] cputime=[ 322.000 ]
timestamp=[ 1786912.000 ] method=[ memcpyHtoD ] gputime=[ 179.264 ] cputime=[ 322.000 ]
timestamp=[ 1787274.000 ] method=[ memcpyHtoD ] gputime=[ 4.064 ] cputime=[ 3.000 ]
timestamp=[ 1787336.000 ] method=[ Z10computeSARPfS_S_S_S_S_S_S_S_S ] gputime=[ 4039637.000 ] cputime=[ 4039645.000 ] regperthread=[ 59 ] occupancy=[ 0.250 ] cta_launched=[ 26 ] warp_serialize=[ 367774596 ] gld_coherent=[ 77185382 ] gld_incoherent=[ 0 ]
timestamp=[ 5827141.000 ] method=[ memcpyDtoH ] gputime=[ 192.384 ] cputime=[ 744.000 ]
timestamp=[ 5827888.000 ] method=[ memcpyDtoH ] gputime=[ 181.120 ] cputime=[ 703.000 ]
timestamp=[ 5828593.000 ] method=[ memcpyDtoH ] gputime=[ 4.992 ] cputime=[ 16.000 ]
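
In case it helps anyone reading the log above, here is a minimal sketch of how the `name=[ value ]` records could be summarized per method. It only assumes the line format shown in the output. The helper names `parse_line` and `gputime_by_method` are hypothetical and not part of any CUDA tool.

```python
import re
from collections import defaultdict

# Matches fields of the form  name=[ value ]  as they appear in the log above.
FIELD = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")


def parse_line(line):
    """Return a dict of field name -> string value ({} for non-record lines)."""
    return dict(FIELD.findall(line))


def gputime_by_method(lines):
    """Sum the gputime field per method name (hypothetical helper)."""
    totals = defaultdict(float)
    for line in lines:
        rec = parse_line(line)
        # Warning and header lines carry no method/gputime fields and are skipped.
        if "method" in rec and "gputime" in rec:
            totals[rec["method"]] += float(rec["gputime"])
    return dict(totals)
```

Feeding the log lines to `gputime_by_method` gives one total per method, which makes it easier to compare the time spent in the kernel with the time spent in the memcpy calls.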