CUDA Profile Tool Warnings

I am attempting to use the CUDA profiler tool to optimize my application. I have copied the output below. It appears that my value of warp_serialize is high, but I’m not quite sure what is causing it. I have two specific questions regarding this output, but of course any other advice is welcome.

  1. Why am I getting all these warning messages? I would like to profile my local_load, divergent_branch, etc.

  2. Besides divergent branches and shared memory bank conflicts, is there anything else that can cause the value of warp_serialize to be high?

-------------- output -----------------

NV_Warning: Signal gst_coherent can not be profiled in this run.
NV_Warning: Signal gst_incoherent can not be profiled in this run.
NV_Warning: Signal gld_32b can not be profiled in this run.
NV_Warning: Signal gld_64b can not be profiled in this run.
NV_Warning: Signal gld_128b can not be profiled in this run.
NV_Warning: Signal gld_request can not be profiled in this run.
NV_Warning: Signal local_load can not be profiled in this run.
NV_Warning: Signal local_store can not be profiled in this run.
NV_Warning: Signal branch can not be profiled in this run.
NV_Warning: Signal divergent_branch can not be profiled in this run.
NV_Warning: Signal instructions can not be profiled in this run.
NV_Warning: Signal warp_serialize can not be profiled in this run.
NV_Warning: Signal cta_launched can not be profiled in this run.

CUDA_PROFILE_LOG_VERSION 1.6

CUDA_DEVICE 1 Tesla T10 Processor

TIMESTAMPFACTOR fffff72c43210a40

timestamp,method,gputime,cputime,regperthread,occupancy,cta_launched,warp_serialize,gld_coherent,gld_incoherent
timestamp=[ 2966.000 ] method=[ memcpyHtoD ] gputime=[ 6.176 ] cputime=[ 5.000 ]
timestamp=[ 1255162.000 ] method=[ memcpyHtoD ] gputime=[ 257933.250 ] cputime=[ 258374.031 ]
timestamp=[ 1513556.000 ] method=[ memcpyHtoD ] gputime=[ 266559.781 ] cputime=[ 266961.000 ]
timestamp=[ 1783791.000 ] method=[ memcpyHtoD ] gputime=[ 32.832 ] cputime=[ 70.000 ]
timestamp=[ 1783866.000 ] method=[ memcpyHtoD ] gputime=[ 32.800 ] cputime=[ 63.000 ]
timestamp=[ 1783929.875 ] method=[ memcpyHtoD ] gputime=[ 32.864 ] cputime=[ 63.000 ]
timestamp=[ 1783994.000 ] method=[ memcpyHtoD ] gputime=[ 32.864 ] cputime=[ 62.000 ]
timestamp=[ 1784057.000 ] method=[ memcpyHtoD ] gputime=[ 33.024 ] cputime=[ 62.000 ]
timestamp=[ 1786586.000 ] method=[ memcpyHtoD ] gputime=[ 182.144 ] cputime=[ 322.000 ]
timestamp=[ 1786912.000 ] method=[ memcpyHtoD ] gputime=[ 179.264 ] cputime=[ 322.000 ]
timestamp=[ 1787274.000 ] method=[ memcpyHtoD ] gputime=[ 4.064 ] cputime=[ 3.000 ]
timestamp=[ 1787336.000 ] method=[ Z10computeSARPfS_S_S_S_S_S_S_S_S ] gputime=[ 4039637.000 ] cputime=[ 4039645.000 ] regperthread=[ 59 ] occupancy=[ 0.250 ] cta_launched=[ 26 ] warp_serialize=[ 367774596 ] gld_coherent=[ 77185382 ] gld_incoherent=[ 0 ]
timestamp=[ 5827141.000 ] method=[ memcpyDtoH ] gputime=[ 192.384 ] cputime=[ 744.000 ]
timestamp=[ 5827888.000 ] method=[ memcpyDtoH ] gputime=[ 181.120 ] cputime=[ 703.000 ]
timestamp=[ 5828593.000 ] method=[ memcpyDtoH ] gputime=[ 4.992 ] cputime=[ 16.000 ]
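
For anyone post-processing a log like the one above, here is a minimal Python sketch that turns one `key=[ value ]` record into a dictionary. The function name `parse_record` and the regular expression are illustrative assumptions, not part of any NVIDIA tool; the sample line is the kernel record copied from the output above.

```python
import re

# Matches one "key=[ value ]" field as it appears in the log above.
FIELD = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")

def parse_record(line):
    """Parse one profiler log record into a dict (hypothetical helper).

    The method name is kept as a string; every other field is
    converted to float.
    """
    record = {}
    for key, value in FIELD.findall(line):
        record[key] = value if key == "method" else float(value)
    return record

# Kernel record copied from the log in this thread.
KERNEL_LINE = (
    "timestamp=[ 1787336.000 ] method=[ Z10computeSARPfS_S_S_S_S_S_S_S_S ] "
    "gputime=[ 4039637.000 ] cputime=[ 4039645.000 ] regperthread=[ 59 ] "
    "occupancy=[ 0.250 ] cta_launched=[ 26 ] warp_serialize=[ 367774596 ] "
    "gld_coherent=[ 77185382 ] gld_incoherent=[ 0 ]"
)

record = parse_record(KERNEL_LINE)
```

Parsing every record this way makes it easy to sort by `gputime` or to merge the counters collected in separate runs by `method` name.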

  1. I’m not familiar with the latest versions of the profiler, but I believe there is a limit on the number of counters you can collect in one run. You just need multiple runs, collecting a few different counters each time.
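
The multiple-run approach can be sketched as a small Python helper that splits the counters into per-run groups and renders one profiler config per group. The limit of four counters per run is an assumption (the log in this thread shows four counters collected); `batch_counters` and `render_config` are hypothetical names, and the counter list is taken from the warnings and the log header above.

```python
# Counter names taken from the warnings and the log header in this thread.
COUNTERS = [
    "gld_coherent", "gld_incoherent", "gst_coherent", "gst_incoherent",
    "gld_32b", "gld_64b", "gld_128b", "gld_request",
    "local_load", "local_store", "branch", "divergent_branch",
    "instructions", "warp_serialize", "cta_launched",
]

def batch_counters(counters, per_run=4):
    """Split counters into groups of at most per_run (assumed limit: 4)."""
    return [counters[i:i + per_run] for i in range(0, len(counters), per_run)]

def render_config(batch):
    """Render one config file body: one counter name per line."""
    return "\n".join(batch) + "\n"

batches = batch_counters(COUNTERS)
configs = [render_config(batch) for batch in batches]
```

Each string in `configs` would be saved to its own file and the application run once per file. As far as I know, the command-line profiler of that era was enabled with the environment variable `CUDA_PROFILE=1` and pointed at a config file with `CUDA_PROFILE_CONFIG`; check the documentation for your toolkit version before relying on those names.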

  2. Atomic ops can cause warp serialization, as can built-in transcendental functions (through conditionals in the library implementation).
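
To make "serialization" concrete, here is a toy Python model of two of the causes mentioned in this thread: atomic operations that hit the same address, and shared memory bank conflicts. It is only a sketch under stated assumptions: 16 banks of 32-bit words serviced per half-warp of 16 threads (which I believe matches compute capability 1.x devices such as the Tesla T10), and threads reading the same word treated as a broadcast. The function names are hypothetical, and this is not how the profiler computes `warp_serialize`.

```python
from collections import Counter

def atomic_passes(addresses):
    """Passes needed if threads updating the same address take turns.

    addresses: non-empty list with one target address per thread.
    """
    return max(Counter(addresses).values())

def bank_conflict_degree(word_indices, num_banks=16):
    """Worst-case number of distinct words requested from a single bank.

    word_indices: non-empty list with one 32-bit word index per thread
    of a half-warp. num_banks=16 is an assumption for compute 1.x.
    Threads that read the same word do not conflict in this model.
    """
    words_per_bank = {}
    for word in set(word_indices):
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())

half_warp = range(16)
stride_1 = bank_conflict_degree([t * 1 for t in half_warp])    # conflict-free
stride_2 = bank_conflict_degree([t * 2 for t in half_warp])    # 2-way conflict
stride_16 = bank_conflict_degree([t * 16 for t in half_warp])  # 16-way conflict
same_word = bank_conflict_degree([5 for _ in half_warp])       # broadcast
one_counter = atomic_passes([0] * 32)  # every thread updates one address
```

In this model a stride of 16 words or a single shared atomic counter is the worst case, since every thread ends up waiting on the same bank or address.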

Thank you for your reply. You were correct about the profiler. Multiple runs permitted me to get results for all the tests. I have posted a new question regarding shared memory bank conflicts. If you get a chance…

http://forums.nvidia.com/index.php?showtopic=170031

I have the same problem. Please give me a solution.