I started optimizing my program with the CUDA profiler but ran into an unexpected problem.
All counters are reported as 0. OS: Vista 64-bit, CUDA 2.3; the profiled program is built for the Win32 platform.
cudaThreadExit is called at the end of the program. Kernel calls are reported. They, of course, access global data.
I requested all counters, so the program was launched 5 times.
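
A program of the kind described would look roughly like the sketch below. It is not the actual code: the kernel `scale`, the helper `run_scale`, and the array size are placeholders chosen for illustration. It only shows the shape of the setup: a kernel that reads and writes global memory, followed by cudaThreadExit at the end.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread loads one float from global memory
// and stores one float back, so global load/store activity is expected.
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

// Hypothetical helper: runs the kernel on n elements and returns the
// number of mismatching results, or -1 if a CUDA call failed.
int run_scale(int n)
{
    const size_t bytes = n * sizeof(float);
    float* h_in  = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void**)&d_out, bytes) != cudaSuccess)
        return -1;
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i])
            ++errors;

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return errors;
}

int main()
{
    int errors = run_scale(1 << 20);
    printf("mismatches: %d\n", errors);

    // Tear down the context at the end of the program, as in the post.
    // (Later CUDA releases replace this call with cudaDeviceReset.)
    cudaThreadExit();
    return errors == 0 ? 0 : 1;
}
```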