CUDA profiler reporting almost all zeros

This is my first time using the CUDA Visual Profiler on Linux, and also my first time profiling a compute capability 2.0 device.

When I profile my code, I ONLY get values for GPU/CPU time, grid size, registers per thread, and occupancy. Most of the other counters report 0. For some of them that makes sense, but not for ALL of them.

Why would gld request and gst request be 0? I’m doing everything with global memory.
Why would warps launched and sm cta launched be 0? Warps have to be launched, don’t they?

In the user guide, they explain under the section about 1.x GPUs that the counter will be zero if “The counter value is less than the number of blocks launched on the multiprocessor(s) being profiled. The normalized fractional value less than one is truncated to zero.” Does this apply to 2.0 GPUs as well? But regardless, I’m certainly performing plenty of global requests.
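To make the quoted rule concrete, here is a minimal sketch of the truncation it describes. It assumes the normalization is a plain integer division of the raw counter by the number of blocks on the profiled multiprocessor; that is my reading of the quote, not a confirmed description of the profiler's internals. The function name `normalized_counter` is hypothetical and not part of any CUDA tool.

```python
def normalized_counter(raw_count: int, blocks_on_profiled_mp: int) -> int:
    """Model of the quoted rule (an assumption, not the profiler's real code):
    divide the raw counter by the block count and drop the fractional part."""
    if blocks_on_profiled_mp <= 0:
        raise ValueError("blocks_on_profiled_mp must be positive")
    # Floor division: any result below one becomes zero.
    return raw_count // blocks_on_profiled_mp


# A raw count smaller than the block count is reported as 0.
print(normalized_counter(3, 8))
# A raw count of at least the block count survives normalization.
print(normalized_counter(64, 8))
```

Under this reading, a counter would only show up as zero when the raw value is smaller than the block count, which is why plenty of global requests should not vanish for this reason alone.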

My kernel configuration is 15 blocks, one thread per block. Maybe that’s just too few blocks, and the multiprocessor being profiled doesn’t get any of them… but then how do the GPU/CPU time, etc. get real values?

Any help is appreciated,

Jeff

Ubuntu 9.04 (64-bit)
Visual Profiler 3.0.12

NVRM version: NVIDIA UNIX x86_64 Kernel Module 195.36.15
GCC version: gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
Number of CUDA devices : 1
Device 0 : GeForce GTX 480

So the problem was indeed the low number of blocks.
