Questions about Visual Profiler


I have two questions about the following code and the resulting profiler output.

  1. Why don’t I get the same result for func2 as for func1?

    As you can see, func1 and func2 are identical.

  2. What do the GPU time and the CPU time mean, and why are both reported together?

    I would have guessed that CPU time should be zero while a kernel runs, because the CPU stops and waits for the GPU to finish…

Could anybody tell me why?

I’m using Windows XP 64-bit, NVIDIA driver version 190.38, a GTX 285, and a 9800 GTX (for display).

Best regards,


    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Copies ipt to opt, one element per thread. */
    __global__ void func1( float* opt, float* ipt )
    {
        unsigned int tid = threadIdx.x;
        opt[tid] = ipt[tid];
    }

    /* Identical to func1; only the name differs. */
    __global__ void func2( float* opt, float* ipt )
    {
        unsigned int tid = threadIdx.x;
        opt[tid] = ipt[tid];
    }

    int main( int argc, char** argv )
    {
        unsigned int i;
        unsigned int cnt = 100;
        float *inp_h, *inp_d;
        float *opt_h, *opt_d;

        /* Host buffers: input is 0, 1, 2, ...; output starts zeroed. */
        inp_h = (float *)malloc( cnt * sizeof(float) );
        for (i = 0; i < cnt; i++) inp_h[i] = (float)i;
        opt_h = (float *)malloc( cnt * sizeof(float) );
        for (i = 0; i < cnt; i++) opt_h[i] = 0.0f;

        cudaSetDevice( 0 );
        cudaMalloc( (void**) &inp_d, cnt * sizeof(float) );
        cudaMalloc( (void**) &opt_d, cnt * sizeof(float) );
        cudaMemcpy( inp_d, inp_h, cnt * sizeof(float), cudaMemcpyHostToDevice );
        cudaMemcpy( opt_d, opt_h, cnt * sizeof(float), cudaMemcpyHostToDevice );

        /* One block of cnt threads for each kernel. */
        func1<<< 1, cnt >>>( opt_d, inp_d );
        func2<<< 1, cnt >>>( opt_d, inp_d );

        cudaMemcpy( opt_h, opt_d, cnt * sizeof(float), cudaMemcpyDeviceToHost );

        cudaFree( inp_d );
        cudaFree( opt_d );
        free( inp_h );
        free( opt_h );
        return 0;
    }



If I were you, I would run func1 twice and, if possible, leave the first run out of your measurements.

The first time you run something, caches, buffers and so on are often not yet initialized. The second run gives a better indication of how the code will perform on a large problem or in continuous operation.

Commercial CUDA toolboxes often recommend a “GPU warm-up” before performance testing to avoid exactly this issue.
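A minimal sketch of such a warm-up run, assuming the same copy kernel as func1 in the question. The timing with CUDA events and the use of cudaDeviceSynchronize (the current name; toolkits from the era of driver 190.38 used cudaThreadSynchronize) are my assumptions, not something taken from the original code, and the kernel is far too small for the number to mean much on its own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same copy kernel as func1 in the question.
__global__ void func1(float* opt, const float* ipt)
{
    unsigned int tid = threadIdx.x;
    opt[tid] = ipt[tid];
}

int main()
{
    const unsigned int cnt = 100;
    float *inp_d, *opt_d;
    cudaMalloc((void**)&inp_d, cnt * sizeof(float));
    cudaMalloc((void**)&opt_d, cnt * sizeof(float));
    cudaMemset(inp_d, 0, cnt * sizeof(float));

    // Warm-up launch: absorbs one-time initialization cost and is not timed.
    func1<<<1, cnt>>>(opt_d, inp_d);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Timed launch: this is the run to look at.
    cudaEventRecord(start, 0);
    func1<<<1, cnt>>>(opt_d, inp_d);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("func1 after warm-up: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(inp_d);
    cudaFree(opt_d);
    return 0;
}
```

In the profiler, the same idea simply means ignoring the row for the first launch and comparing the later ones.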



Hi Henrik,

Thank you for your reply. I agree with you, and I have also found another reason.

Visual Profiler did not observe the SM that func2 ran on, because of the following two conditions.

1. Visual Profiler observes only TPC#0, and within TPC#0 only SM#0.

2. Within a CUDA session, thread blocks are assigned to SMs one after another, and the assignment continues from one kernel call to the next.

For example:

If I call a function like “func<<< 8, x >>>(…)” twice, the thread blocks are assigned as shown in the attached image.

Yellow SMs are used by the first call.

Green SMs are used by the second call.

img.bmp (348 KB)

In the second call, the counters for TPC#0 return values, but the counters for SM#1 in TPC#0 do not, because Visual Profiler does not observe that SM.

(A GT200 core has 10 TPCs, and each TPC has 3 SMs.)
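One way to check this kind of block placement yourself is to have each block record the SM it ran on. The sketch below is only an illustration: record_sm and get_smid are hypothetical names, it assumes a toolkit that supports inline PTX, and it reads the PTX special register %smid. How %smid numbers map onto TPCs is not something I can confirm, and the scheduler's placement is not guaranteed to be the same from run to run.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads the PTX special register %smid: the ID of the SM running this thread.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical helper kernel: thread 0 of each block stores its SM ID.
__global__ void record_sm(unsigned int* sm_ids)
{
    if (threadIdx.x == 0)
        sm_ids[blockIdx.x] = get_smid();
}

int main()
{
    const int blocks = 8;              // same grid size as the example above
    unsigned int sm_ids_h[blocks];
    unsigned int* sm_ids_d;
    cudaMalloc((void**)&sm_ids_d, blocks * sizeof(unsigned int));

    // Launch the same kernel twice and print where each block landed.
    for (int call = 0; call < 2; ++call) {
        record_sm<<<blocks, 32>>>(sm_ids_d);
        // cudaMemcpy waits for the kernel to finish before copying.
        cudaMemcpy(sm_ids_h, sm_ids_d, sizeof(sm_ids_h), cudaMemcpyDeviceToHost);
        for (int b = 0; b < blocks; ++b)
            printf("call %d, block %d -> SM %u\n", call, b, sm_ids_h[b]);
    }

    cudaFree(sm_ids_d);
    return 0;
}
```

If the two calls print different SM IDs for the same block index, that would be consistent with the second call landing on SMs the profiler does not watch.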