Profiler output not consistent

Hi,

I profiled the following CUDA program, which squares the first 100 numbers. A screenshot of the profiler output is attached. Looking at that output alongside my code, I have the following questions:

1 - Why is the time reported for memcpyHtoD different from the time for memcpyDtoH?

2 - The output shows 65 branches. Looking at my code (100 threads in one block), I cannot work out where this number comes from.

3 - Processing 100 elements needs 4 warps in all. The first three warps will have no branch divergence, but the fourth warp (thread ID 96 onwards) will. Why is this divergence information absent from the profiler output?

4 - What does the "instructions" column report?

Thank you very much in advance

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Each thread squares one element of a.
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

int main(void)
{
    float *a_h, *a_d;   // host and device copies of the array
    const int N = 100;
    size_t size = N * sizeof(float);

    a_h = (float *)malloc(size);
    cudaMalloc((void **)&a_d, size);
    for (int i = 0; i < N; i++) a_h[i] = (float)i;

    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    square_array<<<100, 1>>>(a_d, N);
    cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("%d\t%f\n", i, a_h[i]);

    free(a_h);
    cudaFree(a_d);
    return 0;
}


Actually, you're running 100 blocks with one thread in each. That probably explains the branches (and in any case it is not a good thing to do).
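For comparison, here is a minimal sketch of a launch that puts the threads into full-sized blocks instead of one thread per block. The block size of 128 and the helper name blocks_for are assumptions made for illustration; they are not from the original post.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: number of blocks needed to cover n elements
// (integer division rounded up).
static int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

int main(void)
{
    const int N = 100;
    const int threads = 128;  // assumed block size, a multiple of the warp size
    const size_t size = N * sizeof(float);
    float *a_h = (float *)malloc(size);
    float *a_d = NULL;

    if (a_h == NULL) return 1;
    for (int i = 0; i < N; i++) a_h[i] = (float)i;

    if (cudaMalloc((void **)&a_d, size) != cudaSuccess) {
        free(a_h);
        return 1;
    }
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // First launch parameter is the number of blocks, second is threads per block.
    square_array<<<blocks_for(N, threads), threads>>>(a_d, N);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    printf("a_h[99] = %f\n", a_h[N - 1]);

    free(a_h);
    cudaFree(a_d);
    return 0;
}
```

With N=100 and 128 threads per block this launches a single block, and the if (idx < N) guard is what keeps the 28 surplus threads from writing past the end of the array.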

As for memcpyHtoD and memcpyDtoH: there is no reason they must be the same. Furthermore, the profiler might actually add the kernel run time to the second copy, since you don't call cudaThreadSynchronize (this is a guess; maybe try to verify it).
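One way to check that guess is to synchronize after the kernel and time each phase on its own with CUDA events. The following is only a sketch: it assumes a toolkit that provides cudaDeviceSynchronize (the replacement for the since-deprecated cudaThreadSynchronize), and the helper name time_phases is hypothetical.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: times the three phases separately with CUDA events.
// ms[0] = host-to-device copy, ms[1] = kernel, ms[2] = device-to-host copy.
static cudaError_t time_phases(float ms[3])
{
    const int N = 100;
    const size_t size = N * sizeof(float);
    float a_h[N];
    float *a_d = NULL;
    cudaEvent_t ev[5];

    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaError_t err = cudaMalloc((void **)&a_d, size);
    if (err != cudaSuccess) return err;
    for (int i = 0; i < 5; i++) cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0], 0);
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1], 0);

    square_array<<<100, 1>>>(a_d, N);  // same launch as in the question
    cudaEventRecord(ev[2], 0);

    // Block the host until the kernel has finished, so the copy below
    // is not started while the kernel is still running.
    cudaDeviceSynchronize();
    err = cudaGetLastError();

    cudaEventRecord(ev[3], 0);
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[4], 0);
    cudaEventSynchronize(ev[4]);

    cudaEventElapsedTime(&ms[0], ev[0], ev[1]);
    cudaEventElapsedTime(&ms[1], ev[1], ev[2]);
    cudaEventElapsedTime(&ms[2], ev[3], ev[4]);

    for (int i = 0; i < 5; i++) cudaEventDestroy(ev[i]);
    cudaFree(a_d);
    return err;
}

int main(void)
{
    float ms[3];
    if (time_phases(ms) != cudaSuccess) return 1;
    printf("HtoD %f ms, kernel %f ms, DtoH %f ms\n", ms[0], ms[1], ms[2]);
    return 0;
}
```

If the two copy times still differ with the kernel timed separately, the difference is not coming from the kernel being folded into the second copy.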

The "instructions" column is the number of operations done by the SMs which are not memory related.

eyal

(I got the answer for question #3. My assumption was wrong.)

Thanks for your answer. In my kernel, the single if statement is the only source of branching. With 1 thread per block, each block has only threadIdx.x=0 as a working thread. Since a whole warp must be running on an SM at any time, I understand that threadIdx.x=1 to threadIdx.x=31 in each block are just doing dummy work. My manual calculation is as follows:

idx=blockIdx.x*blockDim.x+threadIdx.x, with blockDim.x=1 and threadIdx.x=0 in every block:

Block # 0: idx=0x1+0=0
Block # 1: idx=1x1+0=1
Block # 2: idx=2x1+0=2
Block # 3: idx=3x1+0=3
...
Block # 99: idx=99x1+0=99
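The same arithmetic can be checked on the host. This is a small sketch with hypothetical helper names (global_index, warps_per_block, idle_lanes_per_block); it only reproduces the index formula and counts warps for the two launch shapes discussed in this thread, and says nothing about how the profiler counts branches.

```cuda
#include <stdio.h>

#define WARP_SIZE 32  // warp size on NVIDIA GPUs to date

// Same formula as in the kernel: blockIdx.x * blockDim.x + threadIdx.x.
static int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

// A block always occupies a whole number of warps.
static int warps_per_block(int block_dim)
{
    return (block_dim + WARP_SIZE - 1) / WARP_SIZE;
}

// Lanes in those warps that have no thread assigned to them.
static int idle_lanes_per_block(int block_dim)
{
    return warps_per_block(block_dim) * WARP_SIZE - block_dim;
}

static void report(int grid_dim, int block_dim)
{
    printf("<<<%d, %d>>>: %d warps, %d idle lanes\n",
           grid_dim, block_dim,
           grid_dim * warps_per_block(block_dim),
           grid_dim * idle_lanes_per_block(block_dim));
}

int main(void)
{
    // One thread per block: threadIdx.x is always 0 and blockDim.x is 1.
    for (int b = 0; b < 4; b++)
        printf("Block # %d: idx = %d\n", b, global_index(b, 1, 0));
    printf("Block # 99: idx = %d\n", global_index(99, 1, 0));

    report(100, 1);  // the launch in the posted code
    report(1, 100);  // one block of 100 threads
    return 0;
}
```

For the posted launch this reports 100 warps with 3100 idle lanes, against 4 warps with 28 idle lanes for one block of 100 threads.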

I am running the program on a 9600GT having 64 scalar processors.

1 - Let's say SM #0 takes the first 8 blocks. Each block contributes one branch from the if statement (each block has only one warp, with one thread actually working, threadIdx.x=0, and 31 dummy threads), so SM #0 sees 8 branches in total. Since I have 64 scalar processors in total, I assume there would be 8x8 = 64 branches overall.

The profiler is giving 65 branches.

Where does the 65th branch, the one missing from my count, come from?