I profiled the following CUDA program (which basically calculates the squares of the first 100 numbers). The result is attached as a screenshot. Looking at the profiler output and my code, I have the following questions:
1- Why is the time taken by memcpyHtoD different from that taken by memcpyDtoH?
2- The output shows that there are 65 branches. Looking at my code (100 threads and one block), I am unable to figure out where this number comes from.
3- To process 100 elements we need 4 warps in all. The first three warps will not have any branch divergence, though there will be branch divergence in the 4th warp (thread ID 96 onwards). Why is this divergence information absent from the profiler output?
4- What does the “instructions” column report?
(I got the answer to question #3. My assumption was wrong.)
Thanks for your answer. In my kernel, there is one if statement, which is the “only” source of branching. 1 thread per block means that in each block threadIdx.x=0 is the only thread actually working. Since a whole warp must be running on one SM at any time, I understand that threadIdx.x=1 to threadIdx.x=31 in each block are just doing some dummy work. My manual calculation is as follows:
idx=blockIdx.x*blockDim.x+threadIdx.x;
Block # 0
idx=0*1+0=0 (As blockDim.x=1)
Block # 1
idx=1*1+0=1 (As blockDim.x=1)
Block # 2
idx=2*1+0=2 (As blockDim.x=1)
Block # 3
idx=3*1+0=3 (As blockDim.x=1)
...
Block # 99
idx=99*1+0=99 (As blockDim.x=1)
I am running the program on a 9600GT having 64 scalar processors.
1- Let's say SM #0 takes the first 8 blocks. In each block there will be one branch due to the if statement (as each block has only one warp, with one thread actually working, threadIdx.x=0, and 31 dummy threads), so in total 8 branches will come from SM #0. Since I have 64 scalar processors in total, I assume that there would be 8 x 8 = 64 branches in total.
The profiler is giving 65 branches.
Where does the 65th branch, the one I cannot account for, come from?