I ran a simple test to check how much computation can be done while other threads are waiting for data from device memory. For this I prepared the following code:

```
// Each thread loads four consecutive floats, runs the math block, and stores four floats.
__global__ void kernel(float* d_in, float* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    float a = d_in[4 * index];
    float b = d_in[4 * index + 1];
    float c = d_in[4 * index + 2];
    float d = d_in[4 * index + 3];

    a = b * c;  //
    b = c + d;  // math
    c = a * b;  // operations
    d = a + c;  //

    d_out[4 * index]     = a;
    d_out[4 * index + 1] = b;
    d_out[4 * index + 2] = c;
    d_out[4 * index + 3] = d;
}

// d_in and d_out must each hold at least 4 * nBlocks * threadsPerBlock floats.
void launch_kernel(float* d_in, float* d_out)
{
    int nBlocks = 10000;
    int threadsPerBlock = 512;
    kernel<<<nBlocks, threadsPerBlock>>>(d_in, d_out);
    cudaDeviceSynchronize();  // wait for the kernel to finish
}
```
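
The timings in this post come from the CUDA profiler. For anyone who wants to cross-check them without the profiler, the following is a minimal sketch that times the same kernel with CUDA events. It is not the code behind the reported numbers: `time_kernel_us` is a hypothetical helper name, the input is simply zero-filled, and one untimed warm-up launch is an assumption about how to get a stable reading.

```cuda
#include <cuda_runtime.h>

// Copy of the kernel above so that this sketch compiles on its own.
__global__ void kernel(float* d_in, float* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = d_in[4 * index];
    float b = d_in[4 * index + 1];
    float c = d_in[4 * index + 2];
    float d = d_in[4 * index + 3];
    a = b * c;
    b = c + d;
    c = a * b;
    d = a + c;
    d_out[4 * index]     = a;
    d_out[4 * index + 1] = b;
    d_out[4 * index + 2] = c;
    d_out[4 * index + 3] = d;
}

// Hypothetical helper (not part of the original test): returns the kernel time
// in microseconds measured with CUDA events, or a negative value on failure.
float time_kernel_us(int nBlocks, int threadsPerBlock)
{
    // Each thread touches 4 floats, so each buffer needs 4 * (total threads) floats.
    size_t bytes = 4 * (size_t)nBlocks * threadsPerBlock * sizeof(float);

    float* d_in = NULL;
    float* d_out = NULL;
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }
    cudaMemset(d_in, 0, bytes);  // give the input defined values (all zeros)

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<nBlocks, threadsPerBlock>>>(d_in, d_out);  // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    kernel<<<nBlocks, threadsPerBlock>>>(d_in, d_out);  // timed launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block until the timed launch has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? ms * 1000.0f : -1.0f;
}
```

Calling `time_kernel_us(10000, 512)` from `main` uses the same launch configuration as `launch_kernel`. Event timings and profiler timings need not agree exactly, so the comparison between variants matters more than the absolute value.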

The four lines marked `// math operations` contain two floating-point multiplications and two floating-point additions.

First I measured the GPU execution time with the CUDA profiler: the kernel takes 1463 µs.

Next I changed only this fragment of the code (the four math lines). With the fragment commented out, the kernel takes the same time.

I thought: OK, the calculations are masked by the time it takes to transfer the data from device memory.
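
A rough estimate of the memory traffic is consistent with this reading. Assuming every thread really moves four floats in and four floats out, and taking the 1463 µs measured above:

$$
\begin{aligned}
\text{bytes moved} &= 10000 \times 512 \times (4 + 4) \times 4\ \text{B} = 163.84\ \text{MB} \\
\text{effective bandwidth} &\approx \frac{163.84 \times 10^{6}\ \text{B}}{1463 \times 10^{-6}\ \text{s}} \approx 112\ \text{GB/s}
\end{aligned}
$$

This is only an estimate, not a profiler figure. It is a large fraction of the theoretical peak of the GTX 580 (around 192 GB/s, quoted from memory), so the kernel is plausibly limited by memory traffic rather than by arithmetic.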

Then I started adding more calculations by copying the four math lines, and the results surprised me!
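
Copying the math lines by hand is easy to get wrong, so one way to express the same experiment is to make the number of copies a template parameter. This is a sketch, not the code behind the timings in this post: `kernel_reps`, `launch_kernel_reps` and `REPS` are hypothetical names, and it assumes the compiler fully unrolls the loop so that the generated code matches the hand-copied version (worth checking in the machine code, for example with `cuobjdump -sass`).

```cuda
#include <cuda_runtime.h>

// REPS = number of copies of the four-line math block (hypothetical parameter).
// REPS = 0 gives the copy-only kernel with no arithmetic at all.
template <int REPS>
__global__ void kernel_reps(float* d_in, float* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    float a = d_in[4 * index];
    float b = d_in[4 * index + 1];
    float c = d_in[4 * index + 2];
    float d = d_in[4 * index + 3];

    // Each iteration performs 2 multiplies and 2 adds, all dependent on the
    // previous results, and all four values are stored below, so the work
    // cannot be removed as dead code.
    #pragma unroll
    for (int i = 0; i < REPS; ++i) {
        a = b * c;
        b = c + d;
        c = a * b;
        d = a + c;
    }

    d_out[4 * index]     = a;
    d_out[4 * index + 1] = b;
    d_out[4 * index + 2] = c;
    d_out[4 * index + 3] = d;
}

// Same launch configuration as launch_kernel; the buffers must each hold
// at least 4 * 10000 * 512 floats.
template <int REPS>
void launch_kernel_reps(float* d_in, float* d_out)
{
    kernel_reps<REPS><<<10000, 512>>>(d_in, d_out);
    cudaDeviceSynchronize();
}
```

With this layout, `launch_kernel_reps<0>` is the variant without math, and `REPS` values of 1, 2, 3, 5, 8 and 10 correspond to 2, 4, 6, 10, 16 and 20 adds and multiplies each.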

Adding more calculations made the kernel run faster, and until there were more than 12 add and multiply operations the time was shorter than with no math at all!

Beyond that point the behaviour is what I would expect: more operations -> more time.

| Test No. | No. of floating-point adds | No. of floating-point multiplies | GPU time [µs] |
|---|---|---|---|
| 1 | 0 | 0 | 1462 |
| 2 | 2 | 2 | 1462 |
| 3 | 4 | 4 | 1430 |
| 4 | 6 | 6 | 1383 |
| 5 | 10 | 10 | 1335 |
| 6 | 16 | 16 | 1549 |
| 7 | 20 | 20 | 1690 |

If you know why the time goes down at first, please let me know.

Thank you very much for your response :)

Martin

PS. The test was performed on a GeForce GTX 580 with CUDA v4.0.