Simple test, unexpected results: more calculations in each thread, less GPU occupancy time!

I have done a simple test to check how much calculation can be done while other threads are waiting for data from device memory. To do it, I’ve prepared a simple code:

__global__ void kernel(float* d_in, float* d_out)
	{
	    register int index = blockIdx.x * blockDim.x + threadIdx.x;

	    register float a = d_in[4*index];
	    register float b = d_in[4*index+1];
	    register float c = d_in[4*index+2];
	    register float d = d_in[4*index+3];

	    a =b*c;			// 
	    b =c+d;			// math
	    c =a*b;			// operations
	    d =a+c;			//

	    d_out[4*index] = a; 
	    d_out[4*index+1] = b;
	    d_out[4*index+2] = c; 
	    d_out[4*index+3] = d;
	}

	void launch_kernel(float* d_in, float* d_out)
	{
	    int nBlocks = 10000;
	    int threadsPerBlock = 512; 

	    kernel<<<nBlocks, threadsPerBlock>>>(d_in,d_out);

	    cudaDeviceSynchronize();
	}

In lines 10-13 there are two floating point multiply and two add operations.
Firsly I’ve check the time of GPU execution by CUDA profiler, it takes 1463 us.
Next I’ve changed only this fragment of code (lines 10-13).
I’ve comment this fragment and the kernel execution takes the same time.
I’ve thought OK! The calculation are mask by the time of data transfer from the device memory.
Then I’ve started to add more calculations by copying lines 10-13, and the results are surprising me!
Giving more calculation caused faster execution time, and until there was more then 12 add and multiply operations the time was shorter than without math at all!
Later it is naturally, more operation -> more time.

____________No. of floating point operations
Test No.________add___________multiply______________GPU time [µs]
1________________0_______________0______________________1462
2________________2_______________2______________________1462
3________________4_______________4______________________1430
4________________6_______________6______________________1383
5_______________10______________10______________________1335
6_______________16______________16______________________1549
7_______________20______________20______________________1690

If you know why this time is going down, please let me know.

Thank you very much for response:)

Martin

PS. Test was performed on GeForce GTX 580 with CUDA v4.0

Have you been testing your code on CUDA 5.0 driver? From my point of view this code looks ok.

I suggest to test your code on the newest driver and if it is possible on the other GPU.

Ok. I did it on the other computer, equipped with similar hardware.
The previous test was done on: GeForce CUDA GTX 580 MSI 3072MB Lightning XE
with CUDA v4.0

Today, I did the same test on: GeForce CUDA GTX 580 OC MSI 1536MB TWINFROZ
and with the newest CUDA v5.0.

________No. of floating point operations_____CUDAv4.0_________CUDAv5.0
Test No.______add___________multiply________GPU time [µs]____GPU time [µs]
1_____________0_______________0_________________1462____________1459
2_____________2_______________2_________________1462____________1453
3_____________4_______________4_________________1430____________1454
4_____________6_______________6_________________1383____________1453
5____________10______________10_________________1335____________1446
6____________16______________16_________________1549____________1439
7____________20______________20_________________1690____________1434
7____________28______________28_________________2080____________1427
8____________56______________56_________________________________1461

It looks different, the tendency is the same. Firstly this time is going
down and later it grows up.

Thank you for comments:)

Martin

Note that your code is bandwidth-bound until you use too much arithmetic. This means that the time is not really expected to go up right away - first, it should stay flat, and then it should go up. But in your case it does go down first.

Such things happen when you get too much concurrency in the memory system, such as too many outstanding memory transactions at the same time. For example, you can see this effect on throughput-versus-occupancy curves for memory-bound kernels that have lots of ILP. They hit the peak at relatively small occupancy and then run slower at occupancy that is higher than that.

I bet something similar happens in your case. Increasing arithmetic intensity you reduce the number of outstanding memory transactions, which means lightening the load on the memory system. You have less contention and the traffic goes smoother - therefore it runs faster.

Hi there vvolkov,
Could You elaborate the matter of ‘too many outstanding memory transactions’ and ‘memory-bound kernels that have lots of ILP’, or at least provide some reference about it? I must say not everything what You mention is known to me, while seems much interesting.

Cheers,
MK

Hi,
Firstly, thank you for your answers:)
For me, explanation written by vvolkov seems correct.
I did additional test with barrier for thread synchronization, which confirm it:)

The FIRST Kernel definition is as follows:

  1. taking data from device memory,
    __syncthreads();
  2. simple arithmetic operations which do not need synchronization,
  3. arithmetic which need synchronization.

and the SECOND Kernel:

  1. taking data from device memory,
  2. simple arithmetic operations which do not need synchronization,
    __syncthreads();
  3. arithmetic which need synchronization.

The second kernel is faster! I think that it is exactly the same mechanism: in the first case I have much concurrency in the memory system, while in the second thanks to the arithmetic, the memory traffic goes smoother.

Thank you vvolkov!!!:)
Thanks Guys.

Best wishes!

Martin