Grid3D parallelism problem - CUDA - GTX690

Hello,

I wrote a small program to test the parallelism of the calculations.

My kernel grid is 3D, and each thread executes a simple instruction.

Here are my results:

- blocksPerGrid (16,16,1) and threadsPerBlock (32,32,1) -> 16 * 16 * 1 * 1024 = 262144 threads, KernelTime -> 20 ms
- blocksPerGrid (16,16,2) and threadsPerBlock (32,32,1) -> 16 * 16 * 2 * 1024 = 524288 threads, KernelTime -> 40 ms
- blocksPerGrid (16,16,3) and threadsPerBlock (32,32,1) -> 16 * 16 * 3 * 1024 = 786432 threads, KernelTime -> 60 ms
- blocksPerGrid (16,16,4) and threadsPerBlock (32,32,1) -> 16 * 16 * 4 * 1024 = 1048576 threads, KernelTime -> 80 ms
- …

I don't understand these results. Why is the execution effectively sequential?
Where does this latency come from? Isn't block execution parallel?

My GTX 690 has 1536 CUDA cores per GPU, and 2 GPUs.

Thanks for your responses

Here is my __global__ method:

__global__ void computeDetection(int numberAircraft, int *deviceArrayC)
{
    int indiceSlave  = blockDim.y * blockIdx.y + threadIdx.y;
    int indiceMaster = blockDim.x * blockIdx.x + threadIdx.x;
    int indiceSample = blockIdx.z;

    int localIdx = indiceMaster * numberAircraft + indiceSlave;

    // Only keep the strict lower triangle of the (master, slave) pair matrix.
    if (indiceMaster <= indiceSlave)
        return;
    // Upper bound guard (>=, since valid indices are 0 .. numberAircraft - 1).
    if (indiceMaster >= numberAircraft)
        return;

    deviceArrayC[localIdx] = -1;

    // Dummy workload: a 364 x 364 min reduction, just to keep each thread busy.
    int tmp = 10000;
    for (int indVolumeMaster = 0; indVolumeMaster < 364; indVolumeMaster++)
    {
        for (int indVolumeSlave = 0; indVolumeSlave < 364; indVolumeSlave++)
        {
            tmp = (tmp < indVolumeSlave) ? tmp : indVolumeSlave;
        }
    }
    deviceArrayC[localIdx * 120 + indiceSample] = tmp;
}
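And here is roughly how I launch and time it (a simplified sketch: the allocation size and the cudaEvent-based timing shown here are illustrative, not my exact harness):

#include <cstdio>
#include <cuda_runtime.h>

// computeDetection is the kernel above.
int main()
{
    // 16 blocks * 32 threads in x -> indiceMaster runs from 0 to 511.
    int numberAircraft = 512;

    // Sized to cover the largest index written: localIdx * 120 + indiceSample.
    int *deviceArrayC;
    cudaMalloc(&deviceArrayC, (size_t)numberAircraft * numberAircraft * 120 * sizeof(int));

    dim3 threadsPerBlock(32, 32, 1);
    dim3 blocksPerGrid(16, 16, 4);   // z was varied from 1 to 4 in the tests above

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    computeDetection<<<blocksPerGrid, threadsPerBlock>>>(numberAircraft, deviceArrayC);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float kernelTimeMs = 0.0f;
    cudaEventElapsedTime(&kernelTimeMs, start, stop);
    printf("KernelTime -> %.1f ms\n", kernelTimeMs);

    cudaFree(deviceArrayC);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}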

It looks like your main issue is that your writes to global memory are almost completely uncoalesced (depending on numberAircraft).

The SMX executes warps along the x-dimension (threadIdx.x), 32 threads per warp. Within a warp, threadIdx.y is fixed while threadIdx.x varies, so consecutive lanes get consecutive indiceMaster values and your localIdx values are spaced numberAircraft apart, e.g. 0, 32, 64, 96, 128, … for numberAircraft = 32. That is not coalesced.
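You can see the stride with a quick host-side sketch like this one (lane stands for threadIdx.x; I fix blockIdx.x = 0 and indiceSlave = 0 for illustration):

#include <cstdio>

int main()
{
    const int numberAircraft = 32;
    // One warp: threadIdx.y is fixed, threadIdx.x runs over the 32 lanes.
    for (int lane = 0; lane < 32; ++lane) {
        int indiceMaster = lane;                  // blockDim.x * 0 + lane
        int indiceSlave  = 0;                     // constant within the warp
        int localIdx = indiceMaster * numberAircraft + indiceSlave;
        printf("lane %2d -> deviceArrayC[%4d]\n", lane, localIdx);
    }
    // Prints 0, 32, 64, ... : each lane falls in a different 128-byte segment,
    // so one warp write turns into 32 separate memory transactions.
    return 0;
}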

Can you try changing to:

int indiceSlave  = blockDim.x * blockIdx.x + threadIdx.x;
int indiceMaster = blockDim.y * blockIdx.y + threadIdx.y;
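// Now threadIdx.y (hence indiceMaster) is constant within a warp while
// indiceSlave runs over the 32 lanes, so localIdx = indiceMaster *
// numberAircraft + indiceSlave addresses 32 consecutive ints: one
// 128-byte transaction per warp instead of 32.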

This will at least make

deviceArrayC[localIdx] = -1;

coalesce properly.

Now

deviceArrayC[localIdx * 120 + indiceSample] = tmp;

will still be an issue with this modification.

Could you modify this to output in a transposed manner? (Note the transposed row stride should be the total number of pairs, numberAircraft * numberAircraft, rather than 120.)

deviceArrayC[indiceSample * numberAircraft * numberAircraft + localIdx] = tmp;

and then add another kernel to transpose the data back into the correct layout? A sketch of such a transpose is below.
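The usual tiled shared-memory pattern should work for the transpose. Here is a sketch (transposeTile and TILE_DIM are my names; in would be the transposed array the kernel wrote, viewed as a height x width row-major matrix with height = 120 samples and width = numberAircraft * numberAircraft pairs):

#define TILE_DIM 32

// Writes out as the (width x height) transpose of the (height x width)
// row-major matrix in. Both the load and the store are coalesced; the +1
// padding avoids shared-memory bank conflicts.
__global__ void transposeTile(const int *in, int *out, int width, int height)
{
    __shared__ int tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap the block coordinates so the store is also contiguous in x.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

You would launch it with dim3 block(TILE_DIM, TILE_DIM) and a grid of (width + TILE_DIM - 1) / TILE_DIM by (height + TILE_DIM - 1) / TILE_DIM blocks.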

Let me know how that works out. :-)