Performance problem using comparator

The following CUDA code is executed in more than 2 seconds (2260.611084 ms):

global void computeDetection(int numberAircraft, int nbSampleByAircraft, float distRefMin, int altitudeRefMin)
{
int indiceMaster = blockDim.x * blockIdx.x + threadIdx.x;
int indiceSlave = blockDim.y * blockIdx.y + threadIdx.y;
int localIdx = indiceMaster * numberAircraft + indiceSlave;

if (indiceMaster <= indiceSlave)
	return;

	if (indiceMaster > numberAircraft)
		return;

deviceArrayC[localIdx] = -1;
int tmp = 10000;
for (int indSample = 0; indSample < 120; indSample++)
{
	for (int indVolumeMaster = 0; indVolumeMaster < 364; indVolumeMaster++)
	{
		for (int indVolumeSlave = 0; indVolumeSlave < 364; indVolumeSlave++)
		{
			// tmp = deviceArrayC[localIdx] - indVolumeSlave;
			tmp = (indVolumeSlave < tmp) ? indVolumeSlave : tmp;
		}
	}		
}

deviceArrayC[localIdx] = tmp;

}

But if I replace “tmp = (indVolumeSlave < tmp) ? indVolumeSlave : tmp;” with “tmp = deviceArrayC[localIdx] - indVolumeSlave;” (switch commented line with the next one), then the code is executed in 0.052192 ms.

Can anybody explain me why I see such a difference of performance ?

Environment: Linux redhat 6.2, Cuda toolkit 5.0, last drivers downloaded for GTX 690.
And here my command line nvcc :
/usr/local/cuda-5.0/bin/nvcc -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 …

Thanks in advance.

The first thing I would do is add some cuda error checking after that kernel (cudeaGetLastError). That big a drop in runtime (for what looks to be a very time consuming kernel) screams out at me to be a launch failure.

Do you really want to do nearly 16 million comnparisions per thread?

Hello Tiomat,

I have isolate the pb … In fact, the code is more complex. Many tests are performed on my collision calcul without pb …
But with the following expression “tmp = (indVolumeSlave < tmp) ? indVolumeSlave : tmp;” , the performance are degraded …

I don’t know why …

Here, my dump execution (degraded case) :
nbAircraft : 500 - nbSampleByAircraft : 120 - distRefMin : 10000.000000 - altitudeRefMin : 1000.000000
Time of cudaMemcpy before to call kernel : 278 usec
Time of cudaMemset before to call kernel : 3 usec
blocksPerGrid : 16 - threadsPerBlock : 32
Time of kernel : 2241.072998 ms
An error Cuda1 is raised : no error -> cudaGetErrorString(cudaGetLastError())
An error Cuda2 is raised : no error -> cudaGetErrorString(cudaThreadSynchronize())

And nominal case (without tmp = (indVolumeSlave < tmp) ? indVolumeSlave : tmp;)

nbAircraft : 500 - nbSampleByAircraft : 120 - distRefMin : 10000.000000 - altitudeRefMin : 1000.000000
Time of cudaMemcpy before to call kernel : 277 usec
Time of cudaMemset before to call kernel : 3 usec
blocksPerGrid : 16 - threadsPerBlock : 32
Time of kernel : 0.039584 ms
An error Cuda1 is raised : no error -> cudaGetErrorString(cudaGetLastError())
An error Cuda2 is raised : no error -> cudaGetErrorString(cudaThreadSynchronize())

Thank you

Hmm well if there are no errors then the next thing that I would look at is whether the compiler is aggressively optimising out large chunks of your code.

If you look at your non-degraded case, your for loops are essentially redundant as it ends up being

tmp = deviceArrayC[localIdx] - 363;

If the compiler figures that out it may completely remove all your for loops giving you a kernel that is drastically different.

The two different ways I can see of testing this is to remove GPU optimisation on compilation step and re-run both performance tests (both tests will be slower, but the test is whether they are within an order of magnitude of each other), or to alter your non-degraded case to have dependancies on previous values. This should be easy to do by changing your non-degraded test case from

tmp = deviceArrayC[localIdx] - indVolumeSlave;

to something like

tmp = deviceArrayC[localIdx] - (tmp/1000000.0f);

I keep focussing on this as the 2000ms timing seems far more reasonable for a kernel like that than 50 microseconds. Kernels of that kind of timings are generally very simplistic.

Hello Tiomat,

I reduced the iteration number for a cuda core (loop 120 removed)…
I launch now the kernel with 3D Grid

the Cuda core performs less iteration …

here my result :
blocksPerGrid (16,16,1) and threadsPerBlock (32,32,1) -> 16 * 16 * 1 * 1024 = 262144 Threads, KernelTime -> 20 ms
blocksPerGrid (16,16,2) and threadsPerBlock (32,32,1) -> 16 * 16 * 2 * 1024 = 524288 Threads, KernelTime -> 40 ms
blocksPerGrid (16,16,3) and threadsPerBlock (32,32,1) -> 16 * 16 * 3 * 1024 = 786432 Threads, KernelTime -> 60 ms
blocksPerGrid (16,16,4) and threadsPerBlock (32,32,1) -> 16 * 16 * 4 * 1024 = 1048576 Threads, KernelTime -> 80 ms
….

I don’t understand these results, why this sequentiel treatment ?
What is this latency ? The blocks execution isn’t a parallel treatment ?

My GTX 690 has 1536 Cuda cores by GPU and 2 GPUs

here, global method :
global void computeDetection(int numberAircraft, int nbSampleByAircraft, float distRefMin, int altitudeRefMin)
{
int indiceSlave = blockDim.y * blockIdx.y + threadIdx.y;
int indiceMaster = blockDim.x * blockIdx.x + threadIdx.x;
int indiceSample = blockIdx.z;

int localIdx = indiceMaster * numberAircraft + indiceSlave;

if (indiceMaster <= indiceSlave)
	return;
	if (indiceMaster > numberAircraft)
		return;
deviceArrayC[localIdx] = -1;
int tmp = 10000;
for (int indVolumeMaster = 0; indVolumeMaster < 364; indVolumeMaster++)
{
	for (int indVolumeSlave = 0; indVolumeSlave < 364; indVolumeSlave++)
	{
		tmp = (tmp < indVolumeSlave) ?  tmp : indVolumeSlave;
		
	}
}
deviceArrayC[localIdx * 120 + indiceSample] = tmp;

}

Hi,

That is quite a simple answer. Basically your GPU cannot run infinite threads concurrently. The amount of work it can do in parallel is limited by the amount of hardware it has, which means that once you are above that amount you are still having to do work in series, but of chunks at a time. So what are you seeing now, doubling work needing to be done takes double the amount of time, is exactly what you would expect to see.

Think of it like this analogy. The GPU is like a bus, and the CPU is more like a car. It can do lots of work at once (carry lots of people) but is a bit slower than a car. In a certain amount of time it can carry more people to the same place, but only to the same place. In this analogy what you are trying to do is take a thousand people somewhere. They simply cannot fit in the bus in one trip, so it has to do multiple trips. When you double the number of people, you double the number of trips.

I hope that helps.