Unexplained loss of performance when display is disabled

I have just started developing with CUDA and still have a lot to learn, but I desperately need help with a problem I cannot explain.

Here’s the situation:

As a first try with CUDA, I have developed a Game of Life. Everything works fine. However, when I tried to collect information on execution time and operations per second, something strange happened.

I am currently timing only the kernel execution, not the whole program. The strange part: if I disable (in the code) everything related to displaying the game and measure solely the kernel execution time, my performance drops by about 10%. When I leave all the display code in, performance is better, which is counter-intuitive…

Is there something I didn’t get about NVIDIA’s architecture?

The display is done by having the kernel update a VBO with the current data.

Any help appreciated.

On the first call to a CUDA function, the device is initialized. This initialization takes some time.

If you were only timing the kernel at first, and some other CUDA functions were called before it, then the device was initialized outside the timed region. But now that you have deleted some parts, the initialization may happen inside the region you are timing, and that can cause some delay.
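One common way to keep that one-time cost out of the measurement is to force initialization with a throwaway call before starting the timer; `cudaFree(0)` is the usual idiom. A minimal sketch (runtime API, error checking omitted; not taken from the original code):

```cuda
// The very first CUDA call pays the context-creation cost.
// Issue a cheap call once, before any timed region, so the
// timer afterwards sees only the kernel's own work.
cudaFree(0);              // forces device initialization

clock_t start = clock();  // now safe to start timing
```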

Thank you for this prompt reply. I completely understand what you are saying; however, there is something I still don’t understand.

Here is part of the complete Game of Life code (the version with the best performance):

void runCuda(GLuint vbo)
{
    // map OpenGL buffer object for writing from CUDA
    float4 *dptr;
    cudaGLMapBufferObject((void**)&dptr, vbo);

    clock_t start = clock();

    // execute the kernel
    dim3 block(BLOCK_SIZE, BLOCK_SIZE, 1);
    dim3 grid(mesh_width / block.x, mesh_height / block.y, 1);
    kernel<<<grid, block>>>(mesh_width, mesh_height, device_Source);

    clock_t stop = clock();
    tempsTot = tempsTot + stop - start;

    if (nbIter == nbMax) {
        tempsTot = tempsTot * 0.000001;
        printf("temps total %2f\n", tempsTot);
        double nbCelParSec = mesh_height * mesh_width * (nbMax + 1) / tempsTot;
        printf("nbCelParSec %2f\n", nbCelParSec);
    }

    cudaGLUnmapBufferObject(vbo);
}

So this basically just starts and stops a timer to sum the execution time of 1000 iterations of the kernel. What I don’t understand is that if I comment out the lines “cudaGLMapBufferObject((void**)&dptr, vbo);” and “cudaGLUnmapBufferObject(vbo);”, I lose half the performance.

Any ideas?

Kernel invocations are asynchronous; you might want to add a cudaThreadSynchronize() after the kernel call. Then your measurement should be correct.
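For what it’s worth, a sketch of the alternative approach using CUDA events, which time the GPU directly and account for the asynchronous launch (runtime API, error checking omitted; variable names are illustrative):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, block>>>(mesh_width, mesh_height, device_Source);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Unlike clock() on the host, the events are recorded in the same stream as the kernel, so the measured interval covers exactly the kernel’s execution.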

Would you be willing to share your CUDA kernel?

I may also try some Game of Life implementation, just for fun. But I want the data to be packed more tightly, for example using 1 bit or one nibble (4 bits) per cell, so I can run a giant grid, say 16384x16384 cells, in the RAM of the card. Graphical output at interactive rates is not a priority for me.
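To make the 1-bit-per-cell idea concrete, here is a host-side C sketch of the packed representation and one generation step on a toroidal grid; a CUDA kernel would use the same bit layout, one word per thread. All names (`get_cell`, `set_cell`, `step`, the 16x16 size) are illustrative, not from any actual implementation:

```c
#include <stdint.h>

/* W x H cells, 1 bit per cell, packed row-major into a byte array
   of W*H/8 bytes. The grid wraps around (toroidal topology). */
#define W 16
#define H 16

static int get_cell(const uint8_t *g, int x, int y) {
    int i = ((y + H) % H) * W + ((x + W) % W);  /* wrap coordinates */
    return (g[i >> 3] >> (i & 7)) & 1;
}

static void set_cell(uint8_t *g, int x, int y, int v) {
    int i = y * W + x;
    if (v) g[i >> 3] |= (uint8_t)(1u << (i & 7));
    else   g[i >> 3] &= (uint8_t)~(1u << (i & 7));
}

/* One Game of Life generation: read src, write dst. */
static void step(uint8_t *dst, const uint8_t *src) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int n = 0;  /* live neighbours */
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if (dx || dy) n += get_cell(src, x + dx, y + dy);
            int alive = get_cell(src, x, y);
            set_cell(dst, x, y, n == 3 || (alive && n == 2));
        }
}
```

A quick sanity check is a blinker: three horizontal live cells become three vertical ones after one step. At 1 bit per cell, a 16384x16384 grid needs only 32 MB per buffer, so two buffers fit comfortably in card RAM.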

This would make for a nice contest. Who provides the fastest Game of Life implementation (iterations/sec) on a 16kx16k grid. Total score is divided by the peak memory consumption of the implementation on the card. ;)

Oh, I remember these nice days on AMD 5x86 @ 133Mhz with my 1024x1024 implementation :-)

Have you thought about reversed Life? I.e., finding a smallest previous step :-)