cudaMemcpy and compiler optimizations

Hi there

I am new to CUDA, and I need some help understanding my results. Here goes…
I have written a small benchmark framework for testing CUDA (compiled with g++) that calls my CUDA-related functions responsible for executing the kernels (compiled with nvcc).

In my CUDA code I execute the kernels using something like this:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(m / dimBlock.x, n / dimBlock.y); // assumes m and n are multiples of BLOCK_SIZE
name##Kernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);

When the above kernel call returns, control passes back to the benchmark framework (compiled with g++), which later calls a function (compiled with nvcc) that verifies the result of the kernel operation and frees the memory allocated on the device. However, if I do not copy the result back to the host in the function that executes the kernel, I see a massive performance drop in the verification function. Simply changing the above code to:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(m / dimBlock.x, n / dimBlock.y);
name##Kernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);
float *Ch = (float *)calloc(m * n, sizeof(float));
cudaMemcpy(Ch, C, m * n * sizeof(float), cudaMemcpyDeviceToHost);
free(Ch);

fixes the problem completely. My best guess is that the compiler fails to make certain optimizations because I let control return to the code compiled by g++, but can anyone confirm this?

Thanks

Kind regards, Toke

cudaMemcpy implies a cudaThreadSynchronize before it. This isn't an optimization question or anything like that: you're not measuring the real performance of your kernel, because kernel launches are asynchronous. Without the copy (or some other synchronization point), the kernel is still running when your verification function starts, so the kernel's execution time gets charged to whatever touches the device next.
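If you want to time the kernel on its own, bracket the launch with CUDA events instead of relying on an incidental copy. A minimal sketch, where myKernel and its arguments are placeholders standing in for your own launch:

#include <cstdio>

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0); // queue a start marker on the stream
myKernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);
cudaEventRecord(stop, 0); // queue a stop marker right after the kernel
cudaEventSynchronize(stop); // block the host until the kernel and marker finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

That way the measured interval covers only the kernel, and the cost no longer shows up in your verification function.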

Thanks, this actually solved my problem