cudaMemcpy and compiler optimizations

Hi there

I am new to CUDA, and I need some help understanding my results. Here goes…
I have written a small benchmark framework for testing CUDA (compiled with g++) that calls my CUDA-related functions responsible for executing the kernels (compiled with nvcc).

In my CUDA code I execute the kernels using something like this:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(m / dimBlock.x, n / dimBlock.y); // assumes m and n are multiples of BLOCK_SIZE
name##Kernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);

When the above kernel call returns, control passes back to the benchmark framework (compiled with g++), which later calls a function (compiled with nvcc) that verifies the result of the kernel operation and frees the memory allocated on the device. However, if I do not copy the result back to the host in the function that executes the kernel, I see a massive performance drop in the verification function. Simply changing the above code to:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(m / dimBlock.x, n / dimBlock.y);
name##Kernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);
float *Ch = (float *)calloc(m * n, sizeof(float));
cudaMemcpy(Ch, C, m * n * sizeof(float), cudaMemcpyDeviceToHost);
free(Ch);

fixes the problem completely. My best guess is that the compiler fails to make certain optimizations because I let control return to the code compiled by g++, but can anyone confirm this?

Thanks

Kind regards, Toke

cudaMemcpy implies a cudaThreadSynchronize before it. This isn't an optimization question or anything like that: you're not measuring the real performance of your kernel, because kernel launches are asynchronous. Without the copy (or some other synchronization point), the kernel is still running when your verification function starts, so the kernel's execution time gets charged to whatever touches the device next.
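If you want to time the kernel on its own, bracket the launch with CUDA events instead of relying on an incidental copy. A minimal sketch, where myKernel and its arguments are placeholders standing in for your own launch:

#include <cstdio>

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0); // queue a start marker on the stream
myKernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);
cudaEventRecord(stop, 0); // queue a stop marker right after the kernel
cudaEventSynchronize(stop); // block the host until the kernel and marker finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

That way the measured interval covers only the kernel, and the cost no longer shows up in your verification function.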

Thanks, this actually solved my problem