Hi there
I am new to CUDA, and I need some help to understand my results. Here goes…
I have written a small benchmark framework for testing CUDA (compiled with g++) that calls my CUDA-related functions, which are responsible for executing the kernels (compiled with nvcc).
In my CUDA code I execute the kernels using something like this:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(m / dimBlock.x, n / dimBlock.y);
name##Kernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);
When the above kernel call returns, control returns to the benchmark framework (compiled with g++), which later calls a function (compiled with nvcc) that verifies the result of the kernel operation and frees the memory allocated on the device. However, if I do not copy the result back to the host in the function that executes the kernel, I see a massive performance drop in my verification function. By simply changing the above code to:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(m / dimBlock.x, n / dimBlock.y);
name##Kernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);
float *Ch = (float *)calloc(mn, sizeof(float));
cudaMemcpy(Ch, C, mn*sizeof(float), cudaMemcpyDeviceToHost);
free(Ch);
the performance is perfect. My best guess is that the compiler fails to make certain optimizations because I let control return to the code compiled by g++, but can anyone confirm this?
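For reference, here is roughly what my nvcc-compiled wrapper does, as a simplified, self-contained sketch rather than my exact code. The kernel name `addKernel`, the wrapper name `runAdd`, and the element-wise addition body are placeholders standing in for my real macro-generated kernels, and error checking is omitted:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Placeholder kernel: element-wise C = A + B over an m x n matrix.
__global__ void addKernel(const float *A, const float *B, float *C,
                          int m, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < m && row < n)
        C[row * m + col] = A[row * m + col] + B[row * m + col];
}

// Host-side wrapper, compiled with nvcc and called from the
// g++-compiled benchmark framework. A, B, C are device pointers.
extern "C" void runAdd(const float *A, const float *B, float *C,
                       int m, int n)
{
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    // Assumes m and n are exact multiples of BLOCK_SIZE,
    // since the integer division truncates otherwise.
    dim3 dimGrid(m / dimBlock.x, n / dimBlock.y);
    addKernel<<<dimGrid, dimBlock>>>(A, B, C, m, n);

    // The extra copy back to the host that restores the
    // performance of the later verification function:
    size_t mn = (size_t)m * (size_t)n;
    float *Ch = (float *)calloc(mn, sizeof(float));
    cudaMemcpy(Ch, C, mn * sizeof(float), cudaMemcpyDeviceToHost);
    free(Ch);
}
```

With the `cudaMemcpy`/`free` lines removed, this is the fast-launch, slow-verification variant I described above.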
Thanks
Kind regards, Toke