While testing our code on a Solaris 10 box with a Tesla C870 GPU, we found that our kernel executes asynchronously even though we have not specified it to do so.
Our understanding was that, unless otherwise specified, the kernel and the host code run synchronously.
The code snippet is as follows:
// Computing covariance on the GPU
dim3 cv_grid(SYMBOL_COUNT, SYMBOL_COUNT, 1);
dim3 cv_block(1, 1, 1);

CUT_SAFE_CALL(cutResetTimer(timer));
CUT_SAFE_CALL(cutStartTimer(timer));
cv_kernel<<<cv_grid, cv_block>>>(sym_data, avg_data, cv_data);
// Check if the kernel launch generated an error
CUT_CHECK_ERROR("Kernel Execution Failed!");
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Time taken by GPU to compute covariance is: %f (ms)\n", cutGetTimerValue(timer));

// Copy the covariance result from device to host
CUT_SAFE_CALL(cutResetTimer(timer));
CUT_SAFE_CALL(cutStartTimer(timer));
CUDA_SAFE_CALL(cudaMemcpy(h_cv_data, cv_data, cv_mem_size, cudaMemcpyDeviceToHost));
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Result transfer time is: %f (ms)\n", cutGetTimerValue(timer));
CUT_SAFE_CALL(cutDeleteTimer(timer));

// Clean up memory
CUDA_SAFE_CALL(cudaFreeHost(h_sym_data));
free(h_avg_data);
CUDA_SAFE_CALL(cudaFreeHost(h_cv_data));
CUDA_SAFE_CALL(cudaFree(sym_data));
CUDA_SAFE_CALL(cudaFree(avg_data));
CUDA_SAFE_CALL(cudaFree(cv_data));
After careful analysis of this code, we realised that the first timer gives us only the time taken to launch the kernel, while the result-transfer time actually includes the kernel's processing time. This is how we reached that conclusion: when we commented out the cudaMemcpy line and its timers, the first timer reported 0.008 ms for the kernel, yet the entire program took more than 4 minutes, when it should not have taken more than a second. On the other hand, after commenting out the kernel call, the program finished almost immediately. We also observed that main waits for the kernel to complete execution before exiting, even though the host code has already run to completion.
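To test this hypothesis, one way would be to force the host to wait for the kernel before stopping the timer. Here is a minimal sketch, assuming cudaThreadSynchronize() from the runtime API blocks until all previously launched device work has completed:

CUT_SAFE_CALL(cutResetTimer(timer));
CUT_SAFE_CALL(cutStartTimer(timer));
cv_kernel<<<cv_grid, cv_block>>>(sym_data, avg_data, cv_data);
CUT_CHECK_ERROR("Kernel Execution Failed!");
// Block the host until the kernel has finished, so the timer
// measures the full kernel execution rather than just the launch.
CUDA_SAFE_CALL(cudaThreadSynchronize());
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Kernel time (with sync): %f (ms)\n", cutGetTimerValue(timer));

If kernel launches really are asynchronous, this synchronized measurement should report the true kernel time instead of 0.008 ms.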
I do know that CUDA support on Solaris is a work in progress. But is this expected behaviour on other platforms as well? Or should the host and device code run synchronously? If the latter, I think this could be a bug in the CUDA port for Solaris.
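For completeness, another way to time the device work independently of the cutil host timers would be CUDA events. The following is only a sketch, assuming the cudaEvent_* calls of the runtime API work as documented on the Solaris port:

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
CUDA_SAFE_CALL(cudaEventCreate(&start));
CUDA_SAFE_CALL(cudaEventCreate(&stop));

CUDA_SAFE_CALL(cudaEventRecord(start, 0));  // record on the default stream
cv_kernel<<<cv_grid, cv_block>>>(sym_data, avg_data, cv_data);
CUDA_SAFE_CALL(cudaEventRecord(stop, 0));
CUDA_SAFE_CALL(cudaEventSynchronize(stop)); // wait for the kernel and stop event to complete
CUDA_SAFE_CALL(cudaEventElapsedTime(&elapsed_ms, start, stop));
printf("Kernel time measured with events: %f (ms)\n", elapsed_ms);

CUDA_SAFE_CALL(cudaEventDestroy(start));
CUDA_SAFE_CALL(cudaEventDestroy(stop));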
I would appreciate your insight.
Thanks,
Shweta