Default asynchronous kernel execution on Solaris 10: bug or normal behavior?

While testing our code on a Solaris 10 box with a Tesla C870 GPU, we found that our kernel executes asynchronously even though we did not specify it to do so.
Our understanding was that, unless specified otherwise, the kernel and the host code run synchronously.

The code snippet is as follows:

// Computing covariance on the GPU
dim3 cv_block(1,1,1);
cv_kernel<<<cv_grid, cv_block>>>(sym_data, avg_data, cv_data);

// check if kernel execution generated an error
CUT_CHECK_ERROR("Kernel Execution Failed!");
//copy covariance result from device to host

CUT_SAFE_CALL( cutStopTimer( timer));
printf("Time taken by GPU to compute covariance is: %f (ms)\n", cutGetTimerValue( timer));

CUDA_SAFE_CALL( cudaMemcpy(h_cv_data, cv_data, cv_mem_size,
                           cudaMemcpyDeviceToHost) );
CUT_SAFE_CALL( cutStopTimer( timer));
printf("Result transfer time is: %f (ms)\n", cutGetTimerValue( timer));

CUT_SAFE_CALL( cutDeleteTimer( timer));
// cleanup memory

After careful analysis of this code, we realised that the first timer gives us the time taken by the system to launch the kernel, while the host transfer time actually includes the kernel processing time. This is how we reached that conclusion: on commenting out the cudaMemcpy line and its timers, the first timer for the kernel showed 0.008 ms, but in spite of that the entire program took more than 4 minutes, when it should have taken no more than a second. On the other hand, after commenting out the kernel call, the program ran in no time. We observed that main waits for the kernel to complete execution and only then exits, even though the host code has completely executed.

I do know that CUDA support on Solaris is a work in progress. But is this the expected behaviour on other platforms,
or should the host and device code run synchronously? In the latter case, I think this could be a bug in the CUDA port for Solaris.

Appreciate your insight.


I was under the impression that kernel launches are asynchronous and return control immediately to the host thread. Calling cudaThreadSynchronize() afterwards waits for all GPU tasks to complete.

Thanks for your response. You are correct. After a bit of reading, I realized that the launch is asynchronous by default.

So the code is working as it should.
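For anyone hitting the same confusion: since the launch returns to the host immediately, a host-side timer must be preceded by an explicit synchronization to measure the kernel itself. Below is a minimal sketch using CUDA events instead of the cutil timers from the snippet above; the trivial kernel body and the launch dimensions are placeholders, not the original covariance kernel, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real covariance kernel.
__global__ void dummy_kernel(float *out) {
    out[threadIdx.x] = (float)threadIdx.x;
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy_kernel<<<1, 32>>>(d_out);   // returns immediately; kernel runs asynchronously
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // block the host until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %f (ms)\n", ms);

    // Equivalent pattern with a host timer:
    //   start host timer
    //   kernel<<<grid, block>>>(...);
    //   cudaThreadSynchronize();      // wait before stopping the timer
    //   stop host timer

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Without the synchronization call, the stopped timer measures only the launch overhead (the ~0.008 ms seen above), not the kernel's actual run time.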

Hi Shweta,
Can you tell me how you got CUDA running on a Solaris system? I tried looking everywhere on Google but was unable to find a link to the CUDA libraries for Solaris.
I’d really appreciate any links/help.