As far as I know, the CPU should get control back right after the kernel launch (I understand that we cannot know when the kernel itself completes).
My kernel takes almost 1.2 sec to finish processing on the GPU (I assume it won't take 1.2 sec just to launch the kernel),
and it seems that the kernel call only returns after execution has completed.
RenderFrame<<< dimBlocksPerGrid, dimThreadsPerBlock >>>(fpOPFrameGpu, nSlice, nMinrow, nMaxrow - nMinrow );
cutStopTimer( uiKernelTimer );
printf(" Kernel time %f \n", cutGetTimerValue( uiKernelTimer ));
Here the timing result is the same with or without cudaThreadSynchronize().
I'm using CUDA 1.1.
Thanks in advance.
Is this a derived version of CUT_CHECK_ERROR?
Because that macro has a cudaThreadSynchronize in it…
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
    }
}
Commenting it out doesn't make any difference.
Have you enabled profiling, or the environment variable that syncs after every kernel launch? Either one will implicitly sync after every kernel launch.
Is RenderFrame the first call you make to any CUDA function? If so, there is an implicit driver/GPU initialization, which takes a significant amount of time.
Are you calling this in a loop? Only ~100 async launches can be queued up in recent drivers (16 in older CUDA 1.1 drivers). After that you will get implicit syncs.
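To separate these effects, here is a minimal sketch that does a warm-up call first (so initialization is out of the way) and then measures how long the launch takes to return versus how long the kernel takes to finish. Everything in it is an assumption for illustration: busyKernel is a hypothetical stand-in for RenderFrame, the grid size and iteration count are arbitrary, and it uses cudaDeviceSynchronize(), the name newer toolkits use in place of cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for RenderFrame: it only keeps the GPU busy.
__global__ void busyKernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)i;
    for (int k = 0; k < iters; ++k)
        v = v * 1.0000001f + 0.5f;
    out[i] = v;  // store the result so the loop is not optimized away
}

// Host-side wall-clock milliseconds elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int blocks = 64, threads = 256;  // arbitrary sizes for the sketch
    float *dOut = NULL;

    // Warm-up: the first CUDA calls pay for driver/GPU initialization.
    if (cudaMalloc((void **)&dOut, blocks * threads * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    busyKernel<<<blocks, threads>>>(dOut, 1);
    cudaDeviceSynchronize();

    // Timed launch: read the clock once when the launch returns,
    // and again after waiting for the kernel to finish.
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads>>>(dOut, 1 << 20);
    double launchMs = msSince(t0);
    cudaDeviceSynchronize();
    double totalMs = msSince(t0);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Cuda error: %s.\n", cudaGetErrorString(err));
        return 1;
    }

    printf("launch returned after %f ms, kernel finished after %f ms\n",
           launchMs, totalMs);
    cudaFree(dOut);
    return 0;
}
```

If the two numbers come out nearly equal, something is forcing a sync on launch (profiling, a blocking-launch setting, or a full launch queue); if the first is much smaller, the launch is asynchronous as expected.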
Oh, yes… it seems I had inadvertently enabled profiling. I have set it to '0',
but it still seems to be blocking :( .
The kernel is launched after calling the cudatime, memcopy, and bindtexture functions.
Sometimes with profiling enabled, it “sticks” on even after you set the variable to 0. Try running the app after a clean boot.
I assumed that… I have done a clean boot… but it still blocks there. :S