Suppose we have some data (in array a_d) in global memory that resulted from some computation on the GPU. To print this data on the screen, we need to call cudaMemcpy(…DeviceToHost) and then use a for loop to print the contents of the array. Is it possible to have a direct function cudaPrintf(…DeviceToHost) that basically combines cudaMemcpy and the usual C printf() function?
How can we do this?
So far there is no such support…
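Until something like that exists, the two steps from the question can be wrapped in a small host-side helper. This is only a minimal sketch under some assumptions: the array holds floats, and printDeviceArray is a made-up name for illustration, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): copy n floats from device memory
// into a temporary host buffer, then print them with ordinary printf.
cudaError_t printDeviceArray(const float *a_d, size_t n)
{
    if (n == 0)
        return cudaSuccess;                 // nothing to copy or print

    float *a_h = (float *)malloc(n * sizeof(float));
    if (a_h == NULL)
        return cudaErrorMemoryAllocation;   // host allocation failed

    // Same call as in the question: device -> host copy.
    cudaError_t err = cudaMemcpy(a_h, a_d, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        for (size_t i = 0; i < n; ++i)
            printf("a_d[%lu] = %f\n", (unsigned long)i, a_h[i]);
    }

    free(a_h);
    return err;                             // caller can inspect the status
}
```

You would call it from host code after the kernel launch, e.g. printDeviceArray(a_d, N), and check the returned status. For other element types the helper would need a different format string.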
However, if you really want to debug your code, use device emulation mode (the -deviceemu option for nvcc). In this mode the code runs on the host, so you can use ordinary ‘fprintf’ statements inside device functions and kernels. Once you are sure the code is OK, compile it without -deviceemu.
Now, here’s a small ‘#define’ trick to do this:
In every device function, use the following group of statements:
#ifdef DEVICE_EMULATION
fprintf(stdout, "This is my debug statement inside this function!\n");
#endif
And whenever you pass the -deviceemu option to nvcc, also pass -DDEVICE_EMULATION so that the macro is defined. You are done! :)
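If repeating the #ifdef/#endif pair in every function gets tedious, the same idea can be folded into one macro. A rough sketch: DEBUG_PRINT and square are names I made up for illustration, and the __CUDACC__ guard is only there so the snippet can also be tried with a plain C++ compiler.

```cuda
#include <cstdio>

// When built with a plain C++ compiler instead of nvcc, make the CUDA
// qualifier vanish so the sketch can also be tried on the host.
#ifndef __CUDACC__
#define __device__
#endif

// DEBUG_PRINT is a hypothetical name. It expands to fprintf only when
// DEVICE_EMULATION is defined, and to a no-op otherwise.
#ifdef DEVICE_EMULATION
#define DEBUG_PRINT(...) fprintf(stdout, __VA_ARGS__)
#else
#define DEBUG_PRINT(...) ((void)0)
#endif

// Example device function (also hypothetical) with one debug statement.
__device__ float square(float x)
{
    DEBUG_PRINT("square: x = %f\n", x);
    return x * x;
}
```

Built with -deviceemu -DDEVICE_EMULATION, the debug line prints; built without those options, the macro expands to nothing and the device code contains no fprintf call at all.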