I found some free time and wrote a library to grant my own wish. Here is the result: gpu_trace.h!
With this library you can mark your kernels as traceable and output values (along with a tag and a message) directly from within a kernel in device mode. Here is a sample from the included test.cu:
__global__ void test __traceable__ (int dummy)
{
    int x = threadIdx.x;
    __trace("Test", "int", x);
    __trace("Test", "unsigned int", static_cast <unsigned int> (x));
    __trace("Test", "long int", static_cast <long int> (x));
    __trace("Test", "unsigned long int", static_cast <unsigned long int> (x));
    __trace("Test", "float", static_cast <float> (x));
    __trace("Test", "double", static_cast <double> (x));
    for (int i = 0; i < x; i++)
        __trace_exp("Loop", 3 + 2 * i);
}
Call the above kernel as:
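int main()
{
    INITIALIZE_TRACE_DATA();
    // Launch 10 blocks of 10 threads; the argument 0 is the dummy parameter.
    test <<<10, 10>>> __traceable_call__ (0);
    // Errors in the launch itself are reported here.
    cudaError_t ErrorCode = cudaGetLastError();
    if (ErrorCode != cudaSuccess)
        printf("*** Kernel did not launch, %s ***\n", cudaGetErrorString(ErrorCode));
    // Wait for the kernel to finish; errors raised during execution surface here.
    // (Later CUDA releases deprecate cudaThreadSynchronize() in favor of cudaDeviceSynchronize().)
    ErrorCode = cudaThreadSynchronize();
    if (ErrorCode != cudaSuccess)
        printf("*** Kernel exited while executing, %s ***\n", cudaGetErrorString(ErrorCode));
    FINALIZE_TRACE_DATA();
    PRINT_TRACE_DATA(stdout);
    return 0;
}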
Sample output:
GPU Trace: collected trace data:
== Thread 0: 7 trace packets ================================
[Test ][int ][int: 1]
[Test ][unsigned int ][unsigned int: 1]
[Test ][long int ][long int: 1]
[Test ][unsigned long i][unsigned long int: 1]
[Test ][float ][float: 1]
[Test ][double ][double: 1]
[Loop ][3 + 2 * i ][int: 3]
== Thread 1: 8 trace packets ================================
[Test ][int ][int: 2]
[Test ][unsigned int ][unsigned int: 2]
[Test ][long int ][long int: 2]
[Test ][unsigned long i][unsigned long int: 2]
[Test ][float ][float: 2]
[Test ][double ][double: 2]
[Loop ][3 + 2 * i ][int: 3]
[Loop ][3 + 2 * i ][int: 5]
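The "Loop" lines in the output show the text of the expression as the message. One way a macro can do that is the preprocessor's stringizing operator. The following is only a host-side C++ sketch of that idea, not code taken from gpu_trace.h: TRACE_EXP, record, Packet, g_packets, run_loop and print_packets are hypothetical names, and the packets go into a plain in-memory vector rather than into per-thread GPU storage.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical host-side stand-in for one trace packet: tag, message, value.
struct Packet {
    std::string tag;
    std::string message;
    int value;
};

// Hypothetical in-memory log (the real library collects packets per GPU thread).
static std::vector<Packet> g_packets;

static void record(const char* tag, const char* message, int value)
{
    g_packets.push_back(Packet{tag, message, value});
}

// #exp turns the source text of the expression into a string literal,
// so one argument supplies both the message and the traced value.
#define TRACE_EXP(tag, exp) record((tag), #exp, (exp))

// Same loop shape as in the sample kernel, run on the host for a given x.
static void run_loop(int x)
{
    for (int i = 0; i < x; i++)
        TRACE_EXP("Loop", 3 + 2 * i);
}

// Prints the collected packets in a layout similar to the sample output.
static void print_packets()
{
    for (const Packet& p : g_packets)
        std::printf("[%s][%s][int: %d]\n",
                    p.tag.c_str(), p.message.c_str(), p.value);
}
```

Calling run_loop(2) and then print_packets() records and prints two packets whose message is the string "3 + 2 * i" and whose values are 3 and 5.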
I have tried to keep the macros as close as possible to CUDA conventions such as __global__ and __device__. If you compile the code with -D__ENABLE_TRACE__, you will see the trace data; without it, the program runs silently, just like the original version. This means you never need to strip out the trace-specific extensions, even in final code. Your kernel must have at least one parameter. If it does not, pass it a dummy argument, as in the example above. Refer to the included test.cu for more information.
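For readers wondering how a compile-time switch and the one-parameter requirement can fit together, here is a host-only C++ sketch of one possible mechanism. It is a guess for illustration, not the contents of gpu_trace.h: TRACEABLE, TRACEABLE_CALL, TRACE, TraceBuffer and the ENABLE_TRACE_SKETCH flag are all hypothetical names (the sketch avoids double-underscore names, which are reserved in C++).

```cpp
#include <cstdio>

// Hypothetical trace storage; nothing here is taken from gpu_trace.h.
struct TraceBuffer {
    int count = 0;   // number of values traced so far
};

static TraceBuffer g_trace_buf;

#ifdef ENABLE_TRACE_SKETCH
// Tracing on: prepend a hidden buffer parameter to the declared parameter list.
// An empty list would expand to "(TraceBuffer* trace_buf, )", which does not
// compile. That is one possible reason for requiring at least one parameter.
#define TRACEABLE(...)      (TraceBuffer* trace_buf, __VA_ARGS__)
#define TRACEABLE_CALL(...) (&g_trace_buf, __VA_ARGS__)
#define TRACE(value)        (trace_buf->count++, std::printf("trace: %d\n", (value)))
#else
// Tracing off: the macros vanish and the function keeps its original signature.
#define TRACEABLE(...)      (__VA_ARGS__)
#define TRACEABLE_CALL(...) (__VA_ARGS__)
#define TRACE(value)        ((void)0)
#endif

// Declared in the same style as the sample kernel: name, macro, parameters.
int twice TRACEABLE(int x)
{
    TRACE(x);
    return 2 * x;
}

// Called in the same style as the sample launch.
int call_twice(int x)
{
    return twice TRACEABLE_CALL(x);
}
```

Built without the flag, call_twice behaves like a plain function and nothing is traced. Built with -DENABLE_TRACE_SKETCH, the same source passes the hidden buffer along and prints each traced value.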
Some of you might think that using a debugger is much better. Well, maybe! In my opinion, at least in less complex cases, it is much easier to use this library, which takes almost no time. I am sure NVIDIA can do this much better, and I strongly encourage them to do so.
Please let me know your opinions and experiences. Enhanced versions of this library are also welcome!
Thank you!
-Edit:
Remember to include stdio.h before gpu_trace.h for this version. I will add the include in the next version.
-Edit:
Library updated to version 0.02. Added __trace_exp(tag, exp) and a conditional #include <stdio.h>.
gpu_trace.tar.gz (2.76 KB)