GPU Trace library Easily trace vaules from your kernels in device mode!

Finding some free time, I tried to write a library to answer my own wish! Here is the result, gpu_trace.h!

Using this library, you can define your kernels to be traceable, and directly output values (as well as a tag and a message) from within your kernel in device mode. Here is a sample from included test.cu:

__global__ void test __traceable__ (int dummy)

{

	int x = threadIdx.x;

	__trace("Test", "int", x);

	__trace("Test", "unsigned int", static_cast <unsigned int> (x));

	__trace("Test", "long int", static_cast <long int> (x));

	__trace("Test", "unsigned long int", static_cast <unsigned long int> (x));

	__trace("Test", "float", static_cast <float> (x));

	__trace("Test", "double", static_cast <double> (x));

	

	for (int i = 0; i < x; i++)

		__trace_exp("Loop", 3 + 2 * i);

}

Call the above kernel as:

int main()

{

	INITIALIZE_TRACE_DATA();

		

	test <<<10, 10>>> __traceable_call__ (0);

	

	cudaError_t ErrorCode = cudaGetLastError();

	if (ErrorCode != cudaSuccess)

		printf("*** Kernel did not launch, %s ***\n", cudaGetErrorString(ErrorCode));

	ErrorCode = cudaThreadSynchronize();

	if (ErrorCode != cudaSuccess)

		printf("*** Kernel exited while executing, %s ***\n", cudaGetErrorString(ErrorCode));

	

	FINALIZE_TRACE_DATA();

	PRINT_TRACE_DATA(stdout);

	

	return 0;

}

Sample output:

GPU Trace: collected trace data:

== Thread 0: 7 trace packets ================================

	[Test   ][int			][int: 1]

	[Test   ][unsigned int   ][unsigned int: 1]

	[Test   ][long int	   ][long int: 1]

	[Test   ][unsigned long i][unsigned long int: 1]

	[Test   ][float		  ][float: 1]

	[Test   ][double		 ][double: 1]

	[Loop   ][3 + 2 * i	  ][int: 3]

== Thread 1: 8 trace packets ================================

	[Test   ][int			][int: 2]

	[Test   ][unsigned int   ][unsigned int: 2]

	[Test   ][long int	   ][long int: 2]

	[Test   ][unsigned long i][unsigned long int: 2]

	[Test   ][float		  ][float: 2]

	[Test   ][double		 ][double: 2]

	[Loop   ][3 + 2 * i	  ][int: 3]

	[Loop   ][3 + 2 * i	  ][int: 5]

I have tried the macros to be as similar as possible to CUDA conventions, like global and device. If you compile the code with -D__ENABLE_TRACE__, you will see the trace data, and if not, the program works silently as the original version. This eliminates the need to remove trace specific extensions, even in final code. You need to have at least one parameter in your kernel. If this is not the case with your kernel, pass a dummy argument to it, as in the above example. Refer to included test.cu for more information.

Some of you might think that using a debugger is much better. Well, maybe! In my opinion, at least in less complex cases, it is much easier to use this library which takes almost no time. I am sure nVidia can do this much better, and I strongly suggest them to do so.

Please let me know your opinions and experiences. Enhanced versions of this library are also welcome!

Thank you!

-Edit:

Remember to include stdio.h before gpu_trace.h for this version. Will add it in the next version.

-Edit:

Library updated to version 0.02. Added __trace_exp(tag, exp) and a conditional #include <stdio.h>.
gpu_trace.tar.gz (2.76 KB)

Cool, just verified that it works well for CUDA 2.1 in OSX. Of course I get junk trace data for the double value because my 8900M doesnt support double precision. I like that you have individual traces for each thread. Have you looked at how this impacts performance, memory consumption, etc? When I do not compile with ENABLE_TRACE does it remove the code entirely, or is there anything residual that will be left to consume registers and slow things down?

Also, how about the ability to change the trace log view to show all values from a single trace call together (i.e. see values of an array grouped together) rather than the trace by thread.

-Evan

Thank you for your feedback! I never expected the first results to be from OSX!

There should not be any performance impact when compiled without -D__ENABLE_TRACE__, as all the code is removed using ifdef’s. In trace mode, I doubt anyone would care for this, while it does not prevent launching the kernel.

About your suggestion, if you can show a good way to do this (i.e. how to find each piece in different kernel trace data), I will implement it. Also, the trace data is all there, and you are not bound to PRINT_TRACE_DATA(). You may write your own post-processor.

I was also thinking on a __trace_expression(tag, expression), which converts the expression to string using ‘#’ preprocessor directive, and passes them all to __trace. Any opinions?

I updated the library to ver. 0.02. Just a few minor changes (e.g. adding #include <stdio.h>) and a new __trace_exp(tag, exp). See included test.cu for usage sample. I will update the first post to reflect the changes.

Would somebody please test it with 2.2 and 2.3 and report the results?

I will test it :-)

It’s a great addition that I would like to have to debug my code! Thanks!

I am very glad that you find it useful. Thanks for your feedback and tests!

It would be great if nVidia guys reading this would also comment on it.

Terrific stuff! Here’s the output from a GTX-260: SDK 2.3, driver 190.18.3, OpenSuSE 11.1 64-bit

paehler@nvidia> nvcc --gpu-architecture sm_13 -I . -o tgv test.cu -D__ENABLE_TRACE__

paehler@nvidia> tgv

GPU Trace: collected trace data:

== Thread 0: 7 trace packets ================================

[Test   ][int            ][int: 1]

[Test   ][unsigned int   ][unsigned int: 1]

[Test   ][long int       ][long int: 1]

[Test   ][unsigned long i][unsigned long int: 1]

[Test   ][float          ][float: 1]

[Test   ][double         ][double: 1]

[Loop   ][3 + 2 * i      ][int: 3]

== Thread 1: 8 trace packets ================================

[Test   ][int            ][int: 2]

[Test   ][unsigned int   ][unsigned int: 2]

[Test   ][long int       ][long int: 2]

[Test   ][unsigned long i][unsigned long int: 2]

[Test   ][float          ][float: 2]

[Test   ][double         ][double: 2]

[Loop   ][3 + 2 * i      ][int: 3]

[Loop   ][3 + 2 * i      ][int: 5]

Good work!

GTX 260, winxp pro 32bit, cuda v2.3, driver 190.38

[codebox]GPU Trace: collected trace data:

== Thread 0: 7 trace packets ================================

[Test   ][int            ][int: 1]

[Test   ][unsigned int   ][unsigned int: 1]

[Test   ][long int       ][long int: 1]

[Test   ][unsigned long i][unsigned long int: 1]

[Test   ][float          ][float: 1]

[Test   ][double         ][double: 1]

[Loop   ][3 + 2 * i      ][int: 3]

== Thread 1: 8 trace packets ================================

[Test   ][int            ][int: 2]

[Test   ][unsigned int   ][unsigned int: 2]

[Test   ][long int       ][long int: 2]

[Test   ][unsigned long i][unsigned long int: 2]

[Test   ][float          ][float: 2]

[Test   ][double         ][double: 2]

[Loop   ][3 + 2 * i      ][int: 3]

[Loop   ][3 + 2 * i      ][int: 5][/codebox]

@apaehler and iceberg:

Lots of thanks for testing! Glad that it worked for you!

Thank you :)

I was just about to write a device logging function, glad I searched first. Thank you for sharing.

Thank you for using. Please share your comments and improvements.

works great… for my code… thanks… very much :)