GPU Trace library Easily trace values from your kernels in device mode!

Tell_Me_Why · July 31, 2009, 7:31pm

Finding some free time, I tried to write a library to answer my own wish! Here is the result, gpu_trace.h!

Using this library, you can define your kernels to be traceable, and directly output values (as well as a tag and a message) from within your kernel in device mode. Here is a sample from the included test.cu:

__global__ void test __traceable__ (int dummy)
{
	int x = threadIdx.x;

	__trace("Test", "int", x);
	__trace("Test", "unsigned int", static_cast <unsigned int> (x));
	__trace("Test", "long int", static_cast <long int> (x));
	__trace("Test", "unsigned long int", static_cast <unsigned long int> (x));
	__trace("Test", "float", static_cast <float> (x));
	__trace("Test", "double", static_cast <double> (x));

	for (int i = 0; i < x; i++)
		__trace_exp("Loop", 3 + 2 * i);
}

Call the above kernel as:

int main()
{
	INITIALIZE_TRACE_DATA();

	test <<<10, 10>>> __traceable_call__ (0);

	cudaError_t ErrorCode = cudaGetLastError();
	if (ErrorCode != cudaSuccess)
		printf("*** Kernel did not launch, %s ***\n", cudaGetErrorString(ErrorCode));

	ErrorCode = cudaThreadSynchronize();
	if (ErrorCode != cudaSuccess)
		printf("*** Kernel exited while executing, %s ***\n", cudaGetErrorString(ErrorCode));

	FINALIZE_TRACE_DATA();
	PRINT_TRACE_DATA(stdout);

	return 0;
}

Sample output:

GPU Trace: collected trace data:
== Thread 0: 7 trace packets ================================
	[Test   ][int            ][int: 1]
	[Test   ][unsigned int   ][unsigned int: 1]
	[Test   ][long int       ][long int: 1]
	[Test   ][unsigned long i][unsigned long int: 1]
	[Test   ][float          ][float: 1]
	[Test   ][double         ][double: 1]
	[Loop   ][3 + 2 * i      ][int: 3]
== Thread 1: 8 trace packets ================================
	[Test   ][int            ][int: 2]
	[Test   ][unsigned int   ][unsigned int: 2]
	[Test   ][long int       ][long int: 2]
	[Test   ][unsigned long i][unsigned long int: 2]
	[Test   ][float          ][float: 2]
	[Test   ][double         ][double: 2]
	[Loop   ][3 + 2 * i      ][int: 3]
	[Loop   ][3 + 2 * i      ][int: 5]

I have tried to make the macros as similar as possible to CUDA conventions, like __global__ and __device__. If you compile the code with -D__ENABLE_TRACE__, you will see the trace data; if not, the program runs silently, just like the original version. This eliminates the need to remove trace-specific extensions, even in final code. Your kernel needs at least one parameter. If it has none, pass a dummy argument to it, as in the above example. Refer to the included test.cu for more information.
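
The thread does not show the contents of gpu_trace.h, so the following is only a host-side C++ sketch of one plausible way such macros could work, not the library's actual implementation. All names in it (ENABLE_TRACE_SKETCH, TRACEABLE, TRACEABLE_CALL, TRACE, g_trace_buf) are hypothetical. The idea is that a function-like macro prepends a hidden trace-buffer parameter to both the signature and the call, and that every macro collapses to a no-op when tracing is disabled. Under this assumption, an empty parameter list would leave a dangling comma after the hidden parameter, which may be why a kernel needs at least one argument.

```cpp
#include <vector>

// Hypothetical switch standing in for -D__ENABLE_TRACE__; remove it to disable tracing.
#define ENABLE_TRACE_SKETCH

#ifdef ENABLE_TRACE_SKETCH
// Prepend a hidden trace-buffer parameter to the signature and to the call.
#define TRACEABLE(...)      (std::vector<int>* trace_buf, __VA_ARGS__)
#define TRACEABLE_CALL(...) (&g_trace_buf, __VA_ARGS__)
#define TRACE(value)        trace_buf->push_back(value)
#else
// Tracing disabled: the macros leave the original signature and call untouched.
#define TRACEABLE(...)      (__VA_ARGS__)
#define TRACEABLE_CALL(...) (__VA_ARGS__)
#define TRACE(value)        ((void)0)
#endif

// Host-side stand-in for the device trace buffer.
static std::vector<int> g_trace_buf;

// Expands to: int kernel_like (std::vector<int>* trace_buf, int x)
int kernel_like TRACEABLE (int x)
{
    TRACE(x);
    int y = 2 * x;
    TRACE(y);
    return y;
}

// Expands to: kernel_like (&g_trace_buf, 21)
int run_sketch()
{
    return kernel_like TRACEABLE_CALL (21);
}
```

With the switch defined, run_sketch() records two values in g_trace_buf; with it removed, the same source compiles to a plain function call with no trace code left in it.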

Some of you might think that using a debugger is much better. Well, maybe! In my opinion, at least in less complex cases, it is much easier to use this library, which takes almost no time. I am sure nVidia can do this much better, and I strongly suggest that they do so.

Please let me know your opinions and experiences. Enhanced versions of this library are also welcome!

Thank you!

-Edit:

Remember to include stdio.h before gpu_trace.h for this version. Will add it in the next version.

-Edit:

Library updated to version 0.02. Added __trace_exp(tag, exp) and a conditional #include <stdio.h>.
gpu_trace.tar.gz (2.76 KB)

bollig · July 31, 2009, 8:25pm

Cool, just verified that it works well for CUDA 2.1 on OSX. Of course I get junk trace data for the double value because my 8900M doesn't support double precision. I like that you have individual traces for each thread. Have you looked at how this impacts performance, memory consumption, etc.? When I do not compile with __ENABLE_TRACE__, does it remove the code entirely, or is there anything residual that will be left to consume registers and slow things down?

Also, how about the ability to change the trace log view to show all values from a single trace call together (i.e. see values of an array grouped together) rather than the trace by thread?

-Evan

Tell_Me_Why · July 31, 2009, 9:05pm

Thank you for your feedback! I never expected the first results to be from OSX!

There should not be any performance impact when compiled without -D__ENABLE_TRACE__, as all the trace code is removed using #ifdef's. In trace mode, I doubt anyone would care about the overhead, as long as it does not prevent the kernel from launching.

About your suggestion, if you can show a good way to do this (i.e. how to find each piece in different kernel trace data), I will implement it. Also, the trace data is all there, and you are not bound to PRINT_TRACE_DATA(). You may write your own post-processor.
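
As a rough illustration of such a post-processor, here is a host-side C++ sketch that regroups packets by trace call instead of by thread. The TracePacket layout and the function name are assumptions made for this example; the real packet format in gpu_trace.h is not shown in this thread. The sketch treats the (tag, message) pair as the identity of a trace call, which only works if each call uses a distinct pair, and a call inside a loop will contribute several values per thread.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical packet layout, chosen only for this sketch.
struct TracePacket {
    int thread;           // index of the thread that emitted the packet
    std::string tag;      // e.g. "Test" or "Loop"
    std::string message;  // e.g. "int" or "3 + 2 * i"
    double value;         // traced value, widened to double for simplicity
};

typedef std::pair<std::string, std::string> CallSite;  // (tag, message)
typedef std::pair<int, double> ThreadValue;            // (thread, value)

// Collect every thread's values for the same (tag, message) pair into one list,
// preserving the order in which the packets appear in the input.
std::map<CallSite, std::vector<ThreadValue> >
group_by_call_site(const std::vector<TracePacket>& packets)
{
    std::map<CallSite, std::vector<ThreadValue> > groups;
    for (std::size_t k = 0; k < packets.size(); ++k) {
        const TracePacket& p = packets[k];
        groups[CallSite(p.tag, p.message)].push_back(ThreadValue(p.thread, p.value));
    }
    return groups;
}
```

Printing each group in turn would then show, for example, all the "int" values of every thread together, which is the array-style view asked about above.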

I was also thinking of a __trace_expression(tag, expression), which converts the expression to a string using the ‘#’ preprocessor operator, and passes them all to __trace. Any opinions?
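
A minimal host-side C++ sketch of that idea follows. The names trace_sketch and TRACE_EXP_SKETCH are hypothetical stand-ins, and the helper only formats a line on the host rather than recording a packet on the device; it is meant to show the stringizing step, not how the library implements it.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for __trace(tag, message, value): formats one line on the host.
inline std::string trace_sketch(const char* tag, const char* message, int value)
{
    char line[128];
    std::snprintf(line, sizeof line, "[%s][%s][int: %d]", tag, message, value);
    return std::string(line);
}

// '#exp' turns the source text of the expression into a string literal,
// while '(exp)' is still evaluated and passed as the traced value.
#define TRACE_EXP_SKETCH(tag, exp) trace_sketch((tag), #exp, (exp))

// Example use: the message becomes the text "3 + 2 * i" and the value is 5.
inline std::string demo_trace_exp()
{
    int i = 1;
    return TRACE_EXP_SKETCH("Loop", 3 + 2 * i);
}
```

The benefit is that the message can never drift out of sync with the expression being traced, since both come from the same macro argument.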

Tell_Me_Why · August 3, 2009, 12:03pm

I updated the library to ver. 0.02. Just a few minor changes (e.g. adding #include <stdio.h>) and a new __trace_exp(tag, exp). See included test.cu for usage sample. I will update the first post to reflect the changes.

Would somebody please test it with 2.2 and 2.3 and report the results?

parallelis · August 3, 2009, 6:01pm

I will test it :-)

It’s a great addition that I would like to have to debug my code! Thanks!

Tell_Me_Why · August 4, 2009, 1:37pm

I am very glad that you find it useful. Thanks for your feedback and tests!

It would be great if nVidia guys reading this would also comment on it.

apaehler · August 5, 2009, 12:07am

Terrific stuff! Here’s the output from a GTX-260: SDK 2.3, driver 190.18.3, OpenSuSE 11.1 64-bit

paehler@nvidia> nvcc --gpu-architecture sm_13 -I . -o tgv test.cu -D__ENABLE_TRACE__
paehler@nvidia> tgv
GPU Trace: collected trace data:
== Thread 0: 7 trace packets ================================
[Test   ][int            ][int: 1]
[Test   ][unsigned int   ][unsigned int: 1]
[Test   ][long int       ][long int: 1]
[Test   ][unsigned long i][unsigned long int: 1]
[Test   ][float          ][float: 1]
[Test   ][double         ][double: 1]
[Loop   ][3 + 2 * i      ][int: 3]
== Thread 1: 8 trace packets ================================
[Test   ][int            ][int: 2]
[Test   ][unsigned int   ][unsigned int: 2]
[Test   ][long int       ][long int: 2]
[Test   ][unsigned long i][unsigned long int: 2]
[Test   ][float          ][float: 2]
[Test   ][double         ][double: 2]
[Loop   ][3 + 2 * i      ][int: 3]
[Loop   ][3 + 2 * i      ][int: 5]

iceberg · August 5, 2009, 8:49am

Good work!

GTX 260, winxp pro 32bit, cuda v2.3, driver 190.38

GPU Trace: collected trace data:
== Thread 0: 7 trace packets ================================
[Test   ][int            ][int: 1]
[Test   ][unsigned int   ][unsigned int: 1]
[Test   ][long int       ][long int: 1]
[Test   ][unsigned long i][unsigned long int: 1]
[Test   ][float          ][float: 1]
[Test   ][double         ][double: 1]
[Loop   ][3 + 2 * i      ][int: 3]
== Thread 1: 8 trace packets ================================
[Test   ][int            ][int: 2]
[Test   ][unsigned int   ][unsigned int: 2]
[Test   ][long int       ][long int: 2]
[Test   ][unsigned long i][unsigned long int: 2]
[Test   ][float          ][float: 2]
[Test   ][double         ][double: 2]
[Loop   ][3 + 2 * i      ][int: 3]
[Loop   ][3 + 2 * i      ][int: 5]

Tell_Me_Why · August 5, 2009, 10:47am

@apaehler and iceberg:

Lots of thanks for testing! Glad that it worked for you!

Cudabean · August 27, 2009, 1:09am

Thank you :)

I was just about to write a device logging function, glad I searched first. Thank you for sharing.

Tell_Me_Why · August 29, 2009, 8:08am

Thank you for using. Please share your comments and improvements.

nitin.life · October 11, 2009, 10:45pm

works great… for my code… thanks… very much :)
