Logging a trace of memory accesses on the GPU


I am doing a small research project comparing the memory behavior of a CPU process with that of a GPU process. The basic idea is quite simple: the same program is built once for CPU execution and once for GPU execution with CUDA.

I can easily get the CPU trace by logging the dynamic trace of the CPU process with a debugger. However, since the debugger only reaches the CPU address space, it seems to me that there is no way to get the GPU trace at all.

Is there any way or tool to log the dynamic CUDA trace?
Any comments would be very helpful.

Thank you.

Set the environment variable CUDA_PROFILE to 1 and then run your app. There is more documentation available in the doc/ directory of the toolkit download.
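If you would rather not export the variable in your shell, here is a minimal sketch of launching the app from a small Python script with CUDA_PROFILE set only for that one run. The helper names (`profiling_env`, `run_with_profile`) and the `./my_cuda_app` path are made up for illustration; only the CUDA_PROFILE variable itself comes from the post above.

```python
import os
import subprocess


def profiling_env(base_env=None):
    """Return a copy of the environment with CUDA_PROFILE set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_PROFILE"] = "1"
    return env


def run_with_profile(command):
    """Run `command` (a list of strings) with profiling enabled; return its exit code."""
    return subprocess.run(command, env=profiling_env(), check=False).returncode


if __name__ == "__main__":
    # "./my_cuda_app" is a placeholder; replace it with your own executable.
    print("exit code:", run_with_profile(["./my_cuda_app"]))
```

The environment is copied rather than modified in place, so other programs started from the same script are not profiled by accident.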

Doesn’t running your application in debug mode cause performance problems though?

Thanks for the reply. In fact, I already know how to get the profile the way you showed, but that profile is too coarse to be used as a trace.

I need something more detailed, such as a dynamic trace (or dynamic stream) of the disassembled instructions running on the GPU. Or, at the very least, more detailed information on how many bytes were transferred by each CUDA API call.

Is there any way or tool to do this?

Thanks again.

No such tracing tool exists on the GPU. When I perform these kinds of comparisons, I just go through the kernel by hand and count the number of global memory reads and such, then output statistics based on the counted values. But then my kernels perform a very predictable set of memory reads based on their input so this is relatively easy to do.
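To make the hand-counting idea concrete, here is a rough sketch of the bookkeeping. It assumes a kernel where every thread performs the same fixed list of global memory accesses; the SAXPY-style access list and the `count_traffic` helper are hypothetical examples, not taken from any real kernel of mine. Note that it counts bytes requested by the threads, not the memory transactions the hardware actually issues.

```python
def count_traffic(accesses, num_threads):
    """Total global memory reads/writes for a kernel with a fixed per-thread pattern.

    `accesses` is a list of (description, bytes_per_thread, is_write) tuples,
    each describing one access that every thread performs exactly once.
    """
    stats = {"reads": 0, "read_bytes": 0, "writes": 0, "write_bytes": 0}
    for _description, bytes_per_thread, is_write in accesses:
        kind = "write" if is_write else "read"
        stats[kind + "s"] += num_threads
        stats[kind + "_bytes"] += num_threads * bytes_per_thread
    return stats


# Hypothetical kernel: y[i] = a * x[i] + y[i], one thread per float element.
saxpy_accesses = [
    ("load x[i]", 4, False),
    ("load y[i]", 4, False),
    ("store y[i]", 4, True),
]

if __name__ == "__main__":
    for key, value in count_traffic(saxpy_accesses, num_threads=1 << 20).items():
        print(f"{key}: {value}")
```

If a kernel's accesses depend on the data (branches, variable loop counts), a fixed table like this no longer works and you would have to count inside the kernel instead.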

You can use the performance counters in the 1.1 profiler to get a measure of how many warp reads are performed on a single multiprocessor, but that again lacks byte count information AFAIK. Maybe it counts a float4 read as 4 reads, I’m not sure.

Thank you very much. I will give it a try. :))

I really appreciate your help.

Be sure to read the profiler docs. I think the counts it returns are for just one multiprocessor (or maybe one block), so they are good for relative timing, but extrapolating them to absolute counts for the whole kernel will take some multiplication.
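The multiplication might look something like the sketch below. It assumes the work is spread evenly, so every multiprocessor (or every block) sees roughly the same number of accesses as the one that was sampled; whether the counter covers one multiprocessor or one block is exactly the thing to check in the profiler docs. The function name and the numbers in the example are made up for illustration.

```python
def estimate_total(sample_count, scope, num_multiprocessors, num_blocks):
    """Scale a counter sampled on one unit up to a whole-kernel estimate.

    scope is "multiprocessor" if the counter covers one multiprocessor,
    or "block" if it covers one thread block. Assumes uniform work per unit.
    """
    if scope == "multiprocessor":
        return sample_count * num_multiprocessors
    if scope == "block":
        return sample_count * num_blocks
    raise ValueError("scope must be 'multiprocessor' or 'block'")


if __name__ == "__main__":
    # Made-up numbers: 12,000 counted reads, 16 multiprocessors, 256 blocks.
    for scope in ("multiprocessor", "block"):
        total = estimate_total(12_000, scope, num_multiprocessors=16, num_blocks=256)
        print(f"per-{scope} counter -> about {total} reads in total")
```

The two interpretations differ by a large factor here, so it is worth sanity-checking the result against a hand count on a small launch.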