Profiling a particular CUDA function

Is there a way to profile an arbitrary device function?

I know how to profile a whole kernel using cudaprof, but I would like to see how a single function performs relative to total kernel execution time. Or maybe there is some other smart and tricky way to get a rough estimate of function performance. Also, the new Fermi architecture is supposed to support function pointers, which should allow functions to be called at runtime, as opposed to the current compile-time function inlining behavior, but I don't know how to exploit this feature in cudaprof.

Do you guys know how to use the cudaprof Profile Triggers, and what are they good for?

The biggest obstacle to profiling device functions is that there is actually no such thing as a device function in the code the GPU runs. The compiler inline-expands all device functions inside kernels, so afterwards there is no function left and nothing a typical profiler can hook into to instrument function calls.

There is a clock() function available in device code that can be used to instrument code sections, although it has been demonstrated that compiler optimization can move the clock instructions around, which makes microbenchmarking of compact code sections a bit hit and miss.

Thanks for this hint. I tried to adapt the clock example from the CUDA SDK. It can be used for measuring the kernel as a whole: the resulting clock value divided by the GPU clock rate (in kHz) gives approximately the same time as a gettimeofday measurement of kernel execution time on the host side. The example records start and end clock values for each thread block. Then on the host side, the minimal start value and the maximal end value are found, and the difference between those two values gives the clock count for the whole kernel execution.

To measure just a part of a kernel (a couple of lines, or a function), I store clock values at the start and the end of that part, similar to the CUDA clock example. On the host side, I then search for the thread block with the largest start-to-end difference. Something like:

kernel:

	...
	if (tid == 0) timer[blockIdx.x] = clock();

	// measured part of the kernel

	if (tid == 0) timer[blockIdx.x + gridDim.x] = clock();
	...

host:

	time = 0;
	for (int i = 0; i < numBlocks; i++) {
		start = timer[i];
		end   = timer[i + numBlocks];
		time  = (end - start) > time ? (end - start) : time;
	}


The KISS method I often use to measure function overhead is crude, but it's so easy that it can quickly tell you whether a certain routine is a bottleneck or not.

The trick is to first benchmark your unchanged code, finding total runtime with a typical input as a base reference.

Now you go back to your source and DOUBLE the invocations of the device function in question. You can't eliminate them (since that could change your program's behavior), but you can do things like changing

x = myDeviceFunction(y, z);

to

x = myDeviceFunction(y, z);
x = (x + myDeviceFunction(y, z)) / 2;

The exact "doubling" depends on your device function: its side effects, whether it returns an integer or a float, and so on.

Now you compare the timing of the "doubled" version; you'll get a SLOWER runtime since you're doing more work. The difference in runtimes, divided by your initial runtime, is the fraction of time that the device function in question uses, letting you see whether it was a bottleneck and by how much.

This all works even if the code is expanded and inlined. (You do have to be careful that your "doubling" can't be optimized away by the compiler, so make sure both return values are used…)

Simple, crude, and pretty inelegant, but it works pretty well and you can do it in 30 seconds. I usually get more intuition from such crude tests than I do from the profiler.

If your device function is called from many places, you can just make a wrapper: rename the device function to "REALmyDeviceFunction", then make a "myDeviceFunction" that does the double invocation. All your original code can stay unchanged, since it's still calling "myDeviceFunction".


Thanks, SPWorley. That's smart. :)

BTW, is "the KISS method" an official name for this measuring procedure?


KISS stands for "Keep It Simple, Stupid". It's the fundamental rule of algorithm design and debugging.
