calling a __device__ function inside kernels

I have a small piece of code that runs with 512 threads per block and 10 000 thread blocks.
The processing takes about 14 ms.
I have to run the same code twice with different variables in each thread,
and I can do it in two ways:

  1. copy the same fragment of code twice and change the variable names in one copy, or
  2. prepare a __device__ function and call it twice with different arguments (sketched below).
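
Roughly, option 2 would look like this (only a sketch; the names and the body are placeholders, not my real code):

__device__ float process(float a, float b)
{
	// the shared fragment of code
	return a * b;                                   // placeholder body
}

__global__ void kernel(const float *in, float *out1, float *out2)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out1[i] = process(in[2*i], in[2*i + 1]);        // first set of variables
	out2[i] = process(in[2*i + 1], in[2*i + 2]);    // second set of variables
}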

I have implemented both options, and in the second case the whole processing takes about 1 ms longer.

Have you had the same experience?
Does it cost a lot to call a device function?

Best wishes,

Mars

In nearly all cases, device functions are inlined, so I can’t see why you would get different code in the two cases.

My understanding (but correct me if I’m wrong) is the following.

  1. CUDA does not support function inlining across different compilation units. This could be a possible reason for the 1 ms overhead.
  2. Within a single compilation unit, function inlining is performed at the discretion of the compiler, which decides whether inlining is likely to improve performance. Accordingly, the __device__ function the user is talking about might not actually be inlined. What happens if the user tries to force inlining with the __forceinline__ qualifier (see the sketch after this list)? Does the situation improve?
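
For example, a minimal sketch of what I mean (the function name and body are just placeholders, not the user's code):

__forceinline__ __device__ float magnitude(float re, float im)
{
	// __forceinline__ asks nvcc to inline this at every call site
	return sqrtf(re * re + im * im);
}

__global__ void kernel(const float *in, float *out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = magnitude(in[2 * i], in[2 * i + 1]);   // no call overhead once inlined
}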

One last question: how did you time the kernel? Did you average over several kernel launches?

Hi, first of all thank you for your replies:)

I wanted to get to the bottom of this, so I spent several hours testing, and I would like to show you my results.
I have simplified the code to isolate the problem, so the execution times are now shorter than in my previous post.

The function I wrote about in my previous post computes the amplitude and the phase difference between samples after the Fourier transform. In the previous version I had a single function returning void, with the results passed back by reference through the arguments. Generally this is not a good idea - it slowed the kernel down dramatically. So I decided to prepare two simple functions without reference arguments (a sketch of the old interface follows them, for comparison):

__forceinline __device__ float calculateAmplitude(float re1, float im1, float re2, float im2)
{
	// amplitude of the first sample: sqrt(re1^2 + im1^2)
	re1 = re1*re1 + im1*im1;
	return sqrtf(re1);
}

__forceinline __device__ float calculatePhase(float re1, float im1, float re2, float im2)
{
	// phase difference between the two samples: atan2 of Im and Re of z1 * conj(z2)
	register float x, y;
	y = im1*re2 - re1*im2;
	x = re1*re2 + im1*im2;
	return atan2f(y, x);
}
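
For comparison, the earlier void version looked roughly like this (reconstructed from memory; the exact name and argument order may have been different):

__device__ void calculateAmplitudeAndPhase(float re1, float im1, float re2, float im2,
                                           float &amplitude, float &phase)
{
	// results passed back through reference arguments - this was the slow variant
	amplitude = sqrtf(re1*re1 + im1*im1);
	phase     = atan2f(im1*re2 - re1*im2, re1*re2 + im1*im2);
}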

And now the most important parts of the kernel code:

__global__ void kernelFunction(...)
{
	// calculation of indexes and taking data from device memory

	*(phase_out+index) = calculatePhase(re1,im1,re2,im2);
	*(phase_out+index+windowLength) = calculatePhase(re2,im2,re3,im3);

	*(data_out+index)   =  calculateAmplitude(re1,im1,re2,im2);
	*(data_out+index+windowLength) =  calculateAmplitude(re2,im2,re3,im3);

	// index increment, and taking the new data

	*(phase_out+index) = calculatePhase(re1,im1,re2,im2);
	*(phase_out+index+windowLength) = calculatePhase(re2,im2,re3,im3);

	*(data_out+index)   =  calculateAmplitude(re1,im1,re2,im2);
	*(data_out+index+windowLength) =  calculateAmplitude(re2,im2,re3,im3);
}

And now the most interesting part - the results from the three tests, as reported by the CUDA Profiler:
Test    GPU Time [us]    Grid size     Block size    Regs/thread    Occupancy
  1.    4109.79          [4999 1 1]    [512 1 1]     19             1.000
  2.    4685.34          [4999 1 1]    [512 1 1]     23             0.667
  3.    3894.75          [4999 1 1]    [512 1 1]     19             1.000

The first result refers to the situation presented in the code above.
The second result was obtained by calling the functions without __forceinline; I suppose the function arguments need 4 additional registers, so occupancy becomes limited by register usage (I have checked this in the occupancy calculator as well).
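If the card is a Fermi-class GPU with 32768 registers and up to 1536 resident threads per multiprocessor (an assumption on my part - adjust the limits for your device), the numbers work out like this:

	23 regs/thread x 512 threads = 11776 registers per block  ->  2 blocks fit per SM  ->  1024/1536 = 0.667 occupancy
	19 regs/thread x 512 threads =  9728 registers per block  ->  3 blocks fit per SM  ->  1536/1536 = 1.000 occupancy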
And finally, the third result is for code without any functions at all: everything was copied inline several times, and it looks like this:

...
	y = im1*(re2) - re1*(im2);
	x = re1*(re2) + im1*(im2);
	*(flow_out+index) = atan2f(y,x);

	re1 = (re1)*(re1)+ im1*(im1);
 	*(data_out+index) = sqrtf(re1);

	y = im2*(re3) - re2*(im3);
	x = re2*(re3) + im2*(im3);
	*(flow_out+index+windowLength) = atan2f(y,x);
	...

And now the answers for JFSebastian: generally I measure kernel execution time with the Windows functions
QueryPerformanceFrequency(&countsPerSec);
QueryPerformanceCounter(&tim1);
This gives me more or less the same times as the CUDA Profiler, and it is faster and easier to use; but when I am investigating something strange, like this device-function issue, I use the CUDA Profiler.
In this case I did not take any averages, because the run-to-run differences are less than 1-2%.
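
For completeness, the timing wrapper looks roughly like this (simplified; the kernel name and launch configuration are placeholders, and I synchronize before the second timestamp so that the asynchronous launch is measured completely):

#include <windows.h>        // QueryPerformanceFrequency / QueryPerformanceCounter
#include <cuda_runtime.h>   // cudaDeviceSynchronize

void timeKernel()
{
	LARGE_INTEGER countsPerSec, tim1, tim2;
	QueryPerformanceFrequency(&countsPerSec);

	QueryPerformanceCounter(&tim1);
	kernelFunction<<<grid, block>>>(/* ... arguments ... */);   // placeholder launch
	cudaDeviceSynchronize();                                    // wait until the kernel has finished
	QueryPerformanceCounter(&tim2);

	double elapsed_ms = 1000.0 * (tim2.QuadPart - tim1.QuadPart) / (double)countsPerSec.QuadPart;
}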
Thank you for the suggestion about __forceinline:) I had completely forgotten about it.
You mentioned different compilation units - please correct me if I am wrong: when all three functions are in the same file, they are in the same compilation unit, right?

And in general, what do you think? Am I doing something wrong, or is it normal to lose these 200+ us even when using __forceinline?
If you think there may be something wrong in the rest of the code, I can post it; I have only shown the parts that seem most important, to keep things clear.

Thank you for your replies:)

I have two comments:

  1. There is a recent thread on this forum about timing CUDA kernels: https://devtalk.nvidia.com/default/topic/572637/cuda-programming-and-performance/number-of-gpu-clock-cycles/. I suggest taking a look at it. SPWorley also recommends clock64() for microbenchmarking purposes (a minimal sketch follows this list); it might be of interest to you.
  2. In CUDA, forceinline requires a double underscore both before and after the keyword, that is, __forceinline__; __forceinline, without the trailing double underscore, as in your code snippet, is the Microsoft C++ keyword.
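
For the clock64() approach, a minimal sketch could look like this (the measured section and the output buffer are placeholders):

__global__ void timedKernel(long long *cycles /* one entry per block */)
{
	long long start = clock64();

	// ... section being measured ...

	long long stop = clock64();
	if (threadIdx.x == 0)
		cycles[blockIdx.x] = stop - start;   // copy back to the host and convert using the GPU clock rate
}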

I have successfully used __forceinline__ when developing a CUDA library based on expression templates.

Finally, yes: when the three functions are in the same file, they belong to the same compilation unit.