calling a __device__ functions inside kernels

I’ve a small code which is going in 512 threads and in 10 000 thread blocks.
The processing takes about 14 ms.
I’ve to call the same code 2 times with different variables in each thread,
and I can do it in two ways:

  1. copy the same fragment of code two times and change the name of the variables in one part of code
  2. prepare the device function and call 2 times with different arguments.

I’ve implemented these two options and in the second case the whole processing takes 1 ms more time.

Have you got the same experience?
Does it cost a lot to call a device function?

Best wishes,


In nearly all cases, device functions are inlined, so I don’t have a good idea why you would get different code in the two cases.

My understanding (but correct me if I’m wrong) is the following.

  1. CUDA does not support function inlining across different compilation units. This could be a possible reason of the 1ms overhead.
  2. Within a single compilation unit, function inlining is performed at discretion of the compiler, which decides whether inlining is likely to improve performance. Accordingly, the __device__ function the user is talking about could actually be uninlined. What happens if the user tries to force inlining by the attribute __forceinline__? Does the situation improves?

I have a last question. How did you time the kernel launch? Did you take some averages between different kernel launches?

Hi, first of all thank you for your replies:)

I would like to find the true, and I spend several hours for testing. I would like to show you my results.
Now I simplified the code to solve the problem only and now the time of execution is smaller than in previous post.

The function about I wrote in previous post was prepared for calculation of amplitude and phase difference between samples after Fourier transformation. In previous version I had only one function which has returned type “void” and the results were returned by reference to the variable (argument). Generally it is not a good idea - it slowed down the kernel dramatically. I decided to prepare two simple function without reference:

__forceinline __device__ float calculateAmplitude(float re1, float im1, float re2,float im2)
	re1 = re1*re1+ im1*im1;
	return sqrtf(re1);

__forceinline __device__ float calculatePhase(float re1, float im1, float re2,float im2)
	register float x,y;
	y = im1*re2 - re1*im2;
	x = re1*re2 + im1*im2;
	return atan2f(y,x);

And now the most important parts of kernel function code:

__global__ void kernelFunction(...)
	// calculation of indexes and taking data from device memory

	*(phase_out+index) = calculatePhase(re1,im1,re2,im2);
	*(phase_out+index+windowLength) = calculatePhase(re2,im2,re3,im3);

	*(data_out+index)   =  calculateAmplitude(re1,im1,re2,im2);
	*(data_out+index+windowLength) =  calculateAmplitude(re2,im2,re3,im3);

	// index increment, and taking the new data

	*(phase_out+index) = calculatePhase(re1,im1,re2,im2);
	*(phase_out+index+windowLength) = calculatePhase(re2,im2,re3,im3);

	*(data_out+index)   =  calculateAmplitude(re1,im1,re2,im2);
	*(data_out+index+windowLength) =  calculateAmplitude(re2,im2,re3,im3);

And now the most interesting - the results from three tests:
Information from CUDA Profiler:
__GPU Time[us]_______grid_size______th. block size______reg. per thread_____occupancy

  1. 4109.79___________[4999 1 1]______[512 1 1]___________19__________________1
  2. 4685.34___________[4999 1 1]______[512 1 1]___________23__________________0.667
  3. 3894.75___________[4999 1 1]______[512 1 1]___________19__________________1

The first result refers to situation presented in code.
The second result was obtained with calling function without __forceinline, and I suppose that the function arguments needs additional 4 registers, and therefore the occupancy was limited by registers (I’ve checked it in occupancy calculator also).
And finally there are results for code without any functions, all code was copied several times and it looks like this:

	y = im1*(re2) - re1*(im2);
	x = re1*(re2) + im1*(im2);
	*(flow_out+index) = atan2f(y,x);

	re1 = (re1)*(re1)+ im1*(im1);
 	*(data_out+index) = sqrtf(re1);

	y = im2*(re3) - re2*(im3);
	x = re2*(re3) + im2*(im3);
	*(flow_out+index+windowLength) = atan2f(y,x);

And the answers for JFSebastian: generally I’m testing the time of kernel execution by utilizing windows library and functions:
it gives me more less the same time like CUDA Profiler… and it’s faster and easier… but when I’m looking for something strange like this calling a device functions I’m using CUDA Profiler.
In this case I didn’t take any averages because the difference in time in several runs are less than 1-2%.
Thank you for your suggestions with __forceinline command:) I’ve completely forgotten about it.
You told about different compilation units, please correct me if I’m wrong - when all these three functions are in the same file, I’m in the same compilation unit, is it true?

And generally, what do you think? Am I doing something wrong, or it is normal to lose this more than 200 us even if I’m using __forceinline.
If you think that there may be something wrong in the rest of code I can add it. I’ve shown the most important parts from my point of view to show it more clearly.

Thank you for your replies:)

I have two comments:

  1. There is a recent thread on this forum concerning timing CUDA kernels, see I would suggest to take a look at it. SPWorley was also recommending the use of clock64() for microbenchmarking purposes. It might be of interest to you.
  2. In CUDA language, forceinline requires a double underscore both before and after the keyword, that is, __forceinline__; __forceinline, without the double underscore after the keyword, as in your code snippet, is C++.

I have experience a successful use of forceinline for developing a CUDA library based on expression templates.

Finally, yes, you are in the case when the three functions are in the same compilation units.