Hi, first of all thank you for your replies:)
I wanted to get to the bottom of this, so I spent several hours testing; I would like to show you my results.
I have now simplified the code so that it addresses only this problem, so the execution times are shorter than in my previous post.
The function I wrote about in my previous post computed the amplitude and the phase difference between samples after a Fourier transform. In the previous version I had a single function of return type void, and the results were passed back through reference arguments. That turned out to be a bad idea: it slowed the kernel down dramatically. So I prepared two simple functions that return by value instead (a sketch of the old version follows the new code below):
__forceinline __device__ float calculateAmplitude(float re1, float im1, float re2, float im2)
{
    // re2/im2 are unused here; the signature just mirrors calculatePhase.
    return sqrtf(re1*re1 + im1*im1);   // magnitude of the first sample
}
__forceinline __device__ float calculatePhase(float re1, float im1, float re2, float im2)
{
    // Phase difference between the two samples: atan2 of s1 * conj(s2).
    register float x, y;   // 'register' is only a hint; the compiler allocates as it sees fit
    y = im1*re2 - re1*im2;
    x = re1*re2 + im1*im2;
    return atan2f(y, x);
}
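For comparison, the old version was roughly of this shape (a sketch from memory, with a made-up name, not the exact code):

__device__ void calculateAmpAndPhase(float re1, float im1, float re2, float im2,
                                     float &amplitude, float &phase)   // results via references
{
    amplitude = sqrtf(re1*re1 + im1*im1);
    phase     = atan2f(im1*re2 - re1*im2, re1*re2 + im1*im2);
}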
And now the most important part of the kernel code:
__global__ void kernelFunction(...)
{
    // calculation of indexes and loading of data from device memory
    phase_out[index]                = calculatePhase(re1, im1, re2, im2);
    phase_out[index + windowLength] = calculatePhase(re2, im2, re3, im3);
    data_out[index]                 = calculateAmplitude(re1, im1, re2, im2);
    data_out[index + windowLength]  = calculateAmplitude(re2, im2, re3, im3);
    // index increment, and loading of the new data
    phase_out[index]                = calculatePhase(re1, im1, re2, im2);
    phase_out[index + windowLength] = calculatePhase(re2, im2, re3, im3);
    data_out[index]                 = calculateAmplitude(re1, im1, re2, im2);
    data_out[index + windowLength]  = calculateAmplitude(re2, im2, re3, im3);
}
And now the most interesting part: the results of three tests.
Information from CUDA Profiler:
GPU Time [us]   grid size    thread block size   regs per thread   occupancy
4109.79         [4999 1 1]   [512 1 1]           19                1.000
4685.34         [4999 1 1]   [512 1 1]           23                0.667
3894.75         [4999 1 1]   [512 1 1]           19                1.000
The first row corresponds to the code presented above.
The second row was obtained by calling the functions without __forceinline. I suppose the function call needs 4 additional registers for the arguments, and therefore occupancy became limited by register usage (I have also checked this in the occupancy calculator). If I did the math right for a Fermi-class SM (32768 registers, up to 1536 resident threads): 23 × 1536 = 35328 > 32768, so only two 512-thread blocks can be resident, and 1024/1536 ≈ 0.667, which matches the profiler.
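As a side note on the register limit: if I read the programming guide correctly, one can also cap register usage per thread, either with the nvcc option -maxrregcount or with the __launch_bounds__ qualifier. A minimal sketch (placeholder kernel, not my actual code):

// Ask the compiler to keep the kernel launchable with 512 threads per block,
// which bounds the number of registers it may use per thread.
__global__ void __launch_bounds__(512)
cappedKernel(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];   // placeholder body
}

Of course capping registers can force spills to local memory, so it will not necessarily help the timing.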
And finally, the third row is for code without any functions at all; the same computation was simply copied inline several times, like this:
...
y = im1*re2 - re1*im2;
x = re1*re2 + im1*im2;
flow_out[index] = atan2f(y, x);
re1 = re1*re1 + im1*im1;
data_out[index] = sqrtf(re1);
y = im2*re3 - re2*im3;
x = re2*re3 + im2*im3;
flow_out[index + windowLength] = atan2f(y, x);
...
And to answer JFSebastian: generally I measure the kernel execution time with the Windows high-resolution timer functions:
LARGE_INTEGER countsPerSec, tim1;
QueryPerformanceFrequency(&countsPerSec);
QueryPerformanceCounter(&tim1);
This gives me more or less the same times as the CUDA Profiler, and it is faster and easier to use; but when I am chasing something strange, like this device-function overhead, I use the CUDA Profiler.
In this case I did not take any averages, because the differences between several runs were less than 1-2%.
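For completeness, the whole timing pattern looks roughly like this (a sketch, not my exact code; the synchronization call matters because kernel launches return immediately, so without it the timer would only measure the launch overhead):

#include <windows.h>
#include <cuda_runtime.h>

// Sketch: time a kernel launch with the Windows high-resolution counter.
double timeKernelMs()
{
    LARGE_INTEGER countsPerSec, tim1, tim2;
    QueryPerformanceFrequency(&countsPerSec);

    QueryPerformanceCounter(&tim1);
    // kernelFunction<<<grid, block>>>(/* arguments */);
    cudaThreadSynchronize();   // wait for the kernel to finish (cudaDeviceSynchronize in newer CUDA)
    QueryPerformanceCounter(&tim2);

    return 1000.0 * (double)(tim2.QuadPart - tim1.QuadPart)
                  / (double)countsPerSec.QuadPart;
}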
Thank you for the suggestion about __forceinline :) I had completely forgotten about it.
You mentioned different compilation units; please correct me if I am wrong: when all three of these functions are in the same file, they are in the same compilation unit, right?
And in general, what do you think? Am I doing something wrong, or is it normal to lose more than 200 us even when using __forceinline?
If you think there might be something wrong in the rest of the code, I can post it; I have only shown what I consider the most important parts, to keep things clear.
Thank you for your replies:)