Hi! I’m a beginner in CUDA programming.
I wrote my own CUDA code to solve the 2D Poisson equation using the Jacobi method.
To solve the Poisson equation, I have to iterate my Poisson-solving kernel 1000 times.
The code looks like this:
__global__ void Poisson_Solving(double *phip, double *phim, double *rhs)
{
int i = threadIdx.x;
int j = blockIdx.x;
int index = i + j * blockDim.x;
// ... Jacobi update writing phip[index] from phim and rhs ...
}
__global__ void Phi_Copy(double *phip, double *phim)
{
int i = threadIdx.x;
int j = blockIdx.x;
int index = i + j * blockDim.x;
phim[index] = phip[index];
}
int main (void)
{
:
for (int s = 0; s < 1000; s++) {
Poisson_Solving<<<numBlock,numThread>>>(phip,phim,rhs);
Phi_Copy<<<numBlock,numThread>>>(phip,phim);
}
:
}
The results are very satisfying, but the problem is the kernel execution time.
I checked the kernel execution time for every iteration.
During the first 500 iterations, the kernel execution time is about 0.00001 sec.
But during the remaining 500 iterations, it is about 0.0003 sec.
The kernel execution time jumps up by a factor of 30.
Why is this happening?
Is there something wrong in my code with the thread hierarchy or the kernel execution sequence?
This is indeed odd, as I would expect the runtime of simple arithmetic operations to be constant. CUDA handles denormal numbers at full speed, and you are not calling any trigonometric or transcendental functions that might take a slower code path for specific numeric ranges.
What you may be seeing is the kernel launch queue getting full. Are you running on Windows or Linux? How exactly were you measuring the kernel run time? The most precise method is with cudaEvents as detailed here: https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/
Note that kernel launches are asynchronous in general, unless for example the kernel launch queue is full and the launches become blocking until the queue can take more kernels.
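For reference, a minimal cudaEvent-based timing sketch (the buffer and launch-configuration names are taken from your code; the event handling is illustrative):

```cuda
// Sketch: time one iteration of the two kernels with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
Poisson_Solving<<<numBlock, numThread>>>(phip, phim, rhs);
Phi_Copy<<<numBlock, numThread>>>(phip, phim);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until both kernels have finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("iteration took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the GPU timeline, this measures actual kernel execution time rather than just the (asynchronous) launch overhead seen by the CPU.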
You could try using dynamic parallelism to move the for loop with the kernel launches into a kernel itself.
I’m running my code on Linux.
As you say, I also suspect the kernel launch queue is getting full.
Is there any way to prevent the kernel launch queue from filling up?
And what is the benefit of using dynamic parallelism?
Adding a cudaDeviceSynchronize() after the two kernel calls would ensure that the kernels have finished executing before the next iteration starts, but overall it may hurt performance slightly.
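A sketch of your loop with that explicit synchronization added (same launch configuration as in your code):

```cuda
for (int s = 0; s < 1000; s++) {
    Poisson_Solving<<<numBlock, numThread>>>(phip, phim, rhs);
    Phi_Copy<<<numBlock, numThread>>>(phip, phim);
    cudaDeviceSynchronize();  // block the CPU until both kernels finish,
                              // so the launch queue never fills up
}
```

The trade-off: draining the queue every iteration means launch overhead is no longer hidden behind kernel execution, which is why per-iteration timings may look slightly worse even though they are now consistent.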
Dynamic parallelism allows CUDA device code to launch kernels. It shifts some of the kernel launch overhead onto the GPU itself. From the perspective of the CPU, you would only be launching a single kernel that performs the 1000 iterations.
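A rough sketch of the dynamic-parallelism approach, assuming a device of compute capability 3.5 or higher and compilation with relocatable device code (`-rdc=true`); `Poisson_Driver` is a hypothetical parent kernel, the rest follows your code:

```cuda
__global__ void Poisson_Driver(double *phip, double *phim, double *rhs,
                               int numBlock, int numThread)
{
    // Parent kernel: a single thread issues all 1000 iterations from the GPU.
    for (int s = 0; s < 1000; s++) {
        // Child launches into the same stream execute in order,
        // so Phi_Copy always sees the result of Poisson_Solving.
        Poisson_Solving<<<numBlock, numThread>>>(phip, phim, rhs);
        Phi_Copy<<<numBlock, numThread>>>(phip, phim);
    }
}
```

The host then performs a single launch, e.g. `Poisson_Driver<<<1, 1>>>(phip, phim, rhs, numBlock, numThread);`, instead of queueing 2000 kernel launches.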