Make my kernel run for longer!

Hello everyone,

I'm confused about where to put my iteration loop to make the kernel run longer, and I'd like to know the difference between these two approaches:

  1. Iterations inside the __global__ function, for example vector addition:
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int nIter = 1000;

    for (int k = 0; k < nIter; k++)
    {
        if (i < numElements)
        {
            C[i] = A[i] + B[i];
        }
        __syncthreads();
    }

}
  2. Outside the __global__ function, in my main, like this:
int nIter = 1000;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

for (int j = 0; j < nIter; j++)
{
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
}

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

Q1: Is there no launch overhead for the second approach?
Q2: Or do you think the two are similar?
Q3: Which approach should I adopt to be sure that I'm repeating my kernel?

Thank you in advance for your response,
Dorra

Neither of these approaches does anything sensible with the iterations beyond the first: every pass recomputes exactly the same result. The first approach launches (presumably) a single kernel that takes a longer time. The second approach launches many kernels, each of which takes a shorter time, but the aggregate time may well exceed that of the first approach, since every launch adds its own overhead.

If you want to "repeat" a kernel, you need a host-side loop like the one in your second approach. Since you have not shown the host code for the first case, or the device code for the second, my answer is necessarily imprecise.
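To illustrate why the in-kernel loop as written is not doing sensible repeated work, here is a minimal sketch of a variant where each iteration contributes something observable. This is an assumption-laden example, not your code: the kernel name vectorAddAccum and the idea of passing nIter as a parameter are my own choices.

```cuda
// Sketch: an in-kernel loop where every iteration does distinct work,
// so the repetition cannot be collapsed into a single pass.
__global__ void vectorAddAccum(const float *A, const float *B, float *C,
                               int numElements, int nIter)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        float acc = 0.0f;
        for (int k = 0; k < nIter; k++)
        {
            acc += A[i] + B[i];   // each iteration adds to the running sum
        }
        C[i] = acc;               // final value depends on every iteration
    }
}
```

With a dependency like this between iterations, a longer kernel runtime actually reflects repeated work, rather than the same store being issued 1000 times. Note also that no __syncthreads() is needed here, since each thread only touches its own element.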

Rather than asking about it here, learn to use a GPU profiler; with some study, the differences between your approaches will be readily evident.
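For example, assuming your compiled binary is called ./vectorAdd (a placeholder name) and depending on which profiler your CUDA toolkit ships, you might run:

```shell
# Legacy profiler: prints per-kernel times and launch counts
nvprof ./vectorAdd

# Nsight Systems: records a timeline of kernel launches and the
# gaps between them, then prints summary statistics
nsys profile --stats=true ./vectorAdd
```

Either output will show you directly whether you launched one long kernel or many short ones, and how much time the launch overhead contributes in the second approach.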