When I change the number of loop iterations in the kernel function, the running time of the kernel is almost unchanged

I tried N = 1000 and N = 1000000, and the time was almost unchanged. Here are the code and the results.
Thanks for your help.
#define N 10000

__global__ void kernel_1()
{
  double sum = 0.0;
  for (int i = 0; i < N; i++)
    sum = sum + tan(0.1) * tan(0.1);
}

kernel_1 through kernel_4 are all the same.
Here is the nvprof result.
N = 1000
Type  Time(%)      Time  Calls       Avg       Min       Max  Name
       28.19%  17.248us     16  1.0780us     992ns  2.0160us  kernel_1(void)
       23.95%  14.656us     16     916ns     896ns  1.2160us  kernel_3(void)
       23.95%  14.656us     16     916ns     864ns  1.2160us  kernel_4(void)
       23.90%  14.624us     16     914ns     864ns  1.2160us  kernel_2(void)

N = 1000000
Type  Time(%)      Time  Calls       Avg       Min       Max  Name
       28.20%  17.344us     16  1.0840us     992ns  2.0160us  kernel_1(void)
       23.93%  14.720us     16     920ns     896ns  1.2160us  kernel_3(void)
       23.93%  14.720us     16     920ns     864ns  1.2160us  kernel_4(void)
       23.93%  14.720us     16     920ns     864ns  1.2160us  kernel_2(void)

please don’t post text as pictures here

I’m sorry, I don’t know the rule

Your code is being optimized away by the compiler because it is not affecting global state or observable results.
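
One way to see this for yourself (a sketch; kernel.cu is a hypothetical name for a file containing the kernel above) is to dump the intermediate PTX and look at the body of kernel_1. Since nothing observable depends on sum, the loop should not appear in the generated code at all:

$ nvcc -ptx kernel.cu -o kernel.ptx
$ cat kernel.ptx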


Thank you for answering my question. So if I want the program to actually execute the N loop iterations, what I should do is disable compiler optimizations. Is that right?

That’s not the approach I would take. It might work, it might not. But it’s not going to give you a measurement that I would consider reliable or useful. It’s not reliable because I’m not sure you can truly “disable all optimizations”, and it’s not useful because it’s not measuring code that would actually be generated by the compiler when compiling for a production (optimized) target. Unless you actually intend to deliver production code with optimizations disabled, which I personally would not suggest.

Instead, cause your code to modify global state:

__global__ void kernel_1(double *r)
{
  double sum = 0.0;
  for (int i = 0; i < N; i++)
    sum = sum + tan(0.1) * tan(0.1);
  size_t idx = blockIdx.x*blockDim.x+threadIdx.x;
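  // the store to r below makes sum observable in global memory, so the compiler cannot simply discard the loop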
  r[idx] = sum;

}
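
For completeness, a minimal host-side sketch of launching this version (hypothetical launch configuration, error checking omitted; assumes the kernel above and the #define N from your code):

int main(){
  const int blocks = 1, threads = 256;            // hypothetical launch configuration
  double *r;
  cudaMalloc(&r, blocks*threads*sizeof(double));  // one result slot per thread
  kernel_1<<<blocks,threads>>>(r);
  cudaDeviceSynchronize();                        // wait for the kernel so it shows up in the profiler
  cudaFree(r);
  return 0;
}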

Even then, the compiler may make some observation about the loop that allows it to collapse or eliminate the loop. For instance, you and I can immediately calculate the result as:

N*tan(0.1)*tan(0.1)

without any need for a loop. Don’t assume that the compiler cannot do that or won’t do that. This sort of performance benchmarking may require considerable care. There are many questions on various forums about performance benchmarking of CUDA code, and the struggles that may arise to combat the compiler.
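
In other words, because the loop body is loop-invariant, the compiler is free to turn the kernel above into something equivalent to this (an illustrative sketch with a hypothetical name, not actual compiler output):

__global__ void kernel_1_collapsed(double *r)
{
  size_t idx = blockIdx.x*blockDim.x+threadIdx.x;
  double t = tan(0.1);          // loop-invariant: the same value every iteration
  r[idx] = N * t * t;           // mathematically the same as adding tan(0.1)*tan(0.1) to sum N times
}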

I sometimes find it necessary to inspect the binary code generated by the compiler, in order to be confident that I am measuring what I think I am measuring.
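
For example (a sketch; test.cu and test are hypothetical names, and the exact SASS you see depends on your GPU architecture and CUDA version), the machine code embedded in the executable can be dumped with cuobjdump:

$ nvcc -o test test.cu
$ cuobjdump -sass ./test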

To combat the compiler here, I might try something like this:

__global__ void kernel_1(double *r, double *start, int my_N)
{
  size_t idx = blockIdx.x*blockDim.x+threadIdx.x;
  double sum = start[idx];
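  // sum now depends on runtime data (start[idx]), and the trip count my_N is not a compile-time constant,
  // so the compiler cannot precompute the loop result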
  for (int i = 0; i < my_N; i++)
    sum = sum + tan(sum) * tan(1.1*sum);
  r[idx] = sum;

}

I still haven’t tested that to see what it will do. And yes, I understand it may not be exactly what you want to measure. Sometimes it can be difficult to concoct a test to measure exactly what you want to measure. Good luck!


appreciate your guidance, have a nice day

It seems to give plausible behavior:

$ cat t2105.cu
#include <cstdlib>

__global__ void kernel_1(double *r, double *start, int my_N)
{
  size_t idx = blockIdx.x*blockDim.x+threadIdx.x;
  double sum = start[idx];
  for (int i = 0; i < my_N; i++)
    sum = sum + tan(sum) * tan(1.1*sum);
  r[idx] = sum;

}

int main(int argc, char *argv[]){


  const int ds = 1;
  double *r, *s;
  int N = 1000;
  if (argc > 1) N = atoi(argv[1]);
  cudaMalloc(&r, ds*sizeof(double));
  cudaMalloc(&s, ds*sizeof(double));
  cudaMemset(r, 0, sizeof(double));
  cudaMemset(s, 0, sizeof(double));
  kernel_1<<<1,1>>>(r, s, N);
  cudaDeviceSynchronize();
}
$ nvcc -o t2105 t2105.cu
$ compute-sanitizer ./t2105
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$ nvprof ./t2105
==30554== NVPROF is profiling process 30554, command: ./t2105
==30554== Profiling application: ./t2105
==30554== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.17%  376.11us         1  376.11us  376.11us  376.11us  kernel_1(double*, double*, int)
                    0.83%  3.1370us         2  1.5680us  1.3450us  1.7920us  [CUDA memset]
      API calls:   97.34%  304.20ms         2  152.10ms  6.0470us  304.20ms  cudaMalloc
                    1.49%  4.6530ms         4  1.1632ms  595.25us  2.8309ms  cuDeviceTotalMem
                    0.86%  2.6898ms       404  6.6570us     315ns  266.95us  cuDeviceGetAttribute
                    0.14%  444.01us         4  111.00us  59.294us  252.11us  cuDeviceGetName
                    0.13%  412.80us         1  412.80us  412.80us  412.80us  cudaDeviceSynchronize
                    0.02%  49.130us         1  49.130us  49.130us  49.130us  cudaLaunchKernel
                    0.01%  32.822us         2  16.411us  5.3960us  27.426us  cudaMemset
                    0.01%  22.501us         4  5.6250us  3.0910us  9.0780us  cuDeviceGetPCIBusId
                    0.00%  9.6540us         8  1.2060us     420ns  4.4800us  cuDeviceGet
                    0.00%  3.4300us         4     857ns     637ns  1.2600us  cuDeviceGetUuid
                    0.00%  3.1300us         3  1.0430us     688ns  1.4400us  cuDeviceGetCount
$ nvprof ./t2105 100000
==30568== NVPROF is profiling process 30568, command: ./t2105 100000
==30568== Profiling application: ./t2105 100000
==30568== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  36.465ms         1  36.465ms  36.465ms  36.465ms  kernel_1(double*, double*, int)
                    0.01%  3.2000us         2  1.6000us  1.3760us  1.8240us  [CUDA memset]
      API calls:   89.69%  386.34ms         2  193.17ms  9.0400us  386.33ms  cudaMalloc
                    8.47%  36.473ms         1  36.473ms  36.473ms  36.473ms  cudaDeviceSynchronize
                    1.16%  5.0014ms         4  1.2504ms  575.62us  3.2349ms  cuDeviceTotalMem
                    0.57%  2.4416ms       404  6.0430us     297ns  270.53us  cuDeviceGetAttribute
                    0.09%  373.23us         4  93.307us  59.767us  190.16us  cuDeviceGetName
                    0.01%  61.702us         1  61.702us  61.702us  61.702us  cudaLaunchKernel
                    0.01%  41.713us         2  20.856us  7.2530us  34.460us  cudaMemset
                    0.01%  21.623us         4  5.4050us  3.0350us  11.475us  cuDeviceGetPCIBusId
                    0.00%  9.3370us         8  1.1670us     408ns  4.1640us  cuDeviceGet
                    0.00%  3.3720us         4     843ns     613ns  1.2360us  cuDeviceGetUuid
                    0.00%  3.1740us         3  1.0580us     526ns  1.6920us  cuDeviceGetCount
$
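
As a sanity check on the numbers: going from my_N = 1000 to my_N = 100000 is 100x the loop iterations, and the kernel time went from 376.11us to 36.465ms, a factor of roughly 97, so the runtime now scales with the loop count about as expected.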
