I tried N = 1000 and N = 1000000, and the time was almost unchanged. Here is the code and the result; thanks for your help.

#define N 10000
__global__ void kernel_1()
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum = sum + tan(0.1) * tan(0.1);
}
kernel_1 through kernel_4 are the same.
Here is the nvprof result:
N = 1000
Type Time(%) Time Calls Avg Min Max Name
28.19% 17.248us 16 1.0780us 992ns 2.0160us kernel_1(void)
23.95% 14.656us 16 916ns 896ns 1.2160us kernel_3(void)
23.95% 14.656us 16 916ns 864ns 1.2160us kernel_4(void)
23.90% 14.624us 16 914ns 864ns 1.2160us kernel_2(void)
N = 1000000
Type Time(%) Time Calls Avg Min Max Name
28.20% 17.344us 16 1.0840us 992ns 2.0160us kernel_1(void)
23.93% 14.720us 16 920ns 896ns 1.2160us kernel_3(void)
23.93% 14.720us 16 920ns 864ns 1.2160us kernel_4(void)
23.93% 14.720us 16 920ns 864ns 1.2160us kernel_2(void)
That’s not the approach I would take. It might work, it might not, but it’s not going to give you a measurement that I would consider reliable or useful. It’s not reliable because I’m not sure you can truly “disable all optimizations”, and it’s not useful because it’s not measuring code that would actually be generated by the compiler when compiling for a production (optimized) target. Unless you actually intend to deliver production code with optimizations disabled, which I personally would not suggest.
Instead, cause your code to modify global state so the computation has an observable side effect:
__global__ void kernel_1(double *r)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum = sum + tan(0.1) * tan(0.1);
    // storing the result makes the loop observable, so the
    // compiler cannot simply discard it as dead code
    size_t idx = blockIdx.x*blockDim.x+threadIdx.x;
    r[idx] = sum;
}
Even then, the compiler may make some observation about the loop that allows it to collapse or eliminate the loop. For instance, you and I can immediately calculate the result as:
N*tan(0.1)*tan(0.1)
without any need for a loop. Don’t assume that the compiler cannot do that or won’t do that. This sort of performance benchmarking may require considerable care. There are many questions on various forums about performance benchmarking of CUDA code, and the struggles that may arise to combat the compiler.
I sometimes find it necessary to inspect the binary code generated by the compiler, in order to be confident that I am measuring what I think I am measuring.
To combat the compiler here, I might try something like this:
__global__ void kernel_1(double *r, double *start, int my_N)
{
    size_t idx = blockIdx.x*blockDim.x+threadIdx.x;
    // each thread starts from a runtime value the compiler cannot know,
    // and each iteration depends on the previous one
    double sum = start[idx];
    for (int i = 0; i < my_N; i++)
        sum = sum + tan(sum) * tan(1.1*sum);
    r[idx] = sum;
}
I still haven’t tested that to see what it will do. And yes, I understand it may not be exactly what you want to measure. Sometimes it can be difficult to concoct a test to measure exactly what you want to measure. Good luck!