Unexpected slow-down on executing kernel in CUDA

I have a CUDA file kernel.cu and a C++ file program.cpp. Here is the relevant portion of the code:

int8_t compute(uint64_t *buff){
   cudaError_t rtnval;
   ulong *d_buf;
   // allocate the device buffer
   rtnval = cudaMalloc((void**)&d_buf, BLOBSIZE*sizeof(ulong));
   if (rtnval != cudaSuccess)
      goto label;
   // copy the input to the device
   rtnval = cudaMemcpy(d_buf, (ulong*) buff, BLOBSIZE*sizeof(ulong), cudaMemcpyHostToDevice);
   if (rtnval != cudaSuccess)
      goto label;
   krak<<<blocks,threads>>>(d_buf);    //kernel
   memset(buff, 0x00, sizeof(uint64_t)*BLOBSIZE);
   // copy the result back to the host
   rtnval = cudaMemcpy(buff,(uint64_t*) d_buf, BLOBSIZE*sizeof(ulong), cudaMemcpyDeviceToHost);
   if (rtnval != cudaSuccess)
      goto label;
   return 0x00;
label:
   // common error path: report the failing CUDA call
   char msg[100]={0x00};
   sprintf(msg, "error: %s",cudaGetErrorName(rtnval));
   printf("%s\n", msg);
   return 0x01;
}

When I call compute() in kernel.cu from main() in program.cpp, the call takes n seconds to complete.
When I instead write main() in kernel.cu itself and call compute() directly from there, the call takes m seconds.
Unexpectedly, n is far greater than m: n is about 25 sec, while m is about 0.18 sec.
What could be the reason for this?

Without reproducible code and the complete compilation commands, no one can tell you for sure. That being said, a few things could be happening:

  1. If the code in the separate file is called from a linked library, make sure that you pass the same optimization flags (-O2 / -O3) to every compilation step, especially when the C++/CUDA host code is compiled separately. Missing optimization flags can slow down memory copies; I ran into exactly this recently.

  2. Make sure that you are compiling in release mode rather than debug mode in both cases (no -g or -G flags for debug symbols, etc.).
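One way to check the two points above from inside the program is to have each translation unit report how it was built. The following is only a minimal sketch: `build_config` is a hypothetical helper, `__OPTIMIZE__` is specific to GCC and Clang (defined when optimization is enabled), `NDEBUG` is only a convention for release builds, and `__CUDACC_DEBUG__` is assumed to be set by nvcc when -G is used. The host -g flag defines no macro, so it cannot be detected this way.

```cpp
#include <string>

// Hypothetical helper: describes how this translation unit was compiled.
// Place a copy in both kernel.cu and program.cpp and print both results.
std::string build_config() {
    std::string s;
#if defined(__OPTIMIZE__)       // GCC/Clang: set when optimization is enabled
    s += "host-optimized";
#else
    s += "host-unoptimized";
#endif
#if defined(NDEBUG)             // conventional release-build macro
    s += " ndebug";
#else
    s += " asserts-on";
#endif
#if defined(__CUDACC_DEBUG__)   // assumed: set by nvcc under -G (device debug)
    s += " device-debug";
#endif
    return s;
}
```

If the strings printed from the two files differ, or differ between the fast and the slow build, the compilation flags are the first thing to look at.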

In short, compare the actual command lines used for compiling in both cases; you will most likely find that the arguments differ.
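To confirm that a change in flags actually closes the gap, it may help to time compute() in exactly the same way in both builds. The sketch below is an assumption about how that could be done, not the measurement used in the question; `time_call` is a hypothetical helper built on `std::chrono::steady_clock`.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: wall-clock seconds taken by a single call to f.
double time_call(const std::function<void()>& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In each build, something like `time_call([&]{ compute(buff); })` could be run twice and both values printed. The first call in a process typically also pays for one-time CUDA runtime initialization, so the second value is usually the fairer one to compare.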

I have exactly the same issue as you.
Basically, if you call a kernel from a wrapper function instead of calling it directly, you get a slowdown,
even if the wrapper function is an inline function.
I don’t understand why. I wonder if you have figured it out.
Thank you!

Xuhao Chen