Reduction in a kernel's execution time that does not make sense


I have two kernels, kernel1 and kernel2. kernel1's time is 88.76 ms and kernel2's time is 4.78 ms. But when kernel1 runs immediately before kernel2, kernel2's time drops to 2.56 ms.
Why is this happening? The two kernels use different arrays.
Does anyone have an idea?

Thank you in advance!
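For context, the times above were measured roughly like the following sketch (the kernel bodies and array size are placeholders, not the actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels operating on separate arrays.
__global__ void kernel1(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f + 1.0f;
}

__global__ void kernel2(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = b[i] + 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time kernel1.
    cudaEventRecord(start);
    kernel1<<<(n + 255) / 256, 256>>>(a, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms1;
    cudaEventElapsedTime(&ms1, start, stop);

    // Time kernel2 immediately afterwards.
    cudaEventRecord(start);
    kernel2<<<(n + 255) / 256, 256>>>(b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms2;
    cudaEventElapsedTime(&ms2, start, stop);

    printf("kernel1: %.2f ms, kernel2: %.2f ms\n", ms1, ms2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```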


  1. caching

  2. managed memory effect

  3. incorrect measurement

  4. code defect

  5. some aspect of CUDA startup overhead, such as JIT compilation

Hi txbob!! Thanks for your answer!

  1. How can caching influence kernel2's time when kernel1's and kernel2's data are different?
    How can kernel2's data be cached before kernel2 executes?

  2. Do you mean the way I manage the host and device memory?

  3. Do you mean incorrect measurement of execution time? No, this is not possible.

  4. I don’t think there is a code defect, since both kernels produce correct results.

  5. I cannot understand this one. It would help me a lot if you could explain it.

Sorry for the number of questions, but I am new to CUDA and this is very weird to me.

  1. Data can be cached by cudaMemcpy, or anything that touches memory. If the sequence of cudaMemcpy operations changes as a result of the kernel reordering, that could affect things.

  2. I mean managed memory. If you don’t know what managed memory is (try googling “CUDA managed memory” and reading any of the first few hits, rather than making me explain every single term), you probably aren’t using it.

  3. OK

  4. OK

  5. CUDA has various kinds of startup overhead that any program must incur. CUDA’s lazy initialization allows this cost to be smeared across the beginning of your program, so the first timed operations can appear slower than later ones.
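On point 5, one common way to keep that one-time startup cost out of kernel timings (a sketch, not the only approach) is to force context creation and do a warm-up launch before the timed region:

```cuda
#include <cuda_runtime.h>

// Empty kernel used only to absorb one-time launch overhead.
__global__ void warmup() {}

int main() {
    // cudaFree(0) is a common idiom to force CUDA context creation
    // up front, so initialization cost is not charged to the first
    // timed kernel.
    cudaFree(0);

    // Warm-up launch: absorbs one-time overhead such as JIT
    // compilation (if the binary lacks SASS for this GPU).
    warmup<<<1, 1>>>();
    cudaDeviceSynchronize();

    // ... now launch and time kernel1 and kernel2 here ...
    return 0;
}
```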

If there’s something I haven’t explained, you might want to try google first.

OK, thanks!