When an application runs, the first execution of a kernel takes longer than the second.
I am confused by this behavior. How can I avoid it?
It may be CUDA initialization time. It could also be JIT compile time, which is effectively part of initialization time. It could also be a cache population effect.
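A rough way to see this for yourself is to time the same kernel twice and treat the first launch as a warm-up. The sketch below is only an illustration under some assumptions: the `scale` kernel, the `time_launch_ms` helper, and the buffer size are made up for the example, and the actual timings depend on your GPU, driver, and build flags.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration: scales each element in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: launches the kernel once and returns the time
// between two CUDA events recorded around the launch, in milliseconds.
static float time_launch_ms(float *d_data, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;  // assumed problem size
    float *d_data = nullptr;

    // Common idiom: touch the runtime early so context creation
    // is not charged to the first timed kernel.
    cudaFree(0);

    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // The first launch may still include one-time costs (JIT, module load, cold caches).
    float first = time_launch_ms(d_data, n);
    // The second launch is usually closer to the steady-state time.
    float second = time_launch_ms(d_data, n);

    printf("first launch:  %.3f ms\n", first);
    printf("second launch: %.3f ms\n", second);

    cudaFree(d_data);
    return 0;
}
```

If the gap comes mostly from JIT compilation, building with machine code for your actual GPU architecture (for example through nvcc's `-arch` option) may reduce it. Otherwise, the usual approach is to do one untimed warm-up launch and measure only the later ones.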
What is the cache population effect? Are there any books that describe it? Thank you very much, txbob.