I am seeing sporadic behavior in some of the MATLAB GPU toolkit algorithms. I have an NVIDIA GeForce GTX 860M in my laptop. The problem is this:
There is a delay of about one second in the first GPU computation compared to subsequent ones. Since the first run is slower, I suspected that cache preparation might be the culprit. For CPUs, this behavior is well known, but I am not sure why the first run on a GPU also gives slower results. Will a CUDA implementation also have a delay in the first run? What is this problem called? Is there any way to formally address it other than running the computation a couple of times?
This is a known issue with MATLAB, not CUDA. I have talked directly to the MathWorks people, and they acknowledged the issue but did not offer a fix.
Are you calling CUDA through a MEX interface or using MATLAB’s built-in GPU functionality?
After that first call, any additional latency from the overhead of a CUDA MEX file becomes very small, so at worst you just have to get past the first ‘initialization’ call.
If I need to formally mention this aberration in my write-up, what should I say? Can we say it is due to “cache warm-up”?
I am using the built-in GPU functionality for now. However, I intend to write my own CUDA code to avoid such variations. I hope I can avoid these fluctuations if I write my own code. Any heads-up on bad practices to avoid?
Thanks for the suggestion, I will not consider the first run in my results then.
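For anyone timing their own CUDA code, the "skip the first run" idea might look roughly like the sketch below. It is only an illustration under assumptions: the `scale` kernel, the array size, and the run count are made up for the example and are not from this thread. Run 0 is treated as a warm-up and left out of the reported mean.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to time.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;   // assumed problem size
    const int runs = 6;      // run 0 is the warm-up, runs 1..5 are measured

    float *d_x = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float total_ms = 0.0f;
    for (int r = 0; r < runs; ++r) {
        cudaEventRecord(start);
        scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %d: %.3f ms%s\n", r, ms, r == 0 ? " (warm-up, excluded)" : "");

        if (r > 0) total_ms += ms;    // exclude the first run from the statistics
    }
    printf("mean over %d measured runs: %.3f ms\n", runs - 1, total_ms / (runs - 1));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

Note that the context is already created by the `cudaMalloc()` call here, so the warm-up run mainly absorbs whatever first-launch overhead remains; how large that is will depend on the system.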
I do not use MATLAB, but as a guess the “hiccup” on first GPU use is likely some kind of first-use software initialization overhead, rather than anything related to caches in particular.
I will note that CUDA itself, as a stateful software layer, has an initialization overhead caused by context creation. This can be particularly noticeable in systems with large memory or when a significant amount of JIT compilation takes place. Since context initialization is usually lazy, triggered by the first CUDA API call, often a cudaMalloc(), it may be convenient to trigger this event explicitly at an opportune time, by a call to cudaFree(0) for example.
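A minimal sketch of that explicit trigger, assuming a plain CUDA runtime API program: the host-side timing with `std::chrono` and the `elapsed_ms` helper are additions for illustration only. The idea is that the `cudaFree(0)` call absorbs the context creation cost, so a later `cudaMalloc()` should not have to pay it.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: milliseconds elapsed since t0, measured on the host.
static double elapsed_ms(std::chrono::steady_clock::time_point t0) {
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // First CUDA runtime call: forces the lazy context creation to happen here.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA initialization failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("cudaFree(0) (context creation): %.3f ms\n", elapsed_ms(t0));

    // A later allocation no longer includes the context creation cost.
    void *d_buf = nullptr;
    t0 = std::chrono::steady_clock::now();
    err = cudaMalloc(&d_buf, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("cudaMalloc after initialization: %.3f ms\n", elapsed_ms(t0));

    cudaFree(d_buf);
    return 0;
}
```

Placing the `cudaFree(0)` call at program start, before any timed region, keeps the one-time cost out of the measurements without having to discard a run.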