Hi all,
I am currently trying to add CUDA computations to a multithreaded application that mainly performs two types of computation: one on a large dataset (2.6MB) and one on a small dataset (330KB).
There is basically one OS thread for the large computations and one for the little ones, and each of them implements the following model: copy to device → call kernel → copy to host. A stripped-down sketch of this pattern follows (function and variable names are placeholders, not my exact code):
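```cpp
#include <cuda_runtime.h>

__global__ void process(float *data, int n);  // real kernel omitted

// Each worker thread runs this on its own stream; device buffers are
// allocated once at startup, outside this hot path.
void runJob(const float *hostIn, float *hostOut,
            float *devBuf, int n, cudaStream_t stream)
{
    // copy to device -> call kernel -> copy to host
    cudaMemcpyAsync(devBuf, hostIn, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
    cudaMemcpyAsync(hostOut, devBuf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
}
```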
When the “little” thread makes its API calls alone, it is “quite” fast: around 150µs for each copy and 60µs for the kernel.
The first thing I don’t understand is why the API calls are so slow, when in the GPU rows of the profiling window the kernel executes in only 10µs and each copy takes only 30µs.
But when the little and large threads make their calls at the same time, although they target different streams, the little thread’s API calls get even worse (1000µs), as if they were waiting for the large thread’s API calls to finish (cudaMemcpyAsync on the picture).
I should add that I am using CUDA 5.0 with a GTX 680 and driver 313.18 under Linux. Sorry, by the way, for the picture quality…
I was hoping someone could tell me whether this situation is an occurrence of the implicit synchronization that is described here.
And what the workarounds are, because I have already reduced the number of threads in my application as far as I can, and that helped me save a bit of time on other calls. The only workaround I can think of so far is pinning the host buffers, since as I understand it cudaMemcpyAsync falls back to a blocking, staged copy when the host memory is pageable; a minimal sketch of that change (buffer names are illustrative, the sizes are the ones from my application):
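```cpp
// Pinned (page-locked) host buffers, allocated once at startup.
float *littleBuf, *largeBuf;
cudaHostAlloc((void **)&littleBuf, 330 * 1024, cudaHostAllocDefault);   // ~330KB job
cudaHostAlloc((void **)&largeBuf, 2600 * 1024, cudaHostAllocDefault);  // ~2.6MB job
// ... fill the buffers and launch the per-thread pattern above ...
cudaFreeHost(littleBuf);
cudaFreeHost(largeBuf);
```
Is pinning likely to help here, or is something else serializing the calls?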
Thank you very much for your help