Good afternoon all,
I am working on a project where all CUDA calls are serial, and I was thinking about parallelizing everything to get a shorter cycle time.
After some experiments, the DLL I made was working perfectly with concurrency, but instead of getting shorter, the cycle time went up by approximately 10%.
To visualize everything, here is a screenshot :)
P.S. I am not able to post images yet, so sorry for the link :)
At the top you can see the serial cycle, which takes about 0.92 ms from the start call to the last ‘memcopy’.
At the bottom you can see the cycle with concurrency, which takes about 1.15 ms from the start call to the last ‘memcopy’.
Now I can see that the ‘cycle’ itself, when I check the streams, is shorter. But it looks like it takes longer until everything starts.
I also know that my memory copies from the device to the host take too much time compared to the rest of the cycle. This is something I will look into later ;)
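For anyone who wants something concrete to compare against, here is a minimal sketch of the usual multi-stream pattern. It is not the actual DLL code: the kernel `scale`, the buffer names, the number of streams and the chunk size are all made up for illustration. It assumes the host buffer is pinned with `cudaMallocHost`, because `cudaMemcpyAsync` on ordinary pageable host memory generally cannot overlap with other work, and it assumes the streams are created once up front and reused, rather than created inside every cycle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real per-cycle kernels.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4;        // assumed number of independent work items
    const int kChunk = 1 << 16;    // assumed elements per work item
    const size_t bytes = kChunk * sizeof(float);

    float* hostBuf = nullptr;
    float* devBuf = nullptr;
    // Pinned (page-locked) host memory, so the async copies can overlap.
    CHECK(cudaMallocHost(&hostBuf, kStreams * bytes));
    CHECK(cudaMalloc(&devBuf, kStreams * bytes));
    for (int i = 0; i < kStreams * kChunk; ++i) hostBuf[i] = 1.0f;

    // Create the streams once, outside the timed cycle, and reuse them.
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // One cycle: each stream does copy in, kernel, copy out on its own chunk.
    for (int s = 0; s < kStreams; ++s) {
        float* h = hostBuf + s * kChunk;
        float* d = devBuf + s * kChunk;
        CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(d, kChunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[s]));
    }
    // Wait for all streams before the host reads the results.
    CHECK(cudaDeviceSynchronize());
    printf("hostBuf[0] = %f\n", hostBuf[0]);

    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

With work items this small, the extra API calls per stream can easily cost more than the overlap gains, so a sketch like this is only a starting point for comparing timelines in the profiler.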
Can anyone advise me whether I am doing something wrong and what I should change?
Is it normal that it takes about half the cycle time from the start of the ‘Runtime API’ call to where the actual cycle starts in CUDA?
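One way to put a number on that start-up gap, independent of the profiler, is to time the same work twice: once on the host with a wall clock and once on the GPU with CUDA events. The difference is roughly the launch and synchronization overhead on the host side. The following is only a sketch under assumptions: `dummyKernel` and the buffer size are placeholders, not anything from my project.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real work of one cycle.
__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;  // assumed problem size
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not part of the measurement.
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Host wall clock around the whole cycle, CUDA events around the GPU work.
    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start, 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    auto t1 = std::chrono::steady_clock::now();

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);
    double hostMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // hostMs includes API call and synchronization overhead; gpuMs does not.
    printf("host wall time: %.3f ms, GPU event time: %.3f ms\n", hostMs, gpuMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

If the host time is much larger than the GPU event time for a cycle that only lasts about a millisecond, the overhead is on the host side of the calls rather than in the kernels themselves.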
I normally run all of this on a GTX 950; this test was done on my mobile GTX 960. If someone needs additional information, feel free to ask and I will respond as soon as possible :)
Thanks in advance!