Help with concurrency: no improvement in total cycle time

Good afternoon all,

I am working on a project where all CUDA calls are serial, so I was thinking about parallelizing everything to get a shorter cycle time.
After some experiments, the DLL I made was working perfectly with concurrency, but instead of a shorter cycle, the time went up by approximately 10%.

To visualize everything, a screenshot :)

PS: I am not able to post images yet, sorry for the link :)

On the top you can see the serial cycle, which takes about 0.92 ms from the start of the call to the last memcpy.
On the bottom you can see the cycle with concurrency, which takes about 1.15 ms from the start of the call to the last memcpy.

When I check the streams, I can see that the cycle itself is shorter, but it looks like it takes longer until everything starts.
I also know that my memory copies from device to host take too long compared to the rest of the cycle. This is something I will look at later ;)

Can anyone advise me whether I am doing something wrong, and what I have to change?
Is it normal that it takes about half the cycle time from the start of the runtime API call to where the actual cycle starts in CUDA?

I normally run all of this on a GTX 950; this test was done on my mobile GTX 960. If anyone needs additional information, feel free to ask, and I will respond as soon as possible :)

Thanks in advance!

No one :( ?

Generally speaking, people on the internet are not standing by waiting to answer your questions. That’s what paid support is for.

Without code to look at, it’s impossible to comment on your approach in detail. What are “CUDA calls”? Kernel launches? How are you trying to “parallelize CUDA calls”? Multiple threads launching kernels on the same GPU? If you have multiple threads accessing a shared resource, such as a GPU, you are likely incurring synchronization overhead. So if the shared resource is already the performance-limiting factor, this approach may decrease overall performance.

As long as your kernels have a sufficient number of thread blocks to utilize the entire GPU, a single kernel occupies 100% of the GPU, so kernel work overlapping other kernel work happens rarely. If you want to overlap host->device or device->host copies with kernel work, use CUDA streams, and couple them with a double-buffering scheme if necessary. If you need to overlap host->device copies with device->host copies, make sure to use a GPU with dual copy engines.
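To make the stream plus double-buffering pattern concrete, here is a minimal sketch. This is not the poster's code: the kernel `process`, the chunk count, and the sizes are all made-up placeholders; only the structure (two streams, two device buffers, pinned host memory, async copies) is the point.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: stands in for whatever per-chunk work the real code does.
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;   // elements per chunk (illustrative)
    const int CHUNKS = 8;    // number of chunks to stream through
    float *h_buf, *d_buf[2];
    cudaStream_t stream[2];

    // Pinned host memory is required for cudaMemcpyAsync to actually
    // overlap with kernel execution; pageable memory falls back to
    // synchronous staging behavior.
    cudaMallocHost(&h_buf, CHUNKS * N * sizeof(float));
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], N * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int c = 0; c < CHUNKS; ++c) {
        int s = c % 2;  // alternate buffers/streams: while stream 0 computes,
                        // stream 1 can be copying, and vice versa
        float *h = h_buf + (size_t)c * N;
        cudaMemcpyAsync(d_buf[s], h, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(N + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], N);
        cudaMemcpyAsync(h, d_buf[s], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_buf);
    return 0;
}
```

Within one stream, the copy-in, kernel, and copy-out for a chunk still run in order; the overlap comes from different streams' work executing concurrently, which is exactly what a profiler timeline should show as staggered copy and compute rows.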

If host-side overhead is an issue for your CUDA code, use a CPU with very high single-thread performance (a base frequency > 3.5 GHz is recommended at this time); a lot of the host-side overhead typically consists of operating system API calls that execute in serial fashion.