Help with concurrency: no improvement in total cycle time

Good afternoon all,

I am working on a project where all CUDA calls are serial, so I was thinking about parallelizing everything to get a shorter cycle time.
After some experiments, the DLL I made was working perfectly with concurrency, but instead of a shorter cycle, the time went up by approximately 10%.

To visualize everything, a screenshot :)

PS: I am not able to post images yet, sorry for the link :)

On the top you can see the serial cycle, which takes about 0.92 ms from the start of the call to the last memcpy.
On the bottom you can see the cycle with concurrency, which takes about 1.15 ms from the start of the call to the last memcpy.

When I check the streams, I can see that the cycle itself is shorter, but it looks like it takes longer until everything starts.
I also know that my memory copies from device to host take too long compared to the rest of the cycle. This is something I will look at later ;)

Can anyone advise me whether I am doing something wrong, and what I have to change?
Is it normal that it takes about half the cycle time from the start of the runtime API call to where the actual cycle starts in CUDA?

I normally run all of this on a GTX 950; this test was done on my mobile GTX 960. If anyone needs additional information, feel free to ask, and I will respond as soon as possible :)

Thanks in advance!

No one :( ?

Generally speaking, people on the internet are not standing by waiting to answer your questions. That’s what paid support is for.

Without code to look at, it’s impossible to comment on your approach in detail. What are “CUDA calls”? Kernel launches? How are you trying to “parallelize CUDA calls”? Multiple threads launching kernels on the same GPU? If you have multiple threads accessing a shared resource, such as a GPU, you are likely incurring synchronization overhead. So if the shared resource is already the performance-limiting factor, this approach may decrease overall performance.

As long as your kernels have a sufficient number of thread blocks to utilize the entire GPU, a single kernel occupies 100% of the GPU, so kernel work overlapping other kernel work happens rarely. If you want to overlap host->device or device->host copies with kernel work, use CUDA streams, and couple them with a double-buffering scheme if necessary. If you need to overlap host->device copies with device->host copies, make sure to use a GPU with dual copy engines.
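To make the stream plus double-buffering pattern concrete, here is a minimal sketch. This is not the poster's code: the kernel `process`, the chunk count, and the sizes are all made-up placeholders; only the structure (two streams, two device buffers, pinned host memory, async copies) is the point.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: stands in for whatever per-chunk work the real code does.
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;   // elements per chunk (illustrative)
    const int CHUNKS = 8;    // number of chunks to stream through
    float *h_buf, *d_buf[2];
    cudaStream_t stream[2];

    // Pinned host memory is required for cudaMemcpyAsync to actually
    // overlap with kernel execution; pageable memory falls back to
    // synchronous staging behavior.
    cudaMallocHost(&h_buf, CHUNKS * N * sizeof(float));
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], N * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int c = 0; c < CHUNKS; ++c) {
        int s = c % 2;  // alternate buffers/streams: while stream 0 computes,
                        // stream 1 can be copying, and vice versa
        float *h = h_buf + (size_t)c * N;
        cudaMemcpyAsync(d_buf[s], h, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(N + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], N);
        cudaMemcpyAsync(h, d_buf[s], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_buf);
    return 0;
}
```

Within one stream, the copy-in, kernel, and copy-out for a chunk still run in order; the overlap comes from different streams' work executing concurrently, which is exactly what a profiler timeline should show as staggered copy and compute rows.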

If host-side overhead is an issue for your CUDA code, use a CPU with very high single-thread performance (a base frequency > 3.5 GHz is recommended at this time); a lot of the host-side overhead typically consists of operating system API calls that execute in serial fashion.