Kernel Functions Blocking Multithreaded Application?

aven_omega · August 12, 2021, 7:46pm

I have a multithreaded application where each thread is detached and therefore operating entirely independently. It’s very basic: pull some numbers out of an algorithm and then throw them at CUDA for some parallel processing. At the start of execution, the program divides up its dataset into N slices, and I typically give the program N threads to work with, the theory being that this would reduce processing time by something approaching a factor of N.

However, when I time my program, I find that giving it only one thread is faster 100% of the time than giving it one thread per data slice. One thread is able to process every data slice about 24% faster than multiple threads can.

Given that the data generation algorithm is entirely CPU bound and is far removed from the hot path, this makes me think that calling kernel functions is blocking, so when one thread calls the GPU for a compute, it blocks any other thread while waiting for a result, essentially bottlenecking threads around the GPU interface. That’s the only explanation I can think of for the parallel execution ending up not only taking as long, but longer.

Is this how it works? If so, is there any way to decrease the amount of bottlenecking of threads interacting with the GPU? This is my first and only CUDA project thus far, so beginner-level documentation or resources are appreciated.

njuffa · August 12, 2021, 10:00pm

I am not sure I understand the description correctly. CUDA kernel invocations are non-blocking. If one thread is already able to utilize a resource fully, splitting that work across N threads using that resource will likely cause a decrease in performance, as now there is overhead from coordinating the access of N threads to that shared resource.
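To make the "non-blocking" point concrete, here is a minimal sketch that times a kernel launch call separately from the wait for its completion. The kernel `busy_kernel`, the helper `measure_launch`, and the sizes are hypothetical and chosen only for illustration; nothing here comes from the original poster's program. On a typical setup the launch call should return long before the kernel finishes, but the actual numbers depend on the GPU and driver.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: burns GPU time so the launch/completion gap is visible.
__global__ void busy_kernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Times the launch call alone and the launch plus completion, in milliseconds.
cudaError_t measure_launch(double *launch_ms, double *total_ms) {
    using steady = std::chrono::steady_clock;
    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;
    float *d = nullptr;
    cudaError_t err = cudaMalloc(&d, n * sizeof(float));
    if (err != cudaSuccess) return err;
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so one-time setup cost is not part of the measurement.
    busy_kernel<<<blocks, 256>>>(d, n, 1);
    cudaDeviceSynchronize();

    auto t0 = steady::now();
    busy_kernel<<<blocks, 256>>>(d, n, 10000);
    auto t1 = steady::now();             // the launch call has returned
    err = cudaDeviceSynchronize();       // the host waits here for the GPU
    auto t2 = steady::now();

    *launch_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    *total_ms  = std::chrono::duration<double, std::milli>(t2 - t0).count();
    cudaFree(d);
    return err;
}

int main() {
    double launch_ms = 0.0, total_ms = 0.0;
    cudaError_t err = measure_launch(&launch_ms, &total_ms);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("launch call: %.3f ms, launch + completion: %.3f ms\n",
                launch_ms, total_ms);
    return 0;
}
```

Note that the blocking in this sketch comes from `cudaDeviceSynchronize`, not from the launch itself. Other calls that wait on the GPU, such as a plain `cudaMemcpy`, behave similarly from the host's point of view.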

You should be able to gain insights into what is going on by using the CUDA profiler. Have you tried it?

Robert_Crovella · August 12, 2021, 11:16pm

One reason I know of to break a workload up as you are describing is to enable copy/compute overlap. The other reason I can think of is if you have multiple GPUs. No idea if you have done that or if it is even relevant for your code. But if you haven’t done a design for copy/compute overlap, and/or don’t have multiple GPUs to feed, I would reiterate what njuffa said, that breaking up a kernel into N pieces is not a good idea.

aven_omega · August 13, 2021, 12:53am

Each thread gets a seed that it uses to generate data which it sends to the GPU for processing. In the single-threaded mode, the same thread finishes the first seed, moves to the second, and so on. On the computation side, they should be completely independent aside from RAM access, and of course GPU usage.

How can I use Nsight or another tool to see all the functions running on the GPU that my process spawns? It would be really useful to see creation and deletion times along with identifiers.

There may well be some other resource that I don’t know about yet that is blocking, but being able to see a history of what is running on the GPU would help. I would expect that with some shared resource in a parallel computation I would be somewhat slower than 1/N time, but going slower than serial processing makes me think something is blocking.

In my program, each thread has its own completely separate input dataset but is invoking its own kernel calls. Are you saying that the kernel can’t handle multiple requests like that and ends up blocking invocations that it isn’t ready for?

Robert_Crovella · August 13, 2021, 12:57am

Nsight Systems is the tool you want to start with.

Here is an introductory blog/tutorial.

If you have questions about how to use the profiler, I suggest asking those on the Nsight Systems forum.

aven_omega · August 14, 2021, 11:14pm

I started up Nsight and found that when I run with multiple threads, each thread spends a large amount of time in a blocked state. The blocked state shows up as constant spikes on the timeline.

After a brief look around I think this is because each thread is calling cudaMemcpy through the default stream. So I need to figure out a way to use multiple streams? I actually want the CPU to wait until the GPU returns a calculation, so an async memcpy isn’t what I’m looking for (I think) unless that is what allows me to use multiple streams.

njuffa · August 14, 2021, 11:27pm

[1] Yes, you would want to use multiple non-default streams
[2] Yes, you would want to use async memcpys along with the multiple streams
[3] No, this won’t increase performance unless a single-threaded version is not able to fully exploit available GPU resources. You could profile the single-threaded version to find out whether that is the case or not.
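Points [1] and [2] can be sketched roughly as follows: each host thread creates its own non-default stream, queues its copies and kernel into that stream with `cudaMemcpyAsync`, and then calls `cudaStreamSynchronize`, which makes only that host thread wait for its own result. The kernel `scale`, the helpers `process_slice` and `run_slices`, and the buffer sizes are hypothetical stand-ins, not the original poster's code, and the threads are joined here rather than detached so the results can be checked.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-slice computation.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Work done by one host thread: its own stream and its own buffers.
void process_slice(int slice, int n, char *ok) {
    const size_t bytes = size_t(n) * sizeof(float);
    cudaStream_t stream;
    float *h = nullptr, *d = nullptr;
    *ok = 0;
    if (cudaStreamCreate(&stream) != cudaSuccess) return;
    // Pinned host memory lets the async copies overlap with other work.
    if (cudaMallocHost(&h, bytes) == cudaSuccess &&
        cudaMalloc(&d, bytes) == cudaSuccess) {
        for (int i = 0; i < n; ++i) h[i] = float(slice + 1);

        // All three operations are queued in this thread's stream, in order.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 2.0f);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

        // Blocks only this host thread until this stream's work is done.
        if (cudaStreamSynchronize(stream) == cudaSuccess) {
            *ok = 1;
            for (int i = 0; i < n; ++i)
                if (h[i] != 2.0f * float(slice + 1)) *ok = 0;
        }
    }
    if (d) cudaFree(d);
    if (h) cudaFreeHost(h);
    cudaStreamDestroy(stream);
}

// Launches one host thread per slice and reports whether all slices succeeded.
bool run_slices(int n_threads, int n) {
    std::vector<char> ok(n_threads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t)
        threads.emplace_back(process_slice, t, n, &ok[t]);
    for (auto &th : threads) th.join();
    for (char c : ok)
        if (!c) return false;
    return true;
}

int main() {
    bool ok = run_slices(4, 1 << 20);
    std::printf("%s\n", ok ? "all slices ok" : "a slice failed");
    return ok ? 0 : 1;
}
```

In a real application the allocations would probably be done once up front, since calls such as `cudaMalloc` and `cudaFree` can themselves synchronize with the device. As point [3] says, none of this is expected to help if a single thread already keeps the GPU fully busy.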

aven_omega · August 14, 2021, 11:40pm

I tried turning on per thread default stream and it tanked my application. Interesting.

Why do I need to async memcpy with multiple streams?

What are the metrics I should be looking at for whether or not the GPU is being underutilized by a single thread?

Robert_Crovella · August 14, 2021, 11:49pm

BTW for people doing multithreaded kernel launches, I strongly encourage them to use the latest released CUDA version (currently 11.4.1). Recent CUDA versions have had some incremental improvements in this area, here is an example.

njuffa · August 15, 2021, 12:24am

Forum threads are no substitute for tutorials and/or working through examples with the assistance of documentation. Maybe the following can get you started in the right direction:

https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

aven_omega · August 15, 2021, 5:18am

That very page is where I found out that I needed to use multiple streams, but using default stream per thread still is not preventing blocking.

All the RTFM in the world does not help finding the right manual sadly.
