I have a multithreaded application where each thread is detached and therefore operates entirely independently. It’s very basic: pull some numbers out of an algorithm, then hand them to CUDA for parallel processing. At the start of execution, the program divides its dataset into N slices, and I typically give the program N threads to work with, the theory being that this would reduce processing time by something approaching a factor of N.
However, when I time my program, I find that giving it only 1 thread is faster 100% of the time than giving it one thread per data slice. A single thread processes every data slice about 24% faster than multiple threads do.
Given that the data generation algorithm is entirely CPU bound and far removed from the hot path, this makes me think that kernel calls are blocking: when one thread calls the GPU for a computation, it blocks every other thread while waiting for the result, essentially bottlenecking the threads around the GPU interface. That’s the only explanation I can think of for the parallel execution not just taking as long, but longer.
Is this how it works? If so, is there any way to reduce how much threads block each other when interacting with the GPU? This is my first and only CUDA project so far, so beginner-level documentation or resources are appreciated.
I am not sure I understand the description correctly. CUDA kernel invocations are non-blocking. If one thread is already able to fully utilize a resource, splitting that work across N threads using the same resource will likely decrease performance, since there is now overhead from coordinating the access of N threads to that shared resource.
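To illustrate the non-blocking part, here is a minimal sketch (kernel and buffer names are placeholders, not from your program): the launch returns to the host immediately, and only an explicit synchronization waits for the GPU.

```cuda
#include <cuda_runtime.h>

// Trivial placeholder kernel.
__global__ void busyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // This launch is queued on the default stream and returns immediately;
    // the host thread is NOT blocked while the kernel runs.
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);

    // Host code here would overlap with the kernel. Only an explicit
    // synchronization (or a blocking call such as cudaMemcpy) waits.
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```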
You should be able to gain insights into what is going on by using the CUDA profiler. Have you tried it?
One reason I know of to break a workload up as you are describing is to enable copy/compute overlap. The other reason I can think of is if you have multiple GPUs. No idea if you have done that or if it is even relevant for your code. But if you haven’t designed for copy/compute overlap, and/or don’t have multiple GPUs to feed, I would reiterate what njuffa said: breaking up a kernel into N pieces is not a good idea.
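For reference, copy/compute overlap looks roughly like the sketch below (names, sizes, and the kernel are placeholders; the host buffer must be pinned, e.g. allocated with cudaHostAlloc, or the async copies will not actually overlap): each slice gets its own stream, so the H2D copy of one slice can run concurrently with the kernel of another.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n); // placeholder kernel

// Sketch: split the dataset into nSlices chunks, each issued into its
// own stream so copies and kernels from different slices can overlap.
void launchSlices(float *hPinned, float *dBuf, int sliceLen, int nSlices) {
    for (int s = 0; s < nSlices; ++s) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        float *h = hPinned + s * sliceLen;
        float *d = dBuf + s * sliceLen;
        size_t bytes = sliceLen * sizeof(float);

        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        process<<<(sliceLen + 255) / 256, 256, 0, stream>>>(d, sliceLen);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

        // Destroying a stream with pending work is safe: the work
        // already queued still runs to completion.
        cudaStreamDestroy(stream);
    }
    cudaDeviceSynchronize(); // wait for all slices
}
```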
Each thread gets a seed that it uses to generate data, which it sends to the GPU for processing. In single-threaded mode, the same thread finishes the first seed, moves to the second, and so on. On the computation side, the threads should be completely independent aside from RAM access and, of course, GPU usage.
How can I use Nsight or another tool to see all the functions my process runs on the GPU? It would be really useful to see creation and deletion times along with identifiers.
There may well be some other resource I don’t know about yet that is blocking, but being able to see a history of what is running on the GPU would help. I would expect a shared resource to make the parallel version somewhat slower than 1/N of the serial time, but going slower than serial processing makes me think something is blocking.
In my program, each thread has its own completely separate input dataset but is making its own kernel calls. Are you saying that the GPU can’t handle multiple requests like that and ends up blocking invocations it isn’t ready for?
Nsight Systems is the tool you want to start with
here is an introductory blog/tutorial
If you have questions about how to use the profiler, I suggest asking those on the Nsight Systems forum
I started up Nsight Systems and found that when I run with multiple threads, each thread spends a large amount of time in a blocked state; the timeline shows it constantly spiking into blocked.
After a brief look around, I think this is because each thread is calling cudaMemcpy through the default stream. So I need to figure out a way to use multiple streams? I actually want the CPU to wait until the GPU returns a calculation, so an async memcpy isn’t what I’m looking for (I think), unless that is what allows me to use multiple streams.
 Yes, you would want to use multiple non-default streams
 Yes, you would want to use async memcpys along with the multiple streams
 No, this won’t increase performance unless a single-threaded version is not able to fully exploit available GPU resources. You could profile the single-threaded version to find out whether that is the case or not.
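Roughly, the pattern looks like this (a sketch only; the kernel and buffer names are placeholders, and hIn/hOut are assumed to be pinned host memory): each CPU thread issues async copies and a kernel into its own non-default stream, then blocks only on that stream. The thread still waits for its result, but without serializing against the other threads’ streams.

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n); // placeholder kernel

// Called by each CPU thread with its own slice and device buffer.
void perThreadWork(const float *hIn, float *hOut, float *dBuf, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream); // non-default stream for this thread
    size_t bytes = n * sizeof(float);

    cudaMemcpyAsync(dBuf, hIn, bytes, cudaMemcpyHostToDevice, stream);
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(dBuf, n);
    cudaMemcpyAsync(hOut, dBuf, bytes, cudaMemcpyDeviceToHost, stream);

    // This thread waits for its own stream only; other threads'
    // streams are unaffected.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```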
I tried turning on per thread default stream and it tanked my application. Interesting.
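In case it matters, here is how I enabled it (file name is a placeholder; these are the two ways I found to turn it on):

```
# Compile-time flag for the whole translation unit:
nvcc --default-stream per-thread app.cu -o app

# Or, for the runtime API, define this before including cuda_runtime.h:
#   #define CUDA_API_PER_THREAD_DEFAULT_STREAM
```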
Why do I need to async memcpy with multiple streams?
What are the metrics I should be looking at to tell whether or not the GPU is being underutilized by a single thread?
BTW for people doing multithreaded kernel launches, I strongly encourage them to use the latest released CUDA version (currently 11.4.1). Recent CUDA versions have had some incremental improvements in this area, here is an example.
Forum threads are no substitute for tutorials and/or working through examples with the assistance of documentation. Maybe the following can get you started in the right direction:
That very page is where I found out that I needed to use multiple streams, but using a per-thread default stream still does not prevent the blocking.
All the RTFM in the world does not help finding the right manual sadly.