I have a multithreaded application where each thread is detached and therefore operates entirely independently. It's very basic: each thread pulls some numbers out of an algorithm and then hands them to CUDA for some parallel processing. At the start of execution, the program divides its dataset into N slices, and I typically give the program N threads to work with, on the theory that this should reduce processing time by something approaching a factor of N.
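Here's a stripped-down sketch of the shape of the code; `generateDataset`, `processSlice`, and all the sizes are placeholders rather than my exact implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr int N = 8;        // one detached thread per data slice
std::atomic<int> done{0};   // crude completion counter

// Stand-in for my CPU-bound number generator (off the hot path).
std::vector<float> generateDataset() { return std::vector<float>(1 << 20, 1.0f); }

// Hands one slice to CUDA; sketched further down.
void processSlice(const float* slice, size_t len);

int main() {
    std::vector<float> data = generateDataset();
    size_t sliceLen = data.size() / N;

    for (int i = 0; i < N; ++i)
        std::thread([&data, sliceLen, i] {
            processSlice(data.data() + i * sliceLen, sliceLen);
            ++done;
        }).detach();

    while (done.load() < N) { }  // spin until every detached thread finishes
}
```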
However, when I time my program, I find that giving it only one thread beats one thread per data slice 100% of the time: a single thread processes every data slice about 24% faster than the multithreaded runs.
Given that the data-generation algorithm is entirely CPU-bound and far removed from the hot path, this makes me think the kernel calls themselves are blocking: when one thread asks the GPU for a computation, it blocks every other thread until the result comes back, so all the threads bottleneck around the GPU interface. That's the only explanation I can think of for the parallel version taking not just as long, but longer.
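For reference, here's roughly what each worker thread does on the GPU side; `myKernel` and its math are placeholders, but everything is launched into the default stream, which is where I suspect the serialization happens:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder kernel; the real math doesn't matter for the question.
__global__ void myKernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void processSlice(const float* slice, size_t len) {
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn,  len * sizeof(float));
    cudaMalloc(&dOut, len * sizeof(float));
    cudaMemcpy(dIn, slice, len * sizeof(float), cudaMemcpyHostToDevice);

    // No stream argument, so every thread launches into the default stream.
    myKernel<<<(len + 255) / 256, 256>>>(dIn, dOut, len);
    cudaDeviceSynchronize();  // this thread blocks here waiting on the GPU

    // ... copy results back with cudaMemcpy, then cudaFree both buffers ...
}
```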
Is this how it works? If so, is there any way to reduce how badly the threads bottleneck on the GPU interface? This is my first and only CUDA project so far, so beginner-level documentation or resources are appreciated.