Thrust functions in parallel CPU threads get linearly slower with more threads

michael.friedman · September 24, 2021, 6:20pm

I’m new to GPU programming, so forgive me if this sounds naive. I have some parallel C++ code that takes 1-4 independent data channels, and tries to do some GPU operations on them in parallel like so:

for (auto &channel: channels) {
      // copy data from host channel to thrust device vector
      // do some thrust::transform calls on it
      // copy back to host
}

I compile my program with --default-stream per-thread. The thread for each channel pushes a single vector of floats to the GPU, of length 300 000 - 400 000, and then creates 3-4 more vectors of the same size within its loop iteration.

I would expect the parallel for loop to mean that my code doesn’t get linearly slower with the number of channels. But it seems to - the more channels I have, the more time that the various thrust functions I use in the loop (thrust::fill, thrust::transform and thrust::copy) take.

Why is this? Is it because using --default-stream per-thread doesn’t work with Thrust out the box? Is it because the amount of data I’m handling might be enough to saturate the GPU’s resources (this is on a Jetson Xavier AGX), and if so is there anything I can do to change that (e.g. a custom Thrust allocator)? Or is it something else entirely?

Robert_Crovella · September 24, 2021, 6:34pm

I don’t know what the basis for that is. The data I see here suggests that a single channel could fully utilize a GPU. If a single channel fully utilized the GPU, why would adding more channels not result in a linear slowdown?

njuffa · September 24, 2021, 6:50pm

Specifically, from the description provided it appears that a single channel is able to completely utilize the most limiting GPU resource (a.k.a. bottleneck), which is either PCIe bandwidth or memory bandwidth. In such a scenario, having other GPU resources (such as execution units) that are not yet fully utilized with a single channel doesn’t provide any benefits when scaling up to multiple channels.

I would suggest using the CUDA profiler to unambiguously identify the bottleneck(s) here. From the rough description provided, it seems that is data movement. If that is confirmed by the profiler, the goal would be to minimize data movement. Maybe some of the data is compressible in a fairly trivial way?