I’m new to GPU programming, so forgive me if this sounds naive. I have some multithreaded C++ code that takes 1-4 independent data channels and runs some GPU operations on each of them in parallel (one host thread per channel), like so:
for (auto &channel : channels) {
    // copy data from host channel to a thrust::device_vector
    // do some thrust::transform calls on it
    // copy back to host
}
I compile my program with --default-stream per-thread. The thread for each channel pushes a single vector of floats to the GPU, 300,000-400,000 elements long, and then creates 3-4 more vectors of the same size within its loop iteration.
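
To make the structure concrete, here is a stripped-down version of what each thread does (Channel, process and the negate functor are placeholders, not my real code):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <functional>
#include <thread>
#include <vector>

struct Channel { std::vector<float> host_data; };

void process(Channel &ch) {
    // host -> device copy of ~300,000-400,000 floats
    thrust::device_vector<float> d_in(ch.host_data.begin(), ch.host_data.end());
    // one of the 3-4 same-sized temporaries created per iteration
    thrust::device_vector<float> d_out(d_in.size());

    // stand-in for the real per-channel math
    thrust::transform(d_in.begin(), d_in.end(), d_out.begin(),
                      thrust::negate<float>());

    // device -> host copy
    thrust::copy(d_out.begin(), d_out.end(), ch.host_data.begin());
}

int main() {
    std::vector<Channel> channels(4, Channel{std::vector<float>(350000, 1.0f)});
    std::vector<std::thread> workers;
    for (auto &channel : channels)
        workers.emplace_back(process, std::ref(channel)); // one thread per channel
    for (auto &t : workers) t.join();
}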
I would expect the parallel loop to mean that my code doesn’t get linearly slower with the number of channels, but it seems to: the more channels I have, the longer the Thrust calls in the loop (thrust::fill, thrust::transform and thrust::copy) take.
Why is this? Is it because --default-stream per-thread doesn’t work with Thrust out of the box? Is it because the amount of data I’m handling might be enough to saturate the GPU’s resources (this is on a Jetson Xavier AGX), and if so, is there anything I can do about that (e.g. a custom Thrust allocator)? Or is it something else entirely?
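
In case it clarifies the question: one thing I considered trying is creating an explicit stream per channel and passing it to Thrust through an execution policy, rather than relying on the per-thread default stream. An untested sketch of what I mean:

#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>

void transform_on_stream(thrust::device_vector<float> &d_in,
                         thrust::device_vector<float> &d_out) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // run the algorithm on an explicit, per-channel stream
    thrust::transform(thrust::cuda::par.on(stream),
                      d_in.begin(), d_in.end(), d_out.begin(),
                      thrust::negate<float>());

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

And by a custom Thrust allocator I meant something roughly like the following, where device_vector temporaries would come from a preallocated pool instead of hitting cudaMalloc/cudaFree on every construction (the pooling itself is elided here; as written this just falls through to the default behaviour):

#include <thrust/device_vector.h>
#include <thrust/device_malloc_allocator.h>

// sketch: a drop-in allocator that could hand out slices of a
// preallocated pool instead of calling cudaMalloc/cudaFree each time
template <typename T>
struct pool_allocator : thrust::device_malloc_allocator<T> {
    using super     = thrust::device_malloc_allocator<T>;
    using pointer   = typename super::pointer;
    using size_type = typename super::size_type;

    pointer allocate(size_type n) {
        // ...take a slice from a preallocated pool here...
        return super::allocate(n);
    }
    void deallocate(pointer p, size_type n) {
        // ...return the slice to the pool here...
        super::deallocate(p, n);
    }
};

using pooled_vector = thrust::device_vector<float, pool_allocator<float>>;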