Streams and multi-gpu

It would appear I have just shot myself in the foot!

Here was my situation two days ago. I have a vector of objects that do calculations. There are as many objects in the vector as there are GPUs in the system. Each object gets constructed with one of the parameter being the device id and within the constructor there is a call to cudaSetDevice().

These objects are now ready to work.
The work comes in the form of a for loop that may have thousands of work elements. To spread the work across GPUs, I start an openMP loop with the team size equal to the number of GPUs, and do something like gpuWorkers[openMpThreadId]->doWork(stuff[loopIterator]).

Each doWork will call an unknown amount of kernels, run for an unknown amount of time. Each doWork is independent from all the others running. This worked well and gave me what I wanted.

Lately, I have worked on “streamifying” everything in my code. The constructor of the worker objects now creates a private stream. Every kernel call takes the stream as its fourth configuration argument and every memory copies have been changed to the async version. The goal was to have more than 1 worker per GPU. It works without a hitch on a single GPU machine, but breaks down on a multi-gpu one. I have tracked down the issue to the (unknown to me at the time!) fact that the streams are not implicitly assigned to a device and every kernel call is sent to the “active” device (i.e. the device from the last call to cudaSetDevice). If the device the stream was created on is not the active device and a kernel call is made, it crashes.

Now, I don’t see how I can make this work. Even if I was to go through every kernel call and add a cudaSetDevice() call prior to the kernel call, I could run into a situation where a thread is interrupted after calling cudaSetDevice but before doing the kernel call by another thread that also calls cudaSetDevice.

Does anyone see a way out of this? I was thinking along the lines of each worker belonging to a given device having their own context (which I assume is not the case currently, but maybe it is).

(Also, I don’t understand why I did not have this problem in the version without stream.)

Edit : I guess something I could do is put an #pragma omp critical{} around the cudaSetDevice() and the kernel call.
I feel like there should be something more elegant than that possible…

How many machines/ hosts - just one?

Would multiple threads on the host not be an option - one per device, to stick with the particular device and its context, and to monitor it…?

Yes this can be scoped for a single node with N GPUs.

My impression was that the OpenMP work team was spawning threads under the hood as part of the

#pragma omp parallel for num_threads(numCudaDevices*numStreamsPerDevice)


This however gives me, presumably, more than 1 thread per device if numStreamsPerDevice is larger than 1. However, even by forcing numStreamsPerDevice to be 1, effectively having one OpenMP thread per device, it still crashes.

Which is kinda why I mentioned that I now don’t understand why my solution worked before this whole streamification. Before, I had one (implicit) stream per device, the OpenMP work team would have as many workers are there are GPUs. I would do one call to cudaSetDevice at the beginning of every doWork and that call would “stick” for every kernel call within that OpenMP thread running through doWork.
Now, just by the fact that I explicitely create a stream and give it to the <<<>>> kernel call, even if there is only stream per device (i.e. the same situation as before except it is not stream0), it crashes.

I kind of feel like I’m missing something simple (or is it that I’m hoping for it :)? )


“Now, just by the fact that I explicitely create a stream and give it to the <<<>>> kernel call”

and you are confident that you have created the streams (for each device) correctly…?

exactly where does it crash (ito an instruction line)…?

I’m not aware that I have to do anything more than just createStream(cudaStream_t*& cudaStream).
It works well when I run the solution on a single-gpu machine, as all calls to cudaSetDevice target the same device and thus it is always the current device.
For multi-gpu machines, it crashes at the kernel call. The programming guide is very specific that this is the normal behavior if a kernel is called using a stream that has been creates while another device was active. So I am not surprised that it crashes, but I’m trying to figure out how to avoid it.

For the nvidia people, maybe attaching the device to the stream would be a good idea!

"For multi-gpu machines, it crashes at the kernel call. The programming guide is very specific that this is the normal behavior if a kernel is called using a stream that has been creates while another device was active. "

True, this is certainly a cause that can cause a kernel launch failure, but at the same time, it is not the only cause that can result in kernel launch failure

“I’m not aware that I have to do anything more than just createStream(cudaStream_t*& cudaStream).”

You need to create the number of streams per device as you require, and you need to reference/ use the applicable streams at devices’ kernel launches
Thus, you should eventually create a number of streams on a number of devices
You should be able to create all streams per each device beforehand, and store the data in an array (of streams), by cycling through setDevice; it would then merely be a case of correctly passing the applicable stream data to devices on kernel launches afterwards
Hence, ensure that you are creating the necessary streams per device correctly, and that you are forwarding the stream data to devices for kernel launches properly

If you find OpenMP limiting in this regard, perhaps consider creating the host threads yourself; it should provide you with more freedom to take care of the necessary overhead

That is what is happening, unless I am misunderstanding what you are saying. I’ll try to elaborate a bit more.

The instance of one worker is encapsulated in an object.
Each object has a cudaStream_t attribute. At first, there is a round of initialization for all the N workers, where N is the product of the number of devices in the node and the number of workers to run on one device (i.e. the number of streams that share a gpu). To initialize itself, each worker gets a device identifier. The first things it then does are cudaSetDevice(deviceId); cudaStreamCreate(&myStream);.
Thus, after the initialization phase, I have a pool of workers, each with its own private stream object and said stream has been created after a call to cudaSetDevice().

For the “work” phase, an OpenMP for loop is launched with as many threads as the size of the worker pool (streamsPerDevice*numDevices). The “work” consists of a few tens to a hundred kernel calls. Here lies the problem that i am not doing a cudaSetDevice() prior to every kernel call. It then results that e.g. two OpenMP threads, that happen to be “linked” to two workers belonging to two different devices, will be switch to. Kernel calls will be issued, but these kernel calls will have been issued to two streams having been created on two different devices. One of those calls will crash the application due to the fact that the “current” device is not “its” device (the device on which the stream has been created.

Quite frankly, unless I am missing a simple runtime option that I have simply not turned “on”, I find it rather odd that the cudaSetDevice is not handled internally to the kernel call on which a stream argument is passed.

It is not enough to simply call the kernel with the correct stream. The device the stream has been created on has to be set as the “active” device prior to the kernel launch being issued. I don’t see how this is possible without making the “setting of the active device” (cudaSetDevice) AND the kernel launch one atomic operation. If this is not an atomic operation, there is always a chance that the thread will be preempted after doing its cudaSetDevice but prior to issuing the kernel launch, just like any race condition.

I feel like (but do not know for sure) this would not be an issue if I had one process per device, and 2 streams per process. The runtime would sort it out. If this is true, then I am missing some kind of container/separation that would tell the runtime e.g. “these threads always have device 0 as active” and “these threads always have device 1 as active”, eliminating the need for all of those cudaSetDevice prior to the kernel launch.

“For the “work” phase, an OpenMP for loop is launched with as many threads as the size of the worker pool (streamsPerDevice*numDevices).”

I would argue the case for a 1 to 1 ratio between the number of host threads, and the number of devices, rather than a many to 1 ratio (between the number of host threads, and the number of devices)
If you assign a host thread to a device (instead of a host thread to a device worker object/ task), and if you sub-divide the worker pool among the devices - and thus the host threads - you may have slightly more overhead, and may need to throw out OpenMP because of limitations, but i am confident that you would actually be able to accomplish what you wish to accomplish, as the device (setDevice) would remain constant for each host thread and its lifetime, and consequently also the kernels each host thread launches

Hello again!
Indeed that is what I had before and it worked fine. It however only worked when I didn’t explicitly create and use streams, i.e. everything was done on each device’s stream0 and no fourth argument to the <<<>>> construct.

However, now that streams are part of the equation, even if I force the number of streams per device to be 1, effectively ending up with what you suggest (1 host thread per device), it still fails. The simple act of not using the default causes it to crash, which I find odd.

The whole point of this experiment is to use more than 1 host thread per device. The work is so small that 1 host thread cannot use a gpu fully and I need concurrent kernels to be efficient.

Thanks for indulging me!

I think you are missing me

Let me try to put what I am saying in pseudo code of some sort
Lets forget about OpenMP at first; you can later determine whether you should still use it

Given x devices,
count c = worker objects / devices
offset o = sum of preceding counts
create x threads on the host (not the device)
pass unique device number, z, and count, c, with offset, o, to host thread

within each host thread:
set device to device z (based on device number passed to host thread)
create y streams on device z
launch count c worker objects, starting at offset o, as kernels in y streams of device z

I know that this works

I must thank you for your insistence that it should work the way you described. We did not agree on terms but what you described is what I had in the code, and what I also thought should work.
It forced me to establish a simpler test bench to convince myself that it does indeed work the way we both described. And it did. So I went back to the complete solution and looked for my bug.

Thanks again!