acc_get_device_type in multi-GPU single host simulation


during the startup of my multi-GPU simulation on a single host (e.g. two GPU cards in one server) I call acc_get_device_type(). There seems to be no issue unless I run it through the debugger, which runs all the ranks in lockstep (I am using TotalView). I get the impression that there is a race condition: both ranks call acc_get_device_type() at exactly the same time and the simulation stalls. The only way around it is to set a couple of breakpoints and then step the ranks through this code individually, one after the other.
This is not very convenient, and I am wondering whether my understanding is correct and this is simply a limitation of the driver/runtime.


If this is multithreaded à la OpenMP or pthreads, then, depending on what TotalView means by "lock-step," there could be a problem. Each thread tries to initialize the device(s) in a call to acc_get_device_type(), in order to count them. To make sure that all threads get the same device contexts, they enter a critical section which is protected by a spin-wait lock. Depending on what TotalView does to make the threads run in lock-step, this could well cause a problem.
If this is an MPI program, then I am puzzled. MPI ranks do not share memory, and each rank must initialize each device individually. In that case, I really do not know what might be causing the problem. Since the ranks don't share memory, even if a rank enters the spin-wait lock, its thread is the only one contending for that lock, so it will exit the spin-wait loop immediately.

Hi Mat,

I am running two MPI ranks, each of which talks to one GPU only. In TotalView, control is set to group, so both ranks should run simultaneously (lock-step), but I have no idea how this is implemented in TotalView.

But maybe I am doing something wrong in my setup, where I query how many GPU devices each rank sees on the server it is running on. With this information I check that each rank can be assigned exactly one unique GPU device (currently I don't support more than one GPU device per rank, as every rank is responsible for one subdomain) and then set the device number for each rank.

To retrieve the initial information about how many devices are seen by each rank (afterwards I have to do some sorting, because there can be more than one rank running on a single server), the following code is executed by each rank:

#ifdef _OPENACC
    /* Query the current device type and how many such devices this rank sees. */
    acc_device_t dev_type = acc_get_device_type();
    int num_devs = acc_get_num_devices(dev_type);
#endif

acc_set_device_num is called later by each rank, once I have figured out which rank gets which device on which host.
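For concreteness, a hedged sketch of how that later step might look (my own illustration, not my actual code; `mydev` is a hypothetical variable holding the device number this rank was assigned):

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: bind the calling rank to its assigned device.
   Outside of an OpenACC build this compiles to a no-op. */
void bind_my_device(int mydev)
{
#ifdef _OPENACC
    acc_device_t dev_type = acc_get_device_type();
    acc_set_device_num(mydev, dev_type);
#endif
}
```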

Do you see a problem with this approach?


Hi LS,

Unfortunately, I’m not going to be much help on Totalview. As Michael stated, there shouldn’t be anything in the acc_get_device_type routine that would cause things to block in an MPI process. Maybe with OpenMP, but not MPI.

Do you see a problem with this approach?

What I like to do is gather the hostids from all the processes so I can determine how many processes are running on each node. Then get the number of devices and round-robin which process on the node gets which device. While it is written in Fortran, you can see an example under "Step 1" of a PGInsider article I wrote a few years ago:
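A C sketch of that scheme (my own illustration, not code from the article): each rank gathers every rank's host id, e.g. with MPI_Allgather of gethostid(), and its node-local index is then simply the count of lower-numbered ranks on the same host; that index modulo acc_get_num_devices(dev_type) is the device number to pass to acc_set_device_num. The local-index step alone looks like:

```c
/* Illustrative helper (assumed names, not from the article): given the
   host ids of all ranks, as gathered with
   MPI_Allgather(&myhost, 1, MPI_LONG, hosts, 1, MPI_LONG, MPI_COMM_WORLD),
   return this rank's node-local index. */
int local_index(const long *hosts, int nranks, int rank)
{
    int local = 0;
    for (int i = 0; i < rank; ++i)
        if (hosts[i] == hosts[rank])
            ++local;   /* one more lower-numbered rank shares my host */
    (void)nranks;      /* kept only to document the calling convention */
    return local;
}
```

Each rank would then call something like acc_set_device_num(local_index(hosts, nranks, rank) % num_devs, dev_type).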

  • Mat