cudaSetDevice seems to have no effect (cudaSetDevice question)

I am experiencing strange behaviour in my Fortran-MPI-GPU program and I
would like some advice about it.

  1. Schematically, the program is a big loop where the most time-consuming computation is done
    on the CPU or the GPU, depending on the process (proc).

  2. The program splits the computation across many parallel MPI processes (procs).

  3. Each proc checks whether a GPU device is present, and with a small routine I can assign
    the corresponding GPU to it, if present.
    To do that I use cudaSetDevice(), and the GPUs are released at the end of the program.
    The program was tested with many GPUs on multi-core and/or multi-node platforms and it works fine.

  4. Now I have written a kernel to use the GPU in the second most time-consuming routine as well.
    By mistake, I let every proc call the new kernel (including those with no GPU assigned).

    I was astonished that everything worked fine.
    I expected that, because I had set the device, the other procs would give me an error, but no:
    every proc used the same GPU and executed the kernel without problems and with correct results.

Do you have an explanation?


If you do not set the device explicitly, it will use device 0, or the first CUDA-capable device present.

You said earlier that the device wasn’t being selected. If you don’t set the device, doesn’t your application use the CPU? Wouldn’t that be the reason it is working?

Thanks for your reply. I probably did not explain my problem well.

I will try again to formalize it better (I hope).

4 procs. Proc 1 is associated with the only GPU (by setting the variable devok=1; for the other procs it is set to 0).

Proc 1 also sets the device (cudaSetDevice).

The first kernel is executed only by proc 1 (through an if on devok), while the other procs do an alternative computation on the CPU.

When the next kernel comes, I don’t check devok, so all procs are allowed to call the kernel.

All procs run on the GPU without problems. Why?

I set the device on proc 1 only, so normally the other procs should not be able to run on it.


It is because, if you don’t set the device, a process will run on device 0 or the first CUDA-capable one present. It seems that you have 4 processes on the same machine. In the second iteration, processes 2, 3 and 4 will use the device without problems. The problem that may occur is that, in the second iteration, processes 2, 3 and 4 will not have the necessary data allocated on the device, but this will depend on the kind of problem you’re solving.

I set the device before the first kernel, from proc 1, and the device is freed only at the end of the program.

If I understand correctly, the card is not associated with the process, as I believed, but with the hostname. Is that right?

This could be dangerous on multicore machines.

My second computation could fall into the latter case you mentioned: not enough memory for the data from all processes.

Do you know a simple solution, or do I have to set and unset the device every time (even when the total allocated memory < global memory)?

I’m not quite sure; I have to dig a little more.

What are you using to create and manage the processes? fork, MPI?

I use MPI.
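
Since MPI is in use, one common approach is to decide the device once per process from its node-local rank and then guard every kernel launch with that decision, not only the first one. The current device is a per-process setting, so a cudaSetDevice call made by proc 1 does not restrict the other procs; a proc that never sets a device simply uses device 0 by default. The sketch below is in plain C and only shows the selection logic. The function names and the USE_CPU sentinel are hypothetical, local_rank is assumed to be 0-based (for example the rank in a communicator obtained with MPI_Comm_split_type and MPI_COMM_TYPE_SHARED), and device_count is assumed to come from cudaGetDeviceCount.

```c
/* Hypothetical sentinel: this rank gets no GPU and takes the CPU path. */
#define USE_CPU (-1)

/* Exclusive policy (assumption, mirrors the devok scheme in the thread):
   the first device_count ranks on a node each own one GPU, the rest use the CPU. */
int pick_device_exclusive(int local_rank, int device_count)
{
    if (local_rank < 0 || device_count <= 0)
        return USE_CPU;
    return (local_rank < device_count) ? local_rank : USE_CPU;
}

/* Shared policy (assumption): ranks are spread round-robin over the GPUs,
   so several ranks may share one card and its memory. */
int pick_device_shared(int local_rank, int device_count)
{
    if (local_rank < 0 || device_count <= 0)
        return USE_CPU;
    return local_rank % device_count;
}

/* Guard to test before EVERY kernel launch, playing the role of devok. */
int may_launch_kernel(int device)
{
    return device != USE_CPU;
}
```

With 4 procs and 1 GPU, the exclusive policy gives device 0 to local rank 0 and USE_CPU to the other three, which matches the devok=1/devok=0 setup described above. A rank whose result is not USE_CPU would pass that index to cudaSetDevice once at start-up. With the shared policy all four ranks get device 0, so the sum of their allocations has to fit in that card's global memory.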