Different execution times on multi-GPU: 4 identical cards, different execution times

Hello,

I am trying to use a system with multiple GPUs and I am seeing some strange behaviour; I hope someone here can help me figure it out.

The machine I am running the code on has 4 C2050 cards. I am using OpenMP to manage the threads.

The code divides the work equally between the 4 of them: each thread calculates one part of a big array. All of this happens after some previous calculations that are done on all the GPUs (those results stay resident on each GPU and are reused later).

In each thread I measure, with gettimeofday, the execution time of the function whose work is divided. Two of the 4 GPUs give me one time and the other two give me a different one; it is as if 2 of the cards run faster than the other 2. Because of the 2 “slow” cards, the code takes almost the same total time whether I run it on 2 or on 4 GPUs.
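
To be clear about the structure, the timed part looks roughly like this (a simplified sketch with made-up names such as computeSlice; the real code is in the attached file):

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#include <cuda_runtime.h>

/* placeholder kernel standing in for the real computation of one slice */
__global__ void computeSlice(float *part, int n) { /* ... */ }

int main(void)
{
    int nGPUs = 4;
    omp_set_num_threads(nGPUs);

#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);              /* one GPU per OpenMP thread */

        /* ... previous calculations, results stay resident on this GPU ... */

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);

        computeSlice<<<256, 256>>>(NULL, 0);   /* this thread's slice */
        cudaThreadSynchronize();         /* wait so the timing is meaningful */

        gettimeofday(&t1, NULL);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                  + (t1.tv_usec - t0.tv_usec) / 1e3;
        printf("thread %d, device %d: %.2f ms\n", tid, tid, ms);
    }
    return 0;
}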

I don't really know what is causing this difference.

Any ideas?

I attach the source code
intentoMultiGPU.cu (10.8 KB)

Is the observed pattern the same over multiple runs? Do you see the same two cards taking more time? Could this be a power issue?

It sounds like you are establishing multiple contexts on a single device. Try using nvidia-smi to set each GPU to compute-exclusive mode first, then run again and see what happens. That should force each GPU to accept only one context. Personally I think OpenMP isn't a great choice for multi-GPU for this precise reason: maintaining correct thread-GPU affinity can be anything from difficult to impossible.

In 3.2, at least, that’s true. In future versions? Well… :)

Well, that is a bit of a tease…

It’s probably enough to say that we have a pretty keen understanding of where the pain points in host code are at the moment, so we’re going to fix them.

Jaideep: Yes, the results are the same over several runs.

avidday: I don't think I am using the same GPU twice, as I assign a different one to each thread. In fact, before executing the timed code I print the GPU used by each thread, and I get the 4 possible values (a different one for each thread).
I have never used nvidia-smi; could you please give some more tips on what you are suggesting I do? If you don't like OpenMP, what do you use, MPI?

Thanks to all for your help.

I didn't mean to imply you were running on only one GPU, only that you might not always be running on all 4. Certainly the "two fast" and "two slow" pattern sounds more like two threads have their own GPU and the other two are sharing a third. The problem could be that, while all threads might initially get their own contexts, most OpenMP implementations use persistent thread pools, so there is no guarantee that the thread-GPU affinity is correctly maintained right through the life of an application run. I use either MPI or pthreads for multi-GPU codes, depending on what the code does.
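
As a rough sketch of the pthreads alternative (placeholder names, not taken from your code): each worker thread is created for exactly one device, calls cudaSetDevice once, and does all of its work there, so the thread-GPU binding cannot drift:

#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

#define NGPUS 4

/* placeholder for the real per-GPU work */
__global__ void work(void) { }

static void *gpuWorker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);          /* context created here; this thread never
                                    touches any other device */
    work<<<1, 1>>>();
    cudaThreadSynchronize();
    printf("worker bound to device %d finished\n", dev);
    return NULL;
}

int main(void)
{
    pthread_t threads[NGPUS];
    int ids[NGPUS];

    for (int i = 0; i < NGPUS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, gpuWorker, &ids[i]);
    }
    for (int i = 0; i < NGPUS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}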

To use nvidia-smi, you can do something like this:

avidday@cuda:~/fimcode$ nvidia-smi -L -d
 GPU 0: (05E610DE:34CE1458)  GeForce GTX 275  (S/N: 154552075019)
 GPU 1: (06CD10DE:079F10DE)  GeForce GTX 470  (S/N: 6178347089)
avidday@cuda:~/fimcode$ sudo nvidia-smi -g 0 -c 1
avidday@cuda:~/fimcode$ sudo nvidia-smi -g 1 -c 1
avidday@cuda:~/fimcode$ sudo nvidia-smi -s
COMPUTE mode rules for GPU 0: 1
COMPUTE mode rules for GPU 1: 1

You need to make the -c call once for each GPU. If you don't run X11, then you need to do it right before running your code, perhaps something like

nvidia-smi -g 0 -c 1; nvidia-smi -g 1 -c 1; nvidia-smi -g 2 -c 1; nvidia-smi -g 3 -c 1;OMP_NUM_THREADS=4 myapp

because the driver will unload itself and lose the compute mode settings if the cards are idle for more than a few seconds without a client attached.

I changed to MPI and everything works fine now, thanks!
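
For anyone finding this thread later, the structure I ended up with is roughly one MPI process per GPU, along these lines (a simplified sketch, not my actual code):

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* placeholder for the per-GPU part of the computation */
__global__ void computePart(void) { }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nDevices;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&nDevices);

    cudaSetDevice(rank % nDevices);   /* one process per GPU; run with mpirun -np 4 */

    computePart<<<1, 1>>>();
    cudaThreadSynchronize();

    printf("rank %d used device %d\n", rank, rank % nDevices);

    MPI_Finalize();
    return 0;
}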