Speed problem on GTX 295 cards

I noticed what I think is some weird behaviour today during a run. I have two GTX 295 cards in one computer, so there are four devices I can use. I started two runs on two different devices; the runs use the same settings and code, so they should take about the same time to finish.

The behaviour that caught my attention was that one kernel seemed to be “waiting” on the other in some way. I noticed this from the output the programs write to the console: they seemed to progress in sync, so to speak. The program I’m using passes execution through different environments. One part is in Java; from Java (over JNI) the CUDA part is called, and when it’s done on the GPU, control goes back to host CUDA code and then Java code again, and this repeats many times. Each time the flow returns from the GPU, some information is written to the terminal. This information was always written at the same time by the two different runs, even though each run was on a separate GPU device.

To check whether it was a coincidence, I changed one of the runs to use only 20,000 rows as input while the other run still used 100,000 rows. In this particular problem the input data is read from a file and used as input for the calculation on the GPU. The number of input rows is (or should be) proportional to the kernel’s execution time, so the smaller run should ideally have been about five times faster. But even after I changed to 20,000 rows, the two runs progressed in sync, at the same speed. It “feels” like the kernels have to wait for each other before either can continue. This happens even when I use two physically different GTX 295 cards: since I have two of them, I can use device 0 and device 3 for example, but the behaviour is still there.

To me this behaviour does not seem logical at all, but perhaps there is some logical explanation for it? Is there some mechanism in the cards that makes this behaviour necessary?

The only other thing I can think of that could be the cause is the graphics driver. Because the driver is the same for all cards, the problem could possibly come from there. I’m using Linux (Gentoo) for development.

The two GPUs on a 295 are independent when they’re running kernels. But they DO share a single PCIe link to the CPU, and on that connection they can, in theory, interfere with each other.

So one hypothesis is that your problem is PCIe bandwidth limited. That would produce exactly this effect: one GPU sits idle while the other is busy transferring data, waiting for the PCIe link to free up.

An easy way to sanity check this is to figure out roughly how much data your kernels transfer via PCIe per second. If it’s over 1 GB/s, you’re starting to get near PCIe limits, which are roughly 4 GB/s in theory but vary in practice. (You can still be PCIe bandwidth limited even well below the maximum PCIe speed… it’s more about the ratio of compute time to PCIe time. If your kernel spends more time transferring data than computing, you’ll start seeing collision effects and inefficiencies even when your net transfers aren’t big.)

However, this theory would not hold if you get the same effect with two physically separate cards… though there it could be a motherboard PCIe bottleneck (less likely). Either way, think about PCIe bottlenecks… they explain most multi-GPU inefficiencies.

Thanks for the answer. I timed a single run and it took ~4 minutes. Then I started two runs, one on device 0 and one on device 3, and it took about 8 minutes for them to finish. It seems that if I use X devices for X runs, it takes X times longer to finish, as if one kernel can’t finish before another kernel is done even though they aren’t on the same device. Something is seriously wrong here, because then there would be no point in having more than one device per computer. I wonder if something in my code causes this, perhaps all runs are for some reason executed on the same device (I’m using cudaSetDevice(Y) to set the device once per run), or there is some other error. Otherwise it has to be an OS, driver, or hardware problem, which leaves quite a few possible options…

Are you sure each program instance is really running on a different GPU? It sounds awfully like your two program instances are just timeslicing the same GPU. A GPU temperature monitor program should tell all.

If you are using Linux, try using nvidia-smi to mark each device you want to use as compute-exclusive, and remove the explicit device selection from your program code. The driver will then ensure that only one program instance can connect to any GPU at a given time.

Looks like you have the same problem as this guy. Could it be that the GPUs are also updating the display, which could be costly?

Thanks a lot, you made my day. I used nvidia-smi to mark each device as compute-exclusive, and then things started to work as I expected. One run took about 4 minutes, but more importantly, when I launched two runs it took just slightly more than 4 minutes for both to finish. I’m not entirely sure what went wrong before, but somehow the runs must have used the same device, and it probably had something to do with the “flow” of my program.

I use Java heavily in my program, and it “contacts” the CUDA code over JNI. After the kernel has executed, values are passed back from the CUDA code to the Java code. The Java code operates on the data and then sends it back over JNI into the GPU. So, to sum up, the kernel is run many times during the lifetime of the Java process (what I call a run). I called cudaSetDevice(Y) in a code path that was only executed the first time the kernel was entered. It seems, and this is me guessing, that that information disappeared, so on the second and all subsequent kernel entries there was no information about which device to use (because cudaSetDevice was only run on the first kernel entry, which I thought was sufficient).

Perhaps, if no specific device is chosen with cudaSetDevice, it defaults to 0. So what happened was that each run used the same device from the second kernel entry onward.

This is basically just me guessing, but it feels logical, and it would be interesting to hear if anyone else has input on my reasoning.

cudaSetDevice sets the device for the current thread, and it’s a one-shot initialization: once a CUDA context has been created for a thread, you can’t switch it to another device. So the right way to do multi-GPU is to have your host spawn one thread per GPU; each thread then sets its device and does its work.

If you don’t set the device, it does default to 0.

If you call cudaSetDevice after the CUDA context has been created, it will return an error. You ARE checking every CUDA call for errors, aren’t you? [This always bites EVERYONE! You can’t be lazy!]

A new problem has occurred that I didn’t notice until now. nvidia-smi solved my problems as described above in this thread when running two runs on separate devices. I have two GTX 295 cards in this computer, so in a script that runs at boot time I have the following lines:

nvidia-smi --gpu=0 --compute-mode-rules=1
nvidia-smi --gpu=1 --compute-mode-rules=1
nvidia-smi --gpu=2 --compute-mode-rules=1
nvidia-smi --gpu=3 --compute-mode-rules=1

Everything works perfectly when I launch two runs, but when the third run is launched, the first two runs crash with an “unspecified launch failure”. Because there are two devices on each 295 card, it feels like there is a problem when the third device goes active, since it is on a physically different graphics card. Does anyone have suggestions on how to solve this, or on what may be causing it? The crash always occurs when the third run is launched.

Do nvidia-smi and the SDK deviceQuery example correctly enumerate your GPUs? I only have dual-GPU machines to test on here, so I can’t help you with 3 or 4 GPUs, I am sorry.

Yes, the SDK deviceQuery output looks correct, at least I think so. I pasted it here http://paste.org/pastebin/view/14103 (to avoid the spam here) if you want to take a look.

I’m not sure which test I can do with nvidia-smi, though?

Try this:

avid@cuda:~/code/hpl_dtrsm$ nvidia-smi -lsa

==============NVSMI LOG==============

Timestamp			: Tue Jan  5 17:51:24 2010

GPU 0:
	Product Name		: GeForce GTX 275
	PCI ID			: 5e610de
Failed to read GPU temperature!
	Temperature		: 0 C
GPU 1:
	Product Name		: GeForce GTX 275
	PCI ID			: 5e610de
Failed to read GPU temperature!
	Temperature		: 0 C

Ok, it looks like this:

GPU 0:
Product Name : GeForce GTX 295
Serial :
PCI ID : 5eb10de
Temperature : 49 C
GPU 1:
Product Name : GeForce GTX 295
Serial :
PCI ID : 5eb10de
Temperature : 53 C
GPU 2:
Product Name : GeForce GTX 295
Serial :
PCI ID : 5eb10de
Temperature : 48 C
GPU 3:
Product Name : GeForce GTX 295
Serial :
PCI ID : 5eb10de
Temperature : 50 C

OK, so it probably isn’t anything on the driver side. Time to go back and look more closely at your code, I guess…

Yes, you might be right, perhaps it is something in the code. It just feels a bit weird that this behaviour would come from a problem in the code. It works perfectly with two runs but crashes when I launch the third, so it seems possible that it has something to do with starting a new run on the other 295 card. If it tried to start a third run and I had only one 295 card, a crash would seem logical, because there would be no available device to start on. So if it for some reason tries to start run 3 on the same graphics card as the two already running, this problem might occur. I have no idea why it would try to start the third one on the card that is already occupied, though.

What has to happen for a device to be flagged as available for new tasks, so to speak? Now that it is set to compute-exclusive mode, the system has to know when the current GPU execution is done so the device can be used by another process. I ask because my program enters and exits the kernel many times (it starts in Java code, goes to CUDA over JNI, back to Java, back to CUDA, and so on). So the device has to stay “reserved” across all this entering and exiting, and not be flagged as available after the first kernel exit. I assume this is the case, because I only copy the needed data once, before the first kernel entry, and that data is used by all subsequent kernels. I’m not sure how the system keeps track of this; for it to work, the device has to be “locked” by the Java process, and only after the Java process dies can the device be released and used by something else?

I don’t have any good ideas on how to find the error here.

Compute-exclusive mode only allows one host thread at a time to hold a CUDA context on a given GPU. So it isn’t kernel launches that determine whether a card is free; it is a host thread holding a context. You should be able to verify this by writing an “empty” runtime API program that, in pseudo code, does this:


sleep(some minutes);


You should be able to launch one of those for each compute-exclusive GPU. If you launch more, the extras should fail with a “no CUDA device” error. If that works as advertised, you definitely need to look at how your code works. I can’t really be more specific than that, I am afraid; Java and JNI are things I know nothing about.

Thanks again for your input. I made a very simple program to see if it crashed as well, and to my surprise it did. I paste the code for this simple (and totally useless) program below. If I start one run of this code it finishes without any problem, but if I start a new run while the first one is still running, the first one crashes with an unspecified launch failure. I would be very grateful if someone could try this code on their computer (if they have more than one compute device) and launch at least two runs at the same time. Perhaps I’m doing something wrong in this simple code as well? Otherwise the problem should be somewhere other than the code.

The reason for the massive loop is just that I want the execution to stay on the GPU for a while.


#include <iostream>
#include <assert.h>
#include <cuda_runtime.h>

using namespace std;

__global__ void VecAdd(float a, float b, float c, float* kD) {
    // Busy loop with a discarded result, just to keep the GPU occupied.
    float d = 0;
    for (int i = 0; i < 100000000; i++)
        d += a + b + c + kD[8] + i;
}

int main() {
    float minarray[20] = {114,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20};
    float *keeperDevice;

    cudaError_t t1 = cudaMalloc((void**)&keeperDevice, 20*sizeof(float));
    assert(t1 == cudaSuccess);

    t1 = cudaMemcpy(keeperDevice, minarray, 20*sizeof(float), cudaMemcpyHostToDevice);
    assert(t1 == cudaSuccess);

    int cgd = 0;
    int cgdc = 0;

    t1 = cudaGetDevice(&cgd);
    assert(t1 == cudaSuccess);

    t1 = cudaGetDeviceCount(&cgdc);
    assert(t1 == cudaSuccess);

    cout << "cudaGetDevice: " << cgd << endl;
    cout << "cudaGetDeviceCount: " << cgdc << endl;

    VecAdd<<<128, 128>>>(5.0f, 6.0f, 7.0f, keeperDevice);

    t1 = cudaMemcpy(minarray, keeperDevice, sizeof(float)*20, cudaMemcpyDeviceToHost);
    assert(t1 == cudaSuccess);

    cout << "Done";
    return 0;
}

That works as expected for me. I have two GTX 275s. With one in compute exclusive and the other in compute prohibited, I can launch and run 1 instance fine, trying a second produces an assert failure at the first memcpy, and the first runs to completion. Put both cards into compute exclusive and I can run two to completion without a problem. Both instances show they are running on different devices. I had to do that without X11 running on a pair of virtual consoles because of the watchdog timer, so you are going to have to take my word that it worked, but it did.

This is all on an AMD70FX machine running 64bit Ubuntu 9.04 with 190.18 release drivers and CUDA 2.3.

If I change to cudaSetDevice in my simple program and do not use nvidia-smi, I can run four different runs at the same time on different devices. Maybe there is some problem with nvidia-smi. If I recall correctly, I read that it was optimized for Tesla and that there could sometimes be problems when it is used on a GeForce card.

Just out of interest, are you running X11 or not? One problem when you are not running X11 is that the driver unloads itself after a while and the nvidia-smi settings get lost. The way around this is to run nvidia-smi in daemon mode, polling every few tens of seconds. That keeps everything loaded and preserves all settings. Then you might find your problems go away.

I have used nvidia-smi to maintain compute exclusivity with tens of stock GeForce cards in our cluster, and it has never failed to work correctly.

Ah, okay. I am not running X11, so that may be why it crashes now. If the driver unloads after a while, that could well be the cause. Thanks for all your input, I really appreciate it.