Different performance from different GPUs with Identical Code

I have a Linux box with three Tesla C1060s running CUDA 4 and the latest drivers. I’ve got a piece of code that runs fine on GPU0; however, when I attempt to run the same code with the same input (except for one input variable that changes between 0, 1, and 2 to indicate which GPU to use) on GPUs 1 and 2, it crashes with an unknown error. I thought the behavior should be identical. Literally the only thing that changes is the input to cudaSetDevice at the beginning of the code. Am I missing something? Has anyone else experienced this?


Is the deviceQuery program working?

Yes, deviceQuery registers all three devices GPU0, GPU1, and GPU2 as Tesla C1060. I can even run nvidia-smi when I first call the function and I can watch the memory and processor of the device being utilized until it crashes.

How are each of the GPUs powered? Your power supply unit (PSU) may have multiple rails where each of them may not exceed a certain wattage (for a PSU able to power three Teslas this is quite likely). So even if supplied from one PSU, one GPU might be underpowered while another isn’t.

That’s possible. I’ve left the office, so I won’t be able to check until Monday.

So I’ve nailed down exactly which line of code is causing the crash; however, I have no idea why it causes the crash on certain GPUs and not on others. To me it shouldn’t cause a crash at all.

I’m performing some dense-sparse matrix multiplication using a kernel function I wrote myself. Each thread calculates one row of the result matrix. I’ve created a single device array that is twice as large as the matrix. Since I’m performing multiple iterations of the multiplication, I begin with the dense matrix in the first half of the array and store the result in the second half. On the second iteration I multiply the result from the first multiplication, which resides in the second half of the array, by another sparse matrix, stored elsewhere, and store the result of this multiplication in the first half of the array. The result just keeps bouncing back and forth between the two halves of the array. The host function that issues the kernel call repeatedly just alternates between feeding in the first half as the input/second half as the output and feeding in the first half as the output/second half as the input.

If I comment out the line in the kernel function where the result is written to the output portion of the array the code runs without any errors (except that I’m now no longer saving my results).

Now I’ve run a bunch of tests and it always boils down to writing the result to somewhere in the array. I’ve verified that nowhere in the code is an input declared constant, so I have no idea what is going on. Also, note that the code runs perfectly fine on GPU0 but crashes on GPU1 and GPU2. I’ve also run the code on another Linux box with two GeForce GTX 560s. It crashes on GPU1 unless the same line of code is commented out. I can’t test it on GPU0 because that is currently running other code.

  1. Switched the output and input in the kernel call: the code crashes. Commenting out the line where the result is now written to the input portion of the array lets the code run without issue.
  2. Altered the host function calling the device function so that it doesn’t alternate which half of the array is the input and which half is the output. Both scenarios (first half as input/second half as output, and first half as output/second half as input) cause the code to crash unless the line that saves the result to the array is commented out.

Has anybody experienced anything of this sort? It’s completely bizarre and frustrating to me.

I am doing the same type of iterative process with ping-pong storage as you and have never experienced any problem. It’s very difficult to say anything without seeing the code. A blind guess: are you sure you’re not writing beyond the end of the array?

I have some experience using CUDA with multiple GPUs (I’m running 14 in total, distributed across several machines). Of the 14 cards I originally ordered (GTX 570, from EVGA), 4 tested positive for faulty memory. I used memtestG80 for this. These cards with bad memory were also yielding erroneous results in computations. I recommend checking the memory if you can.

Going to test this tomorrow morning…

Here’s the code in question. Like I said, one thread calculates each row of the output matrix. The dense matrix right-multiplies the sparse matrix, which is stored in compressed sparse row (CSR) format (sparse_values, sparse_rows, sparse_cols). nRows = leadDim = 55316, nCols = 600. A single 66379200-entry double array is declared, which for the purposes of this post let’s call Matrix. output and dense are set to either Matrix or Matrix + 33189600, and switch to accomplish the ping-pong data shuffling I described above. grid = (217,1,1), block = (256,1,1).

__global__ void sdMult(double *output, double *dense, double *sparse_values, unsigned int *sparse_rows, unsigned int *sparse_cols, unsigned int nRows, unsigned int nCols, unsigned int leadDim){

    unsigned int n = blockIdx.x*blockDim.x + threadIdx.x;

    if(n < nRows){

        long int offset = sparse_rows[n];        // offset for reading from sparse_values and sparse_cols
        int nValues = sparse_rows[n+1] - offset; // number of non-zero entries in this row
        double temp;
        double *input = dense;

        for(int j = 0; j < nCols; j++){          // loops through each column
            temp = 0;
            for(int k = 0; k < nValues; k++){
                // temp += sparse_values[offset+k]*input[sparse_cols[offset+k]];
                temp = fma(sparse_values[offset+k], input[sparse_cols[offset+k]], temp);
            }
            output[j*leadDim + n] = temp;
            input += leadDim;
        }
    }
}
I downloaded memtestG80 and ran it on all three Tesla C1060s. I didn’t run multiple iterations of the test because at 4096 MB each pass takes quite a while, but there were no errors in the tests I did run.

Have you run the code in question under cuda-memcheck? Results from both devices might be interesting.

As far as the power supply concerns I switched the cables powering the devices and got the same pattern of failures. Also, we specifically purchased this power supply because we blew out another one prior to this. This one should be supplying enough power to run all three simultaneously, we have done so in the past.

I ran cuda-memcheck. I get no errors when I select device 0, but on device 1 and device 2 I get an “Invalid global read of size 4” error. The error does not appear to have any pattern as far as the thread and block that trigger it: on device 1 I got errors in t32b98, t64b1, t192b76, t32b77, t32b15, t96b16, and on device 2 I got errors in t224b27, t32b14, t32b70, t32b0, t128b59, t32b15. The addresses that cuda-memcheck claims are out of bounds are also curious, as they seem too small to be out of bounds. On device 1: 0x00219880, 0x00201500, 0x00214300, 0x00214480, 0x00204c80, 0x00205180; and on device 2: 0x0020a780, 0x000001c8, 0x00212880, 0x00201080, 0x0020fe00, 0x000001c8.

Also, how can two Tesla C1060s be running the exact same code with the exact same input, and one thinks it’s accessing memory that is out of bounds while the other runs fine without any issues!?!

Thanks to everyone who’s taken a stab at this. I really appreciate your input. Hopefully someone out there can figure this out for me.

Have you tried using CUDA_VISIBLE_DEVICES to enable just one card at a time?
It will appear as device 0.

I don’t know why I hadn’t thought of that. I just did it, and yes, the code runs on the other two GPUs now. So what do I have to do differently so that when those GPUs are indexed 1 and 2 the code works? I thought all I had to do was “cudaSetDevice(deviceID);” with deviceID either 0, 1, or 2. It only seems to be working when deviceID is 0 though.



You may have allocations done only on the first device. Are you using constant memory?

No I am not using constant memory. I am only using global memory, which I thought would be allocated on the device specified in cudaSetDevice(deviceID).

So, is the only difference cudaSetDevice at the beginning of the code?
It is a very common use case and it is the first time that it seems not working properly.

If you need to run multiple instances at the same time, you could use the CUDA_VISIBLE_DEVICES variable or put the cards in exclusive mode with nvidia-smi.

The code is identical, the only difference in calling the code is a 0, 1, or 2 and the only place that input is used in the code is in cudaSetDevice as the deviceID. I want to run multiple instances so that I can essentially triple my throughput. That’s why I have three Tesla C1060s and not just one.

So I’m able to get all three GPUs running smoothly as long as I open three separate connections to the server and only enable a single GPU to be visible on each of the connections, but this is really a hack. I hope NVIDIA looks into why cudaSetDevice with an input other than 0 appears to be causing trouble in my case.

Thanks everyone for your help.

Which driver version are you using? I strongly recommend you use the one from the CUDA download page, not the one from the general download page (don’t be fooled by the higher number).

If using a good driver does not fix your problem, could you please file a bug with a small repro?
Otherwise, it will be difficult for us to find out the cause of your problem.

Putting the GPU in exclusive mode (nvidia-smi -c 3) as I suggested to you before, should simplify your setup.