Basic CUDA calls hang for some GPUs

Hi!

I am working on a project where we use CUDA for GPU processing. Everything worked fine before Christmas, but on returning to the project now we have run into a weird problem, without having modified any code at all. While trying to debug it, we found that it is not just a problem in our project, but rather a problem with CUDA in general on our machines.

The problem is that some CUDA function calls, such as cudaMalloc and cudaMemGetInfo, never return. This seems to happen for all functions that access the device in any way, while calls such as cudaSetDevice work fine.

We use CUDA 5.0 on several machines running 64-bit Ubuntu 11.10. X is disabled on these machines (we work over ssh), so nothing but our program should be running on the CUDA cards.
We have tried different cards (GTX 460, GTX 480, GTX 280 and NVS 295), and the problem only occurs on the GPUs with compute capability 2.x, i.e. the 460 and 480.

A basic C++/CUDA example is provided below. As you can see, we just try to allocate a 1-byte buffer on the device, but the program hangs at the cudaMalloc call without returning any error code, even when run in cuda-gdb.

#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>

using namespace std;

int main(int argc, char *argv[]) {
  int devices = 0;
  cudaGetDeviceCount(&devices);

  if (devices < 1) {
    cout << "No CUDA devices found!" << endl;
    exit(0);
  } else {
    cudaSetDevice(0);

    char *buf;

    cerr << "A" << endl;
    cudaMalloc((void **) &buf, 1); // Allocate 1 byte on the device -- this call never returns
    cerr << "B" << endl;           // never printed
  }

  cudaDeviceReset();
  return 0;
}
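
For completeness, here is the kind of error-checking wrapper we would normally put around every runtime call, just to rule out errors being returned and silently ignored (the CUDA_CHECK macro is my own, not from the SDK). In our case it never gets past the cudaMalloc line, so no error string is ever printed:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with file/line and error string if a runtime call fails.
// (My own helper macro, not part of the CUDA SDK.)
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                   cudaGetErrorString(err));                          \
      std::exit(1);                                                   \
    }                                                                 \
  } while (0)

int main() {
  char *buf = 0;
  CUDA_CHECK(cudaSetDevice(0));
  CUDA_CHECK(cudaMalloc((void **) &buf, 1)); // still hangs here, no error is ever returned
  CUDA_CHECK(cudaFree(buf));
  CUDA_CHECK(cudaDeviceReset());
  return 0;
}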

It also hangs at “Allocating GPU memory…” in the SobolQRNG sample from the CUDA SDK, while the deviceQuery sample runs without errors.
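
To narrow it down a bit, here is a minimal sketch of the same contrast. My assumption, not something we have confirmed, is that the difference is context creation: the property queries that deviceQuery relies on do not have to initialize a context on the device, whereas the first allocating call does. The cudaFree(0) below is just a common way to force context creation:

#include <cuda_runtime.h>
#include <iostream>

int main() {
  cudaDeviceProp prop;
  // Property query: roughly what deviceQuery does, and it works for us.
  if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
    std::cout << "Device 0: " << prop.name << " (compute "
              << prop.major << "." << prop.minor << ")" << std::endl;
  }

  std::cout << "Forcing context creation..." << std::endl;
  cudaFree(0); // forces the runtime to create a context; this is where it stalls for us
  std::cout << "Context created." << std::endl;
  return 0;
}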

Does anyone know what might be causing this, or how we can fix it?

Thanks!

So, any idea what changed over Christmas? Did you upgrade to CUDA 5.0, or did you update any drivers? I had a similar experience caused by a driver issue, so you could double-check that.
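
If it helps, something like this quick sketch prints the driver and runtime versions so you can spot a mismatch after an update (both calls only query versions and should not touch the device):

#include <cuda_runtime.h>
#include <iostream>

int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);   // highest CUDA version the installed driver supports
  cudaRuntimeGetVersion(&runtimeVersion); // version of the CUDA runtime the program was built against
  std::cout << "Driver supports CUDA " << driverVersion / 1000 << "."
            << (driverVersion % 100) / 10 << ", runtime is "
            << runtimeVersion / 1000 << "." << (runtimeVersion % 100) / 10
            << std::endl;
  return 0;
}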

We have been using CUDA 5.0 from the start, so no changes there. The only updates we have applied are those provided by Ubuntu's update manager. These, however, contain updates for almost everything, including new kernel versions.

As a side note, we just installed a GTX 650 to test compute capability 3.0, and it does NOT work on this card either.

We also just installed CUDA for the first time on one of the machines, and it does not work with the 2.x cards on that machine either.

It is also very interesting that it works sometimes but not others. We have not been able to pinpoint exactly in which situations it works, as it seems totally random.

Another thing: we have lately had some network issues. We currently use NIS and NFS on all the machines in the network, but there are problems with the NIS server (which should be fixed pretty soon) that make access to our home folders rather unstable. We have all put the CUDA environment variables in our .bashrc files, so there is a possibility that they do not get set when the network is unstable. Could this be a possible source of the problem?
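
To rule that out, a small check like the sketch below could print the relevant variables at program start; the names are just the ones from the standard CUDA install instructions, which is what we put in .bashrc:

#include <cstdlib>
#include <iostream>

int main() {
  // Print the variables we set in .bashrc; if the home folder was
  // unreachable when the shell started, these may come back empty.
  const char *names[] = { "PATH", "LD_LIBRARY_PATH" };
  for (int i = 0; i < 2; ++i) {
    const char *value = std::getenv(names[i]);
    std::cout << names[i] << " = " << (value ? value : "(not set)") << std::endl;
  }
  return 0;
}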

This issue has been solved. It occurred because of our network problems, so when we fixed the network issue everything started working again.
Probably, due to the network issues, the CUDA runtime could not access some libraries and therefore hung.
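
For anyone hitting the same thing: a quick way to check where the runtime library is actually loaded from (and whether that path sits on an NFS mount) is a sketch like the one below. It assumes libcudart is linked dynamically and needs -ldl at link time; the file name in the comment is just an example:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE // needed for dladdr on glibc
#endif
#include <dlfcn.h>
#include <cstdio>
#include <cuda_runtime.h>

// Build e.g. with: nvcc checklib.cu -o checklib -ldl
int main() {
  Dl_info info;
  // Resolve which shared object provides cudaMalloc (assumes a dynamic libcudart).
  if (dladdr(reinterpret_cast<void *>(&cudaMalloc), &info) && info.dli_fname) {
    std::printf("CUDA runtime loaded from: %s\n", info.dli_fname);
  } else {
    std::printf("Could not resolve cudaMalloc to a shared library.\n");
  }
  return 0;
}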