CUDA initialization hangs when a shared library is loaded - pulling my hair out

OK, here’s a really wacky problem that I have absolutely no idea about. The one saving grace is that it only appears to be happening on one specific node. Hopefully a reboot will solve it (I’m waiting for the cluster admins to do so). However, it has resulted in much hair-pulling, so I thought I’d share.

This node has been up and running CUDA 3.2 and its drivers fine for a few months. Recently, certain jobs started hanging on it. We narrowed it down to a single plugin (by plugin, I mean a .so file loaded into python with an import command). With the plugin loaded, the process hangs at the very first CUDA call in the application. Without the plugin loaded, the application runs flawlessly. In debugging this, I removed everything from the plugin except the init_ function that python calls (the function was left empty). Loading other plugins works fine. How’s that for strange behavior! I can only guess that somehow this particular plugin, by name alone, ends up at some point in the memory space such that it triggers a problem in the CUDA runtime/driver/somewhere…
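
For reference, here is roughly what the stripped-down plugin looks like after gutting it (a minimal sketch assuming a Python 2 style C extension module; the module name “plugin” is hypothetical, not the real plugin’s name):

```c
/* Minimal sketch of the stripped-down plugin: a Python 2 style C extension
 * whose init function does nothing at all. The module name "plugin" is
 * hypothetical. */
#include <Python.h>

PyMODINIT_FUNC initplugin(void)
{
    /* intentionally left empty - yet with this .so imported, the application's
     * first CUDA call still hangs */
}
```

It is built as a shared library (something like gcc -shared -fPIC -I/usr/include/python2.4 plugin.c -o plugin.so) and loaded with a plain import.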

If I attach gdb to the hung process and ask for a backtrace, this is what I get.

#0  0x00002b4ddadc5a6f in fcntl () from /lib64/libpthread.so.0
#1  0x00002b4de85270f1 in pthread_attr_setdetachstate () from /usr/lib64/libcuda.so
#2  0x00002b4de7f31a44 in pthread_attr_setdetachstate () from /usr/lib64/libcuda.so
#3  0x00002b4de7f11626 in pthread_attr_setdetachstate () from /usr/lib64/libcuda.so
#4  0x00002b4de7ef6069 in pthread_attr_setdetachstate () from /usr/lib64/libcuda.so
#5  0x00002b4de7fc9459 in pthread_attr_setdetachstate () from /usr/lib64/libcuda.so
#6  0x00002b4ddb943f4e in ?? () from /home/software/rhel5/cuda/3.2/cuda/lib64/libcudart.so.3
#7  0x00002b4ddb936fbb in ?? () from /home/software/rhel5/cuda/3.2/cuda/lib64/libcudart.so.3
#8  0x00002b4ddb93ab3c in ?? () from /home/software/rhel5/cuda/3.2/cuda/lib64/libcudart.so.3
#9  0x00002b4ddb934379 in ?? () from /home/software/rhel5/cuda/3.2/cuda/lib64/libcudart.so.3
#10 0x00002b4ddb96983b in cudaMalloc () from /home/software/rhel5/cuda/3.2/cuda/lib64/libcudart.so.3

.... application-specific stuff below; this cudaMalloc is the first CUDA call made - the one that initializes the context
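
For illustration, any minimal program whose first runtime API call is a cudaMalloc hits the same spot, since that first call is what creates the context (a sketch, not hoomd’s actual code):

```c
/* Sketch of a minimal "first CUDA call" test; not hoomd's actual code. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *d_ptr = NULL;
    /* The first runtime API call implicitly creates the CUDA context;
     * on the affected node the process never returns from this call. */
    cudaError_t err = cudaMalloc(&d_ptr, 1024);
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    cudaFree(d_ptr);
    return 0;
}
```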

Since this only happens on one particular node, I can’t make a repro case, and I expect (well, hope) that a reboot solves the issue - so one might just call this a broken driver that needs a reboot to clean up. What’s so weird about it is that it doesn’t return any error codes, and most jobs run just fine.

That’s a GPU sync. If rebooting the node doesn’t help, it’s potentially a hardware issue.

Hmm… but it looks like other apps are running fine while only certain jobs are hanging…
Tim, if that’s a hardware issue, what hardware could potentially be affected? Can you tell us more?

Falls off of PCIe, something like that. I’d bet money that that’s the innermost synchronization loop, so it’s the first synchronization with the GPU after initializing the context that’s failing.

Tim, thanks for the quick reply…

But uh… sorry, I can’t really understand your concise reply.

I don’t see any loops in DrAnderson’s code. I guess you are talking about some internal stuff (the implementation of cudaMalloc?).
By “falls off of PCIe” - do you mean a PCIe error or something, which causes a particular wake-up not to happen?
Otherwise, the hardware is up and running stable…
Is my understanding correct?

Thanks for your time,

Thanks for the info, Tim.

Well, if it is a hardware problem, then at least 3 nodes developed the problem simultaneously (which seems unlikely).

A reboot of the first node I mentioned didn’t appear to resolve the issue. Now I’ve got a new build of hoomd that exhibits the problem without dynamically loading any additional libs. This new build differs from the previous one in that it links against cufft and has a few added classes and functions.

I’ll have more testing information in a day or two, as jobs complete and free up nodes on the cluster. What I have from the nodes that are free right now is interesting:

5 Tesla S1070 nodes initialize CUDA and run my test fine.
3 Tesla S2050 nodes hang in the CUDA initialization with the same backtrace as in the OP.

Notably: (so far) 0 of the available S2050 nodes pass the test, and 0 of the available S1070 nodes fail the test.

Our S2050s are hosted on x86_64 RHEL5 boxes with kernel 2.6.18-194.11.4.el5 #1 SMP and NVIDIA drivers 260.19.21. The S2050s are attached with the PCIe interface card that connects all 4 GPUs to one slot.

Would trying driver version 260.19.36 be likely to change anything?

Update:

After a full power cycle of the nodes and GPUs (instead of a warm reboot as before), the behavior has now changed.

Of the two nodes that were power cycled by the admins, one now runs all 3 builds of hoomd flawlessly. The second node runs the normal build fine, but the normal build with the plugin loaded has a 40-second delay when initializing the CUDA context. The newest build also runs, but has a 1m30s delay when initializing the context.

So now the issue appears to have become the typical long-initialization-delay-on-a-headless-cluster problem that so many experience. It is just odd that the usual workaround (running nvidia-smi --loop-continuously in the background) isn’t solving it here, and that the delay comes and goes depending on which libraries are linked into the executable. We’ll see what happens in a week when the maintenance is complete and all 24 nodes have been power cycled.
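
For reference, the nvidia-smi loop is just one way of keeping driver state resident on a headless node; a rough equivalent (my own sketch, not an NVIDIA tool) is a background keep-alive process that holds a CUDA context open:

```c
/* Keep-alive sketch: hold a CUDA context open in a background process so the
 * driver stays initialized on a headless node. My own sketch, roughly
 * equivalent in spirit to running nvidia-smi in a loop. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* cudaFree(0) forces context creation without allocating anything. */
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "context init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    /* Sleep forever, keeping the context (and driver state) alive. */
    for (;;)
        pause();
}
```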

Update 2: (getting this out here so no one else has to go through the same pain and suffering I did).

This problem has gotten much worse over the last few months. Random jobs started showing this behavior, not just ones with certain plugins loaded. Sometimes a soft reset would solve the problem; sometimes a cold boot would. We upgraded to CUDA 4.0, hoping that the situation would get better. Now ALL of the S2050 nodes exhibit this problem from a cold start. And “the problem” is not just a delayed initialization: any CUDA process will hang indefinitely (I’ve seen hangs of 100+ hours).

You won’t believe what the root cause turned out to be… That fcntl in the backtrace where the freeze is occurring? This is it (from strace):

open("/home/joaander/.nv/ComputeCache/index", O_RDWR) = 19

fcntl(19, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>

Our cluster uses NFS home directories. This fcntl is attempting to obtain a write lock on a file on NFS, and for some reason that system call freezes indefinitely on our cluster. Setting HOME=/tmp lets the jobs run fine, and now I’m testing CUDA_CACHE_DISABLE=1 to see if that will also work around the problem.
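
To take the GPU out of the picture entirely, the same lock attempt can be reproduced with a few lines of plain C mimicking the strace output above (a sketch; the file path is arbitrary, so point it at anything under the NFS home directory):

```c
/* Sketch of a standalone NFS lock test that mimics the fcntl() call shown in
 * the strace above; no CUDA involved. The file path is arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "lock_test";  /* file on NFS */
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock lk;
    memset(&lk, 0, sizeof(lk));
    lk.l_type = F_WRLCK;    /* write lock, as in the strace */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;           /* 0 means lock the whole file */

    printf("requesting write lock on %s...\n", path);
    if (fcntl(fd, F_SETLK, &lk) < 0) {  /* a broken NFS lock service hangs here */
        perror("fcntl(F_SETLK)");
        return 1;
    }
    printf("lock acquired - NFS locking works here\n");
    return 0;
}
```

Compiled with plain gcc and run against a file under $HOME, this should hang in exactly the same way on the affected nodes, with no GPU involved at all.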

I’ve raised this issue with the NVIDIA bug report tool - we’ll see what the developers have to say about it.

Network and GPU are clashing?

The cluster admins report that since they switched to an Isilon backend for the NFS home directories, they’ve been having a lot of problems with file locking.

Ah… purely network stuff… So the thing should hang even without the GPUs then…
My friend works for Isilon… I will check with him.