Random behaviour with TESLA C870

Hi all,

I have been running my CUDA-based program for the last 6 months using CUDA 1.0 on a GeForce 8600M GT in my laptop under OpenSUSE 10.2. It worked perfectly until last week, when I changed platforms, and this is giving me some problems. My new platform is an Intel Quad Core 2.4 GHz with OpenSUSE 10.2 and CUDA 1.1. My program now runs on a TESLA C870 and the pseudocode is basically as follows:

while (n < 1000)
{
    kernelA<<<grid, block>>>(...);   // several kernel launches (names are placeholders)
    kernelB<<<grid, block>>>(...);
    updateSource();                  // reads some bytes back from the device, modifies them, writes them back
    n++;
}
The problem is that the while loop never executes 1000 times. It typically performs a random number of iterations, somewhere between 50 and 600, more or less. I thought that it might be related to the 5-second running time restriction for kernels, but the C870 is not connected to an X display and I am quite sure that the running time of these kernels is on the order of milliseconds.

Does anyone know which changes from CUDA 1.0 to CUDA 1.1 can produce such a behaviour?

Thank you in advance for your help.

I don’t think it’s related to the X server’s 5-second watchdog timer, since the card just isn’t connected to a display!

However, as you mentioned there’s a version upgrade, so maybe you should diff the two generated PTX files to see what’s going on.

CUDA 1.0 was blocking on each kernel launch.
Try putting cudaThreadSynchronize() before updateSource(), or enable the profiler with export CUDA_PROFILE=1 (it will default back to blocking).
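In case it helps, here is a minimal sketch of what I mean; the kernel name and arguments are placeholders, not your actual code:

```cuda
// Kernel launches are asynchronous in CUDA 1.1, so force completion and
// surface any launch error before touching device data on the host.
myKernel<<<grid, block>>>(d_data);
cudaError_t err = cudaThreadSynchronize();   // blocks until the kernel is done
if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
updateSource();   // now safe to read back and modify device memory
```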

Thank you! I will try this on Monday when I am back at the lab. In any case, it is weird that this only happens with the TESLA C870 and not when I execute on the main graphics card…

Actually, my updateSource() reads some bytes from the device, modifies them and writes them back. So, are you saying that maybe the device is starting some kernel executions or cudaMemcpy calls without finishing the previous ones, due to asynchronous execution?
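Roughly, updateSource() looks like this (buffer names simplified here for the sake of discussion):

```cuda
// Round trip: plain cudaMemcpy() is synchronous, so each copy should
// already wait for previously launched kernels to finish.
unsigned char h_buf[N];
cudaMemcpy(h_buf, d_src, N, cudaMemcpyDeviceToHost);  // waits for prior kernels, then copies
/* ... modify h_buf on the host ... */
cudaMemcpy(d_src, h_buf, N, cudaMemcpyHostToDevice);  // synchronous copy back
```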

Hi again. Unfortunately, this does not seem to be the reason. Actually, I do not have this problem when the program runs on the other graphics card (GeForce 8600 GTS). Is there maybe some fundamental difference between the GF8600GTS and the TESLA C870 that I should be aware of?

That’s interesting; I also experience random behaviour with a Tesla C870 with similar code: several kernels inside a loop (approx. 100 iterations). I have a 64-bit system (Fedora 8) with CUDA 1.1.
As in your case, it is random: sometimes it executes all 100 iterations, sometimes it hangs before finishing. I have an 8500 GT for display and, as in your case, when running on this card there is no problem.

Moreover, I had a previous platform with an 8800 GTX and CUDA 1.1 running on a 32-bit Fedora system which was running fine, so we can assume the problem doesn’t come from CUDA 1.1.

I first thought that the problem could be related to the move to the 64-bit system (the 64-bit compiler seems quite buggy), but that wouldn’t explain why the program runs fine on the other card (8500 GT for me, 8600 GTS for you). Anyway, it would be interesting to know if your new platform is 64-bit too.

The only difference I see between the Tesla and the 8500/8600 GT is that the Tesla is compute capability 1.0 while the others are 1.1, but that shouldn’t change anything, right?

Is it possible that it is a bug in the 64-bit compiler in conjunction with the Tesla card?

Sorry to disappoint you, but my system is 32-bit OpenSUSE 10.2.

I have tried to narrow down the problem by having just one kernel call in the loop. When I do this, I am able to evaluate a lot more loop iterations, but it eventually stops at some point. For example, with just one kernel invocation I get between 2000 and 6000 iterations.

Is it possible that the card overheats and therefore stops executing? How can I monitor the temperature, apart from using the nvidia-settings command?

It’s really driving me mad because we bought the TESLA C870 to run precisely this algorithm and it doesn’t seem to work.

At least we know it doesn’t come from the 64 bit system.

It is surprising to see that the problem seems to come from the Tesla C870 itself. It would be good to hear from Nvidia what difference this card has that could pose a problem.

Maybe try :

  • Upgrade-flashing the BIOS of your motherboard? (An obscure incompatibility because of a non-graphics card on the PCIe bus? Or something else that a BIOS upgrade might fix?)

  • Running the machine with no X server at all, using either a small old PCI card as a text-only terminal (with the BIOS set to boot PCI first) and/or the onboard GPU (if your motherboard features one)?

That’s why I bought a 9800 GTX for my research. Cheaper, but still does the trick and comes with a 1.1 compute capability.

On the other hand, my research doesn’t require 1.5GiB of RAM, so that’s why I can do it.

I’ve seen random ULF (unspecified launch failure)/timeout errors occur in kernels. The usual cause is reading or writing outside of allocated memory. Why this would show up on some GPUs and not others, I don’t know.
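The classic pattern that avoids this, in case it’s useful (the names here are just illustrative): when the array length isn’t a multiple of the block size, the trailing threads have to be masked off:

```cuda
// Bounds guard: without the `if`, threads past the end of the array
// read/write unallocated memory, which can show up as random ULFs.
__global__ void scale(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}
```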

Some kernels, even if they don’t access outside the memory just crash if called repeatedly. I’ve never found out why, I just changed the kernel a little bit and the problem went away. Although, these problem kernels failed on all GPUs I tried (Tesla, 8800 GTX, and 8800 GT). So it doesn’t directly apply to you. See the previous discussion, along with a reproduction test at http://forums.nvidia.com/index.php?showtopic=59188

I’ve also seen stability problems if there are too many calls to __syncthreads().

Beyond a certain number of calls, the kernel can’t be run at all in device emulation mode, and may randomly fail on the real hardware.

After further research, I have some more information about these random kernel failures:

Successive invocations of some kernels fail after a random number of calls.

System: 32-bit OpenSUSE 10.2 running on an Intel Quad Core CPU @ 2.4 GHz.

-So far, it only happens with the TESLA C870.
-The kernel call never returns (I have to press Ctrl+C to kill the program), so there is no way to figure out what the problem is, because device emulation mode runs perfectly.
-The temperature of the TESLA C870 doesn’t seem to go beyond 70 degrees Celsius, so I guess we can discard cooling problems.

I will keep this post updated as I find more information.