I have encountered a really strange problem on my GTX295. I have a CFD program that runs fine on a computer with a single GTX460, and also on another one with a Tesla C2050 + GTX560.
However, if I start it on the GTX295, strange, random data changes happen in memory, which corrupt the convergence of the iterative solver. I have noticed, though, that if I choose device 1 and not device 0 (the program usually starts simulations on both GPUs in parallel), then it does not happen. Maybe I have missed something, but does anybody have any idea why this is happening, and how I could fix it?
I use CUDA 4.0 and the GeForce 280.26 driver for Windows 7 x64.
Thanks for the help in advance!
Device 0 is presumably the device running your display. That device will have the watchdog timer running, so it is possible to have a kernel terminated prematurely by the driver if it runs more than a few seconds. How are you checking the return codes from CUDA calls? This would be your indication that kernels are not finishing.
However, I do not think that the kernel runs too long. On the one hand, since it is the display device, I have TDR enabled; on the other hand, the kernel calls usually take less than 5 milliseconds (instead of very long kernel calls, the program uses a huge number of shorter kernels, which is the only way to implement the Red-Black Gauss-Seidel SOR iteration correctly). Otherwise, it would not work on my other two computers either, where the GTX460 and the 560 are the display devices too.
To be sure, I have added a check with cudaGetLastError() after the kernel calls, and the program stops on any error. The program hasn’t stopped, but it returns different results every time… (???)
What is even more frightening is that sometimes I cannot get the program to work, and then it suddenly starts working, without restarting or reinstalling anything. This behaviour makes it very difficult to debug.
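For readers wondering why a red-black scheme leads to many short kernel launches, here is a minimal host-side C++ sketch. It is not the poster's solver (which is not shown in this thread); it assumes a 1D Laplace problem with fixed boundary values, and the names red_black_sor_iteration and solve_laplace_1d are made up for illustration. Each color pass only reads points of the other color, so on a GPU each pass would typically be its own kernel launch, with the launch boundary acting as the global synchronization point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One red-black SOR iteration for the 1D Laplace equation u'' = 0.
// u.front() and u.back() are fixed boundary values and are never written.
// Interior points are split into two colors by index parity.
void red_black_sor_iteration(std::vector<double>& u, double omega) {
    assert(u.size() >= 3);
    const std::size_t last = u.size() - 1;
    for (std::size_t color = 0; color < 2; ++color) {
        // On a GPU this inner loop would correspond to one kernel launch:
        // every point of the current color reads only neighbors of the
        // other color, so no two threads touch the same location in a pass.
        for (std::size_t i = 1 + color; i < last; i += 2) {
            const double gauss_seidel = 0.5 * (u[i - 1] + u[i + 1]);
            u[i] = (1.0 - omega) * u[i] + omega * gauss_seidel;
        }
    }
}

// Runs a fixed number of iterations starting from zero in the interior.
std::vector<double> solve_laplace_1d(std::size_t n, double left, double right,
                                     double omega, int iterations) {
    assert(n >= 3);
    std::vector<double> u(n, 0.0);
    u.front() = left;
    u.back() = right;
    for (int k = 0; k < iterations; ++k) {
        red_black_sor_iteration(u, omega);
    }
    return u;
}
```

Calling solve_laplace_1d(9, 0.0, 1.0, 1.5, 200) should converge to the straight line between the two boundary values. The second color pass must see the finished results of the first, which is why the two passes cannot simply be merged into one long kernel without a grid-wide barrier.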
OK, this kind of device-dependent chaotic behavior is starting to sound more like a memory access violation in your code, or a bug in the driver that is outside your control. Unfortunately, I’m not familiar with what the debug options on Windows are. On Linux, I would at this point run cuda-memcheck to see if it could find any places where I was accessing unallocated memory. In general, these tools can’t work on the device managing the display, but it is worth running some kind of memory check on the non-display device, even if it appears to be working.
I have been thinking about it, but I could not find any place where the program would access an uninitialized memory location. And if it did, why does it not happen on the other two computers (where I use the display device too)?
Using uninitialized memory can sometimes do the right thing by accident. I’ve had bugs that only manifested on particular operating systems, or even on a particular GPU in a specific PCI-Express slot, that ultimately were due to using uninitialized memory.
Unfortunately, I think that cuda-memcheck can only detect out-of-bounds memory access violations. If you are using correctly allocated memory that is not initialized, I think the only tool that can discover that is Ocelot, which doesn’t come with the CUDA toolkit. (And is quite involved to install, last I checked.)
One technique for finding accesses to allocated, but uninitialized, memory is to fill all allocations with the byte 0xFF using cudaMemset() prior to running the kernel. This works especially well for floating-point data, since a bit pattern of all 1s maps to NaN, and NaN propagates through almost any arithmetic that touches it. To find uses of uninitialized integer data, two passes may be necessary, e.g. pre-initialize memory with 0x55 for the first pass, then 0xAA for the second pass, and compare the outcomes.
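To illustrate the bit patterns involved, here is a small host-side C++ sketch that uses std::memset as a stand-in for cudaMemset() (both set every byte of the buffer to the same 8-bit value). The helper names filled_buffer, count_nans and demo_poison_patterns are invented for this example and are not part of any CUDA API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side stand-in for cudaMemset(ptr, value, bytes): every byte of the
// buffer is set to byte_value. T must be a plain arithmetic type.
template <typename T>
std::vector<T> filled_buffer(std::size_t count, unsigned char byte_value) {
    std::vector<T> buffer(count);
    std::memset(buffer.data(), byte_value, count * sizeof(T));
    return buffer;
}

// Counts how many entries of a floating-point buffer are NaN.
template <typename T>
std::size_t count_nans(const std::vector<T>& values) {
    std::size_t nans = 0;
    for (T v : values) {
        if (std::isnan(v)) ++nans;
    }
    return nans;
}

// Usage example: shows what the three fill bytes look like once the
// memory is reinterpreted as float, double, or 32-bit integers.
void demo_poison_patterns() {
    // All-ones bytes give an all-ones exponent and a nonzero mantissa,
    // which is a NaN in both IEEE 754 single and double precision.
    assert(count_nans(filled_buffer<float>(16, 0xFF)) == 16);
    assert(count_nans(filled_buffer<double>(16, 0xFF)) == 16);

    // For integers there is no NaN, so two complementary patterns are used;
    // a result that differs between the two runs depended on the fill value.
    assert(filled_buffer<std::uint32_t>(4, 0x55)[0] == 0x55555555u);
    assert(filled_buffer<std::uint32_t>(4, 0xAA)[0] == 0xAAAAAAAAu);
}
```

On the device, the same idea would be a cudaMemset() with 0xFF right after each cudaMalloc(); any NaN that later shows up in the downloaded results then points to a value that was read before it was written.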
In addition to possible use of out-of-bounds access or use of uninitialized memory, have you checked for potential race conditions, e.g. a missing __syncthreads() when operating on shared memory? Or multiple threads writing data to the same memory location, maybe through an error in an indexing calculation?
As for shared memory, the failing kernel does not use any. On the one hand it would require too much; on the other hand the program is already so fast without shared memory that writing the results to the HDD takes more time than the calculation itself, so there was no need to speed it up further. It does not use __syncthreads() either: every thread writes to one location, which cannot be accessed by any other thread (this is controlled through the indexing).
I have checked the code several times, and haven't found any problem yet. What makes me more confident that there is no multiple access to a single memory location (or to an uninitialized one) is that the program has a CPU version as well, and when it runs correctly on the GPU, the two results agree to 12 digits, even after a huge number of calculations. If there were wrong indexing, it would destroy the correct results...
At the moment, as I lack a 2nd GPU for debugging, I use a very simple idea to check the GPU memory: I download the data to host memory at specific places in the code and save it to disk. This way I can easily compare it with the data generated by the CPU or by another GPU. My main problem with this error is that I have checked for a single iteration that all data are the same and initialized correctly. After that, the program does not allocate any new memory, but works only on the previously allocated memory. So if no unallocated memory was accessed in the first step, then it should not happen later either.
I will keep looking into the problem. It might be that I have simply missed something, but at the moment I do not think so. The program has been running different cases for 2 years without a single fault, so I see no reason for one to appear now…
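The dump-and-compare check described above could be automated roughly as follows. This is a hedged host-side C++ sketch, not the poster's actual tooling: first_mismatch is a hypothetical helper, and it assumes both dumps have already been read back into arrays of double.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first element where the two arrays differ by more
// than rel_tol relative to the larger magnitude, or -1 if they all agree.
// NaN and other non-finite differences are reported as mismatches.
std::ptrdiff_t first_mismatch(const std::vector<double>& reference,
                              const std::vector<double>& candidate,
                              double rel_tol) {
    assert(reference.size() == candidate.size());
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double scale =
            std::fmax(std::fabs(reference[i]), std::fabs(candidate[i]));
        const double diff = std::fabs(reference[i] - candidate[i]);
        // Written as !(diff <= tol) so that a NaN diff fails the check.
        if (!(diff <= rel_tol * scale)) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

With a tolerance of 1e-12 this mirrors the "same up to 12 digits" agreement mentioned earlier. Running it on the dump from every iteration, rather than only the first, would show the earliest iteration and array index at which the device 0 results start to drift away from the CPU reference.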