I’m developing a library which incorporates a linear equation solver for lattice field theory. However, I’m running into trouble when I compile using CUDA 2.2. Rolling back to CUDA 2.1, I find NVCC does not compile the code at all (I think this is down to the templates used in my code, a problem that was fixed again in 2.2), but when compiled under 2.0 the code works absolutely fine (Linux driver 185.18.08 beta, 64-bit, running on CentOS 5.3 for all tests).
In this application, I have many different kernels, each of which is concerned with the application of a large sparse matrix to a vector (the different kernels are for different precisions, whether to transpose the matrix or not, etc.). In one of these kernels, the results produced by the GPU deviate completely from the CPU results; however, they do so only at regularly spaced intervals in the result vector. These results are exactly reproducible and are unchanged after a reboot, etc. The kernel that gets the wrong answer uses double precision for both the matrix and the vector. The code does not use shared memory at all.
What’s also interesting to note is that when compiled in device emulation mode, the code gets the correct answer. Attempts to compile the code for the cuda-gdb debugger failed with errors complaining about a lack of shared memory.
I was wondering if anyone else is having trouble with CUDA 2.2? Are there any known issues with 2.2 that I should be aware of that could relate to the odd behaviour I’m seeing?