Problem using double precision arithmetic on GT200: incorrect results using double precision

I get incorrect results when using double precision arithmetic on a GT200. I am running under Linux with the latest driver, toolkit and SDK releases. Here is what I get from deviceQuery:

Device 0: “Quadro NVS 290”
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 267714560 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.92 GHz
Concurrent copy and execution: Yes

Device 1: “GT200”
Major revision number: 1
Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.51 GHz
Concurrent copy and execution: Yes

Test PASSED
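
For reference, double precision requires compute capability 1.3 or higher, so of the two devices listed above only device 1 qualifies (device 0 reports 1.1). A minimal sketch of that check in plain C follows. The compute_capability struct and both function names are hypothetical and used only for illustration; in a real program the major/minor values would come from cudaGetDeviceProperties().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record holding only the two deviceQuery fields of interest.
   In a real program these would be filled from cudaGetDeviceProperties(). */
typedef struct {
    int major;
    int minor;
} compute_capability;

/* Double precision needs compute capability 1.3 or higher. */
int supports_double(compute_capability cc)
{
    return cc.major > 1 || (cc.major == 1 && cc.minor >= 3);
}

/* Return the index of the first device that supports double precision,
   or -1 if none of the listed devices does. */
int first_double_device(const compute_capability *devs, size_t count)
{
    assert(devs != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (supports_double(devs[i])) {
            return (int)i;
        }
    }
    return -1;
}
```

With the two devices from the listing, {1, 1} and {1, 3}, first_double_device returns 1, which matches the device index passed to cudaSetDevice in this post.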

To test double precision and to illustrate the problem, I modified the simpleCUBLAS example to call cublasDgemm (attached). I set the device with

int device = 1;
cudaSetDevice(device);

before cublasInit(). The program runs correctly for N = 32 and some other small values of N, but for N = 64 (and other larger values) the result differs substantially from the reference implementation, e.g.
for N = 64
||error|| = 44.622281
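
To make clear what is being compared, here is a rough sketch of a host-side reference check, assuming square column-major matrices as CUBLAS uses. It is not the attached code. The names ref_dgemm, squared_error and matrices_match are hypothetical, as is the rel_tol parameter. The error is kept squared so the sketch needs no math library; ||error|| as printed above would be its square root.

```c
#include <assert.h>
#include <stddef.h>

/* Reference C = alpha*A*B + beta*C for n x n column-major matrices,
   where element (i, j) is stored at index j*n + i. */
void ref_dgemm(int n, double alpha, const double *A, const double *B,
               double beta, double *C)
{
    assert(n > 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double prod = 0.0;
            for (int k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

/* Sum of squared element-wise differences between ref and got. */
double squared_error(const double *ref, const double *got, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double d = ref[i] - got[i];
        sum += d * d;
    }
    return sum;
}

/* Return 1 if ||ref - got|| <= rel_tol * ||ref||, else 0.
   Both sides are compared squared to avoid calling sqrt. */
int matrices_match(const double *ref, const double *got, size_t count,
                   double rel_tol)
{
    double ref_sq = 0.0;
    for (size_t i = 0; i < count; ++i) {
        ref_sq += ref[i] * ref[i];
    }
    return squared_error(ref, got, count) <= rel_tol * rel_tol * ref_sq;
}
```

With a check of this kind, a correct double precision result should differ from the host reference only by rounding, so an ||error|| of about 44.6 at N = 64 points to wrong output and not to accumulated rounding error.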

If I instead call cudaSetDevice(device) after cublasInit(), then cublasGetError(), called after cublasDgemm, returns a value other than CUBLAS_STATUS_SUCCESS.

The GT200 works correctly in single precision. The earlier T10P with reduced capability also worked in double precision, but it was the only card installed in our system. We replaced it (in the same system) with the new T10P, and the only other difference is that we added a second card.

Any advice on what may be going wrong?