Issues with double precision support on GT200

I recently obtained a GTX 260 and wanted to test double precision support. I figured a simple way to do it would be to switch all the float variables to double in the matrixMul example from the SDK.

I did that, and it compiles and runs without errors, but the numbers are completely wrong. Not only that, but commenting out the kernel call still gives me the exact same numbers, so it seems the kernel is either not doing anything or writing to the wrong part of memory.

I am confused about how double precision is supported. I'm running the latest version of CUDA (2.0 beta2, on Linux), which I thought would be all that was needed.

Just to make it clear here are a couple of snippets from the code (but it is literally just the matrixMul example with the float variables changed to double):



unsigned int mem_size_C = sizeof(double) * size_C;

// allocate device memory for result
double* d_C;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size_C));

__global__ void
matrixMul( double* C, double* A, double* B, int wA, int wB)

__shared__ double As[BLOCK_SIZE][BLOCK_SIZE];

While your problem does not sound like it has anything to do with that, make sure you are compiling for compute model 1.3, otherwise you will only get single precision.
I assume you changed the randomInit function etc. to double, too?
Are the values calculated on the CPU via computeGold correct?
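As a side note, you can also confirm at runtime that the device itself has double-precision hardware (a GT200-based GTX 260 should report compute capability 1.3). A minimal sketch using the runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the first CUDA device and print its compute capability.
    // Double precision needs compute capability 1.3 or higher.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (prop.major == 1 && prop.minor < 3)
        printf("This device has no double-precision hardware.\n");
    return 0;
}
```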

That's because the pointer returned by cudaMalloc() might have been the same between the previous run (with the kernel) and the next run (without invoking the kernel). Since the global memory was not overwritten, you got the same output. That is a possible reason…

I have seen such behaviour before.
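One way to rule that out (a minimal sketch, reusing the d_C, mem_size_C, and h_C names from the matrixMul example): poison the result buffer before the launch, so leftover data from an earlier run cannot masquerade as kernel output.

```cuda
// Fill the result buffer with a recognizable garbage pattern.
// If the kernel never runs (or writes elsewhere), the copied-back
// h_C will contain this pattern instead of plausible products.
CUDA_SAFE_CALL(cudaMemset(d_C, 0xFF, mem_size_C));

// ... launch the matrixMul kernel here ...

CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost));
```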

Where do I check if I’m compiling for 1.3?

I did change all the helper functions, and computeGold is working fine, as I get basically the same numbers there as with single precision.

That is a good point, and I've seen it happen before, but it's not the case here: I've tried changing the kernel on purpose so it would give me different numbers, but they stayed the same.

You need to pass -arch compute_13 or -code compute_13 (depending on the compilation target, I think) to nvcc. Note that this is only in theory: I only tried it once and it did not work (probably I did something stupid).

Best to compile the code with -ptx (or, better, with -cubin and use decuda) to verify that it actually generates double-precision operations.

To enable double precision, you need to pass the flag “-arch sm_13” to nvcc

nvcc -arch sm_13
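To illustrate why the flag matters, here is a trivial double-precision kernel (hypothetical, not from the SDK sample) you can compile both ways. If I remember correctly, without -arch sm_13 nvcc silently demotes double to float (with a warning), so the generated PTX contains only .f32 operations instead of .f64 ones.

```cuda
// scale.cu -- minimal double-precision kernel for checking codegen.
__global__ void scale(double* x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;   // should compile to mul.f64 with -arch sm_13
}
// Compile with:  nvcc -arch sm_13 -ptx scale.cu
// then look for .f64 instructions in scale.ptx to confirm that real
// double-precision operations were emitted.
```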

Thanks, this worked.