Atomic Add with Doubles

My program requires an atomic add to accumulate the "energy" absorbed by each primitive. I have been running into precision issues with floats, so I am trying to implement a double-precision version. Since my GPU has compute capability < 6.0, the recommended approach is to implement atomicAdd for doubles using atomicCAS, as shown below:

#if __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif

This code compiles successfully and generates the PTX, but the OptiX compilation fails at runtime with the following error:

OptiX Error: Unknown error (Details: Function “RTresult _rtProgramCreateFromPTXFile(RTcontext, const char*, const char*, RTprogram_api**)” caught exception: Compile Error: Cannot perform indirect call to functions which are not callable programs. at: [ Instruction: %352 = call i64 bitcast (float (%“struct.cort::CanonicalState”, %“struct.cort::UberPointer”, i32, i1, float, float) @_UberPointerMemAtomicOp to i64 (%“struct.cort::CanonicalState”, %“struct.cort::UberPointer”, i32, i1, i64, i64))(%“struct.cort::CanonicalState”* %0, %“struct.cort::UberPointer” %339, i32 7, i1 false, i64 %“%rd196.0”, i64 %351), contained in basic block: BB1_11, in function: _Z19closest_hit_diffusev_cp7, in module: Canonical__Z19closest_hit_diffusev

It seems to be failing because of the call to atomicCAS: if I comment out only that line, the program runs without error.

Please let me know if you have any ideas for working around this. I am using OptiX 4.0.1 with CUDA 8.0 on a compute capability 3.5 GPU.

I'll file a bug report to check whether atomicCAS is handled properly, but it probably won't get any attention until after SIGGRAPH. Did you check whether splitting the __double_as_longlong call out into a temporary, to simplify the atomicCAS call signature, makes any difference?
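In case it helps, a sketch of what I mean: the compare-and-swap loop is unchanged, but the new bit pattern is computed into a named temporary first, so the atomicCAS call itself takes only plain integer arguments. This is just the standard workaround rewritten; I have not verified that it changes the OptiX compiler's behavior.

    #if __CUDA_ARCH__ < 600
    __device__ double atomicAdd(double* address, double val)
    {
        unsigned long long int* address_as_ull = (unsigned long long int*)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            // Compute the new bit pattern in a temporary so the atomicCAS
            // arguments contain no nested intrinsic calls.
            unsigned long long int newval =
                __double_as_longlong(val + __longlong_as_double(assumed));
            old = atomicCAS(address_as_ull, assumed, newval);
        } while (assumed != old);
        return __longlong_as_double(old);
    }
    #endif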

As an alternative to atomics, you can sometimes accumulate all values into a larger buffer and then do the reduction as a post-process with CUDA interop. The scheme breaks down once the buffer gets too large, which in your case would mean a large enough number of primitives or rays. Just mentioning this as a possible workaround.
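To illustrate the idea, here is a hedged host-side sketch of the post-process step (the function name, buffer layout, and sizes are hypothetical, not OptiX API). Each launch index writes its contribution for each primitive into its own slot of a numPrimitives x numThreads buffer on the device, so no atomics are needed; afterwards the host (or a CUDA kernel) sums across the per-thread slots:

    #include <cstddef>
    #include <vector>

    // Hypothetical post-process reduction. `buffer` holds one slot per
    // (primitive, launch index) pair, laid out primitive-major, so the
    // device side never needed an atomic add.
    std::vector<double> reduceEnergy(const std::vector<double>& buffer,
                                     std::size_t numPrimitives,
                                     std::size_t numThreads)
    {
        std::vector<double> energy(numPrimitives, 0.0);
        for (std::size_t p = 0; p < numPrimitives; ++p)
            for (std::size_t t = 0; t < numThreads; ++t)
                energy[p] += buffer[p * numThreads + t];
        return energy;
    }

The memory cost is numPrimitives x numThreads doubles, which is why this stops scaling for very large primitive or ray counts.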

Go ahead and send us a simple trace for this if you have time. Bug reports get more attention with a customer reproducer.