atomicAdd support for double: which card?

Hi,

Which card models support, or will support, atomicAdd on double types?

Thanks

None of them, as far as I can tell. Fermi adds 32-bit floating-point support for atomicAdd(), but that’s it.

Does that mean float is the best we can use if we need atomic operations?

If the values we have are doubles with 6 decimal places and we assign them to float variables, how much accuracy will we lose? If that is acceptable, then we can use the converted float variables with the atomic operations.

Yes, that means that you can only use single-precision floats with atomic operations.

However, a single-precision float is accurate to roughly 7 significant decimal digits, so in this case, you shouldn’t lose any precision.

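To make the claim concrete, here is a quick host-side check (added for illustration; the value is arbitrary): the rounding error from a double-to-float conversion shows up around the seventh significant digit.

[codebox]
// Illustrative only: round-trip a double through a float and print the error.
#include <stdio.h>

int main(void)
{
    double d = 123.456789;   // value with 6 decimal places
    float  f = (float)d;     // narrowing conversion
    printf("double: %.9f\nfloat : %.9f\nerror : %.9f\n",
           d, (double)f, d - (double)f);
    return 0;
}
[/codebox]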

Thanks for the reply. The card we have now only supports integers with the atomic operations. Does that mean that when we assign a double value to an integer variable, we will lose all the digits after the decimal point?

By the way, I am very interested in GPU.Net. I will give it a try when the beta is released.

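One possible workaround, sketched here for illustration (it is not from the original thread, and the SCALE macro and accumulateFixedPoint kernel are hypothetical names), is fixed-point accumulation: scale each double by 10^6, accumulate with integer atomicAdd, and divide the total back afterwards. This assumes a device with 64-bit integer atomics (compute capability 1.2 or later), non-negative inputs, and a scaled sum that fits in 64 bits.

[codebox]
// Hypothetical fixed-point workaround: keep six decimal places while using
// only integer atomics. Assumes compute capability >= 1.2 (64-bit atomicAdd),
// non-negative inputs, and a scaled sum that fits in 64 bits.
#define SCALE 1000000.0 // 10^6 preserves six decimal places

__global__ void accumulateFixedPoint(const double *values,
                                     unsigned long long *scaledSum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(scaledSum, (unsigned long long)(values[i] * SCALE));
}

// Host side, after copying the result back: double sum = scaledSumHost / SCALE;
[/codebox]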

You can write your own atomic function using the other ones:

[codebox]
// Emulates atomicAdd for doubles with a 64-bit compare-and-swap loop.
__device__ inline void myAtomicAdd(double *address, double value) // See CUDA official forum
{
    unsigned long long oldval, newval, readback;

    oldval = __double_as_longlong(*address);
    newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    // Retry until no other thread modified *address between our read and the CAS.
    while ((readback = atomicCAS((unsigned long long *)address, oldval, newval)) != oldval)
    {
        oldval = readback;
        newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    }
}
[/codebox]

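As a minimal usage sketch (the kernel and variable names are illustrative, not part of the original post), each thread could fold its own contribution into a single global total like this:

[codebox]
// Illustrative only: every thread adds its contribution to one global total.
__global__ void accumulate(const double *contributions, double *total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        myAtomicAdd(total, contributions[i]);
}
[/codebox]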

Hi Magorath,

Thanks for the information.

How is the performance of this custom atomicAdd function? Which CUDA official forum thread (referenced in the code comment) is it from?

Why doesn’t NVIDIA build this function into the current cards?

Poor. A thread might have to make many attempts before the CAS operation succeeds. I have never been fully convinced that it is formally correct, either.

Because it is very complex to do in hardware. Atomic operations are performed in the memory controller and caches. Performing an atomic add on a 64-bit floating-point number requires building what is effectively a full 64-bit FPU into the memory controller. That is a lot of transistors for little real-world benefit.

My gut feeling is that if you think you need 64-bit atomic floating-point operations, you are probably using the wrong algorithmic approach.

Hi avidday, thanks for your reply.

The data we marshalled from our C# application are all of type double, so the simulation results calculated by each GPU thread are also of type double.

We need the atomicAdd function so that we can add the results from each thread into global memory. (The reduction algorithm is not suitable for our model.)

What would be the best solution for this scenario? Do we need to convert all the doubles to floats?

If integer/float atomic operations are our only choice for now, and we do the conversion by assignment from double to float/integer (so we can use the atomic functions), will the truncation cause much of a problem?

I am willing to bet that you can (and probably should) use a parallel reduction or prefix sum for this. You might believe that you need atomic operations, but my experience with what I expect are very comparable simulation applications tells me it is almost never the case.

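For reference, a minimal shared-memory block reduction of the kind being suggested might look like the sketch below (the blockSum kernel name and launch configuration are illustrative, not from the thread). Each block writes one partial sum, which can then be reduced again or summed on the host.

[codebox]
// Per-block sum reduction in shared memory. Assumes blockDim.x is a power of two.
__global__ void blockSum(const double *in, double *partial, int n)
{
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Load one element per thread (0.0 past the end of the array).
    sdata[tid] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}

// Example launch: blockSum<<<numBlocks, 256, 256 * sizeof(double)>>>(in, partial, n);
[/codebox]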