The float and float4 types in CUDA

How are float and float4 physically stored in CUDA?

And could a cudaMemcpyAsync change the value of a float or float4 it transfers?

I ask because my code uses many cudaMemcpyAsync calls to transfer float and float4 values between devices. After 900 iterations, a 4-device run produces a value that differs from the same variable in the 1-device run by 2 in the 10th decimal place. In the 1-device run no transfer is needed to compute this variable, but in the 4-device run each device must transfer a float produced by a cub::DeviceReduce::Max or cub::DeviceReduce::Sum.

Should I be concerned about this, considering I am using float and the difference is only in the 10th decimal place?

All platforms supported by CUDA use little-endian byte ordering and IEEE-754 (2008) floating-point types, with ‘float’ mapped to the IEEE-754 ‘binary32’ format.
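As a quick illustration, here is a minimal host-side sketch (the value and printout are just for demonstration) that dumps the bytes of a float. 1.0f is encoded in binary32 as 0x3F800000, and on a little-endian platform the least significant byte is stored first:

```cuda
#include <cstdio>
#include <cstring>
#include <cstdint>

int main() {
    float f = 1.0f;                   // binary32 encoding: 0x3F800000
    uint8_t bytes[sizeof(float)];
    std::memcpy(bytes, &f, sizeof(float));

    // Little-endian storage: prints "00 00 80 3f"
    for (unsigned i = 0; i < sizeof(float); ++i)
        std::printf("%02x ", bytes[i]);
    std::printf("\n");
    return 0;
}
```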

CUDA API copy operations copy data; they don’t modify it.
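If you want to convince yourself, a sketch along these lines (the buffer and test values are arbitrary) round-trips a float4 through device memory with cudaMemcpyAsync and compares the bits on return:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main() {
    float4 src = make_float4(1.0f, -0.0f, 3.14159f, 1e-30f);
    float4 dst;
    float4 *d_buf;

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_buf, sizeof(float4));

    // Host -> device -> host: the payload is moved byte for byte.
    cudaMemcpyAsync(d_buf, &src, sizeof(float4), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(&dst, d_buf, sizeof(float4), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // Bitwise comparison of all 16 bytes: prints "identical"
    std::printf("%s\n",
                std::memcmp(&src, &dst, sizeof(float4)) ? "differ" : "identical");

    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```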

You can see from the header files that float4 is simply a struct of four ‘float’ components named ‘x’, ‘y’, ‘z’, and ‘w’ (to which a 128-bit alignment restriction is applied), where ‘x’ resides at the lowest address and ‘w’ at the highest.
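This layout can be checked at compile time; a minimal sketch using static_assert:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Compile-time checks on the float4 layout from the CUDA headers:
// four packed 'float' members, 16 bytes total, 16-byte (128-bit)
// alignment, with 'x' at the lowest offset and 'w' at the highest.
static_assert(sizeof(float4)  == 16, "four packed binary32 components");
static_assert(alignof(float4) == 16, "128-bit alignment restriction");
static_assert(offsetof(float4, x) == 0,  "x at the lowest address");
static_assert(offsetof(float4, y) == 4,  "y");
static_assert(offsetof(float4, z) == 8,  "z");
static_assert(offsetof(float4, w) == 12, "w at the highest address");

int main() { return 0; }
```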

The description of the code is insufficient to determine whether anything goes wrong with it, and if so, where. The code may have bugs, e.g. invoking undefined behavior, or it may contain non-deterministic code paths, such as atomics operating on floating-point data (unlike their mathematical counterparts, floating-point operations are not associative). In particular, splitting a reduction such as cub::DeviceReduce::Sum across four devices changes the order in which values are combined, so a small difference in the final bits compared with a single-device run is to be expected.
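A contrived host-side example of that non-associativity; the three summands are chosen so the effect is visible at full magnitude:

```cuda
#include <cstdio>

int main() {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    // (a + b) cancels exactly, so adding c afterwards gives 1.0f.
    float left = (a + b) + c;

    // (b + c) rounds back to -1.0e8f: the spacing between adjacent
    // floats near 1e8 is 8, so c is absorbed; the result is 0.0f.
    float right = a + (b + c);

    std::printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
    return 0;
}
```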

You might want to check the device code with cuda-memcheck and the host code with valgrind.

‘float’ provides an accuracy of almost seven significant decimal digits. If you need higher accuracy, consider using the ‘double’ type, which provides an accuracy of about 16 significant decimal digits.
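As a quick sketch of the difference: neither type can represent the decimal fraction 0.1 exactly, but ‘double’ carries roughly nine more correct digits:

```cuda
#include <cstdio>

int main() {
    float  f = 0.1f;  // nearest binary32 to 0.1
    double d = 0.1;   // nearest binary64 to 0.1

    // float is accurate to ~7 significant decimal digits, double to ~16:
    std::printf("float : %.17f\n", f);  // 0.10000000149011612
    std::printf("double: %.17f\n", d);  // 0.10000000000000001
    return 0;
}
```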