volatile keyword for class members


In my CUDA kernels I have many objects of a class, say ‘Entity’, in global memory. One of the class members
needs to be written and read by threads of different blocks.
Simplified, it looks like this:

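class Entity
{
public:
    __inline__ __device__ void setData(float value)
    {
        // atomic write so that concurrent writers do not tear the value
        atomicExch(&_data, value);
    }

    __inline__ __device__ float getData()
    {
        // plain (non-atomic, non-volatile) read
        return _data;
    }

    float _data;
    //... and many other attributes
};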

I’ve observed that sometimes a thread gets a wrong value when calling getData().
I suspect that, despite the atomic operation in setData(), the value is somehow being cached.

What can I do here?

I tried using the volatile keyword here, but it doesn’t work.
As a workaround I implemented the following getter:

__inline__ __device__ float getData()
{
    // adding zero atomically returns the current value of _data
    return atomicAdd(&_data, 0.0f);
}

which worked but seems like a primitive hack…
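
For comparison, here is a minimal sketch of a getter that reads through a volatile-qualified pointer, so that the compiler has to emit a real memory load on every call instead of reusing an earlier value. It is only an assumption about how a volatile variant could be written, not necessarily what was tried above, and it is not verified to fix the cross-block visibility problem. The kernels writeKernel and readKernel and the host helper roundTrip are hypothetical names added for illustration.

```cuda
#include <cuda_runtime.h>

class Entity
{
public:
    __inline__ __device__ void setData(float value)
    {
        // atomic write, as in the original setter
        atomicExch(&_data, value);
    }

    __inline__ __device__ float getData() const
    {
        // Assumption: reading through a volatile-qualified pointer makes the
        // compiler issue an actual load from memory on each call.
        const volatile float* p = &_data;
        return *p;
    }

    float _data;
};

// Hypothetical helper kernels, used only to exercise the accessors.
__global__ void writeKernel(Entity* e, float value)
{
    e->setData(value);
}

__global__ void readKernel(const Entity* e, float* out)
{
    *out = e->getData();
}

// Hypothetical host helper: writes a value in one kernel launch and reads it
// back in a second launch on the same (default) stream.
float roundTrip(float value)
{
    Entity* e = nullptr;
    float* out = nullptr;
    cudaMalloc(&e, sizeof(Entity));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(e, 0, sizeof(Entity));

    writeKernel<<<1, 1>>>(e, value);
    readKernel<<<1, 1>>>(e, out);

    float host = 0.0f;
    // cudaMemcpy waits for the preceding kernels before copying
    cudaMemcpy(&host, out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(e);
    cudaFree(out);
    return host;
}
```

Calling roundTrip(42.0f) from host code only checks that the accessors work across two sequential launches. It does not reproduce the concurrent case with writers and readers in different blocks of the same kernel.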

Thanks for your help!