Hi,
I am building something in which multiple threads update float values/vectors at random locations in global device memory, a bit like a histogram problem. Several threads may need to update the same value/vector. Since each update only adds or subtracts something, the order in which the threads write to global memory does not matter; what matters is that every update is applied.
The array with float values/vectors to be updated is too large to store in shared memory.
How can I tackle this problem with compute capability 1.1?
atomicAdd cannot be used here, because the float version is only available from CC 2.0.
Building semaphores in shared memory that block a thread while another one is updating can only be done with atomicInc/atomicDec, which are only available for shared memory from CC 1.2.
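One avenue that might be worth trying is sketched below: emulating a float add with a compare-and-swap loop directly on global memory. It relies on atomicCAS for 32-bit words in global memory, which the programming guide lists for CC 1.1 and up. The names atomicAddFloatCAS, scatterAdd and scatterAddDemo are made up for this illustration and are not part of CUDA, and the idx/delta arrays are just an assumed layout for the updates. Treat it as a rough sketch, not a tuned implementation.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): emulates a float atomic add
// with a compare-and-swap loop on the 32-bit word that holds the float.
__device__ float atomicAddFloatCAS(float *address, float val)
{
    int *addrAsInt = (int *)address;
    int old = *addrAsInt;
    int assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) + val;
        // The swap only happens if the word still holds 'assumed'.
        old = atomicCAS(addrAsInt, assumed, __float_as_int(updated));
        // Compare the integer bit patterns, so a NaN cannot cause an endless loop.
    } while (old != assumed);
    return __int_as_float(old);
}

// Assumed layout: thread i adds delta[i] to bins[idx[i]].
__global__ void scatterAdd(float *bins, const int *idx, const float *delta, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAddFloatCAS(&bins[idx[i]], delta[i]);
}

// Host-side demo: n threads each add 1.0f to bin (i % numBins).
void scatterAddDemo(float *hostBins, int numBins, int n)
{
    std::vector<int> hIdx(n);
    std::vector<float> hDelta(n, 1.0f);
    for (int i = 0; i < n; ++i)
        hIdx[i] = i % numBins;

    float *dBins = 0;
    float *dDelta = 0;
    int *dIdx = 0;
    cudaError_t err;
    err = cudaMalloc(&dBins, numBins * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMalloc(&dIdx, n * sizeof(int));
    assert(err == cudaSuccess);
    err = cudaMalloc(&dDelta, n * sizeof(float));
    assert(err == cudaSuccess);

    // All-zero bytes are the bit pattern of 0.0f.
    err = cudaMemset(dBins, 0, numBins * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMemcpy(dIdx, &hIdx[0], n * sizeof(int), cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);
    err = cudaMemcpy(dDelta, &hDelta[0], n * sizeof(float), cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    scatterAdd<<<(n + 255) / 256, 256>>>(dBins, dIdx, dDelta, n);

    // This blocking copy also waits for the kernel to finish.
    err = cudaMemcpy(hostBins, dBins, numBins * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(dBins);
    cudaFree(dIdx);
    cudaFree(dDelta);
}
```

Calling scatterAddDemo(bins, 4, 1024) from main should leave every update applied, whatever order the threads ran in. The loop retries whenever another thread changed the word in between, so heavily contended locations will serialize and could be slow. If the value range is known in advance, converting to fixed point and using the integer atomicAdd on global memory might be a cheaper alternative.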
Does anyone have an idea how to proceed here?
Thanks