I have the following situation: a thread in one block writes a float2 whose .x component contains a magic word and whose .y component contains data I need to pass to another thread. A thread in another block loops, reading that float2 and checking whether .x has become equal to the magic value. The question is: does the .y component always contain the value that was written together with the magic word in .x? Does it work the same way for float4 types?
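For reference, a minimal sketch of the pattern in question (the magic value, kernel name and block assignment are purely illustrative; the polling side reads the components through a volatile pointer so the loop actually re-reads memory each iteration):

#define MAGIC 12345.0f   // illustrative magic word

__global__ void mailboxKernel(float2* mailbox, float payload, float* out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        // Writer block: publish magic word and payload as one float2 store.
        *mailbox = make_float2(MAGIC, payload);
    }
    else if (blockIdx.x == 1 && threadIdx.x == 0)
    {
        // Reader block: poll until the magic word appears, then use .y.
        volatile float2* m = (volatile float2*)mailbox;
        float x, y;
        do {
            x = m->x;   // magic word
            y = m->y;   // was this written together with x? (the question)
        } while (x != MAGIC);
        *out = y;
    }
}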
That sounds like a major deadlock waiting to happen: there is no guarantee when the other block will actually become active, because the device can only run a certain number of blocks at a time. The reading block(s) could well be scheduled first, and if there is no room left for the signaling block(s), you deadlock - the readers are waiting for a signal that can only come from a block that is itself waiting for them to finish and free up an execution slot so it can run.
That said, let's assume your algorithm doesn't care which block receives the message (any block could process it), only that the message arrives intact. In that case the deadlock can be avoided. Now, float2 and float4 accesses are not atomic by themselves, but they can be made effectively atomic with some clever staging in shared memory. The key is that although any given thread can only read/write a single 32-bit word atomically, each half-warp will issue a larger memory transaction if the accesses from its individual threads are coalesced. The exact requirements vary between compute capabilities, but at minimum, if each thread in the half-warp accesses a consecutive word in a 16-word aligned segment, the accesses will be coalesced (see the programming guide!). Using this, we can write code like this:
__shared__ float stagingBuffer[4];
float4 msg;
float* output;   // destination in global memory (16-byte aligned)
// calculate msg
if (threadIdx.x == 0)
{
    // One thread stages the four components in shared memory.
    stagingBuffer[0] = msg.x;
    stagingBuffer[1] = msg.y;
    stagingBuffer[2] = msg.z;
    stagingBuffer[3] = msg.w;
}
__syncthreads();
// Threads 0-3 write four consecutive words, which coalesces into a single transaction.
if (threadIdx.x < 4)
    output[threadIdx.x] = stagingBuffer[threadIdx.x];
I haven’t tested this, but I think it should work, assuming you’ve dealt with the deadlock issue.
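For completeness, the receiving side could mirror this staging in reverse (equally untested; 'input' here stands for the same global buffer the sender wrote through 'output'):

__shared__ float recvBuffer[4];
float4 msg;
float* input;   // same global buffer the sender wrote as 'output'

// Threads 0-3 read four consecutive words: one coalesced load.
if (threadIdx.x < 4)
    recvBuffer[threadIdx.x] = input[threadIdx.x];
__syncthreads();

// Every thread can now reassemble the message from shared memory.
msg.x = recvBuffer[0];
msg.y = recvBuffer[1];
msg.z = recvBuffer[2];
msg.w = recvBuffer[3];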
I could always schedule one block per multiprocessor by declaring maximum shared memory usage per block and launching as many blocks as I have multiprocessors in the system, couldn't I?
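Roughly like this (a sketch; signalKernel, launchOnePerSM and occupancyPad are placeholder names, and it assumes the per-block shared memory limit matches the per-multiprocessor total, as it does on the hardware of this generation):

// Placeholder kernel; the extern __shared__ buffer exists only to limit occupancy.
__global__ void signalKernel(float2* mailbox)
{
    extern __shared__ float occupancyPad[];
    // ... message passing work goes here ...
}

void launchOnePerSM(float2* mailbox)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Claim (almost) the whole per-block shared memory allowance so that no
    // second block can be co-resident on the same multiprocessor.
    size_t smem = prop.sharedMemPerBlock - 256;

    // One block per multiprocessor.
    signalKernel<<<prop.multiProcessorCount, 128, smem>>>(mailbox);
}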
“atomic” instructions are for writes from multiple threads to the same location. This is not the situation here.
I think you can write float2/float4 data as normal and then write a flag to signal the other threads to read. I believe that the flag will arrive after the float2/float4 data because they are written from the same thread.
There is no guarantee that writes to global memory will be immediately seen by other blocks; however, __threadfence() ensures the ordering. Therefore you can:
- store data
- __threadfence()
- store flag
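In code, that sequence might look like this (a sketch; the function and pointer names are illustrative, not an established API):

__device__ void sendValue(volatile float* data, volatile int* flag, float v)
{
    *data = v;          // 1. store the payload
    __threadfence();    // 2. make the payload visible device-wide before the flag
    *flag = 1;          // 3. store the flag last
}

__device__ float waitForValue(volatile float* data, volatile int* flag)
{
    while (*flag == 0)  // spin until the flag is seen (the deadlock caveat above applies)
        ;
    return *data;       // the fence ordered the payload write before the flag write
}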
That won't help against multiple writes, though, and I am not fully convinced that Keldor314's solution would work either.
I don't really care whether the write is delayed or not; I just want to know whether another thread can read a partially written float2.