I have 8 warps in a thread block; they all need to do a computation on the tensor cores and then store the result somewhere in shared memory. However, some warps may write to the same location. Is there an efficient scheduling algorithm that can avoid the conflict, instead of using atomicAdd?
Please give an example / show code of what you want to achieve.
Is it not possible to let all warps write to different locations?
For example, when computing a convolution over a few points with large in_channels and small out_channels (say 256 and 16), it is inevitable that two wmma operations contribute partial results to the same output location.
How about something similar to this:
// compute using 8 warps
// ...
int slot = ...; // up to you
if (warpId < 4) {
    smem[slot] = result;
}
__syncthreads();
if (warpId >= 4) {
    smem[slot] += result;
}
__syncthreads();
The exact slot and conditions obviously depend on which warps need to store to which location.
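To make the idea concrete, here is a minimal self-contained sketch of the two-phase pattern, assuming 8 warps where warp w and warp w+4 target the same shared-memory slot. The kernel name, TILE_ELEMS, and the fake per-warp result stand in for the actual wmma computation and tile layout, which depend on your problem:

```cuda
#include <cuda_runtime.h>

#define WARPS_PER_BLOCK 8
#define TILE_ELEMS 256  // elements each warp produces (assumption)

// Two-phase accumulation: first half of the warps initialize their
// slots, a barrier, then the second half accumulates into the same
// slots. No atomicAdd is needed because each slot has exactly one
// writer per phase.
__global__ void twoPhaseAccumulate(float *out)
{
    __shared__ float smem[(WARPS_PER_BLOCK / 2) * TILE_ELEMS];

    int warpId = threadIdx.x / 32;
    int lane   = threadIdx.x % 32;

    // ... tensor-core (wmma) computation would go here;
    // we substitute a dummy per-warp value for illustration.
    float result = (float)(warpId + 1);

    // Warps w and w+4 share the slot (w % 4).
    int slot = (warpId % 4) * TILE_ELEMS;

    // Phase 1: warps 0..3 write their tiles.
    if (warpId < 4)
        for (int i = lane; i < TILE_ELEMS; i += 32)
            smem[slot + i] = result;
    __syncthreads();

    // Phase 2: warps 4..7 accumulate into the same tiles.
    if (warpId >= 4)
        for (int i = lane; i < TILE_ELEMS; i += 32)
            smem[slot + i] += result;
    __syncthreads();

    // Copy the reduced tiles to global memory.
    for (int i = threadIdx.x; i < (WARPS_PER_BLOCK / 2) * TILE_ELEMS;
         i += blockDim.x)
        out[i] = smem[i];
}
```

Launched with 256 threads per block, slot s ends up holding (s+1) + (s+5), confirming both phases landed. The cost over a conflict-free layout is just the extra __syncthreads() between the phases, which is far cheaper than serializing the warps through atomics.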
Thanks! But actually, I think it is hard to make the slot correspond to a specific warp. So I would prefer a scheduler such as a token ring or something similar.
A scheduler like a token ring sounds a bit too complex and slow for your example.