I have a kernel that uses shared memory to exchange data between threads on every iteration. With only one warp per block, I can replace __syncthreads() with __threadfence_block() and the shared memory accesses are still performed correctly. However, my kernel also reads and writes global memory in the same loop, and __threadfence_block() also prevents optimization of the global accesses. I get about 60% device memory bandwidth utilization with aligned, uncached reads/writes (all data is read only once). So far I've considered and tried the following to improve this:
- Use volatile for the shared variables. However, this prevents optimizations in other parts of the kernel, for a net loss. It's also hard to work with when the data is float4.
- Switch to shfl instructions. The logic becomes too complex and requires too many ifs to figure out which shuffled value to read. Net loss.
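To make the float4 difficulty in the first bullet concrete, here is a minimal sketch of what a volatile variant can look like. float4 has no volatile-qualified copy operations, so the access has to go through the four float components, which turns one 128-bit shared access into four 32-bit ones. The helper names (store_volatile, load_volatile) and the rotate kernel are hypothetical and only illustrate the access pattern; the __syncwarp() call is an added assumption (CUDA 9 or later), since volatile by itself constrains the compiler but does not keep the threads of a warp in lockstep on Volta and newer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helpers: volatile access to a float4 goes component by component.
__device__ __forceinline__ void store_volatile(float4* p, float4 v) {
    volatile float* q = reinterpret_cast<volatile float*>(p);
    q[0] = v.x; q[1] = v.y; q[2] = v.z; q[3] = v.w;   // four 32-bit stores
}

__device__ __forceinline__ float4 load_volatile(const float4* p) {
    const volatile float* q = reinterpret_cast<const volatile float*>(p);
    return make_float4(q[0], q[1], q[2], q[3]);        // four 32-bit loads
}

// Hypothetical single-warp kernel: each lane reads its right neighbour's value.
__global__ void rotate(const float4* source, float4* destination) {
    __shared__ float4 message[32];
    const int lane = threadIdx.x;
    store_volatile(&message[lane], source[lane]);
    __syncwarp();  // assumption: needed on Volta and newer, volatile alone is not enough
    destination[lane] = load_volatile(&message[(lane + 1) % 32]);
}

int main() {
    float4 *source, *destination;
    cudaMallocManaged(&source, 32 * sizeof(float4));
    cudaMallocManaged(&destination, 32 * sizeof(float4));
    for (int i = 0; i < 32; i++) source[i] = make_float4(i, 0.f, 0.f, 0.f);
    rotate<<<1, 32>>>(source, destination);
    cudaDeviceSynchronize();
    bool ok = true;
    for (int i = 0; i < 32; i++) ok = ok && destination[i].x == (i + 1) % 32;
    printf("%s\n", ok ? "ok" : "mismatch");
    cudaFree(source);
    cudaFree(destination);
    return ok ? 0 : 1;
}
```

Confining volatile to the two helpers keeps the qualifier off the shared array declaration, so the rest of the kernel can still use plain float4 accesses where no ordering is needed.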
Is there a way to have __threadfence_block() apply only to shared memory accesses, the same way OpenCL's mem_fence can apply to local memory only? I'm OK with going down to SASS if needed. This is more of an exercise at this point.
Here is the example. The code has enough constants that it can be fully unrolled.
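for (int i = 0; i < count; i++)
{
    float4 in = source[idx];

    // fence before overwriting the messages from the previous iteration
    __threadfence_block();
    left_message[a] = in;
    right_message[b] = in;
    // fence so the messages are written before the neighbours read them
    __threadfence_block();

    float4 left = left_message[c];
    float4 right = right_message[d];
    float4 out;
    compute(in, left, right, out);
    destination[idx] = out;
}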
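For anyone who wants to experiment, the loop above can be wrapped into a self-contained test harness along these lines. This is a sketch under several assumptions, not the real kernel: the index pattern (a, b, c, d as ring neighbours within the warp), the body of compute(), and the constants kCount and kBlocks are all hypothetical placeholders. It uses __syncthreads(), the baseline the question starts from, so the two barrier calls are the lines to swap when comparing variants.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarp = 32;   // one warp per block, as in the question
constexpr int kCount = 4;   // hypothetical iteration count
constexpr int kBlocks = 2;  // hypothetical grid size

// Placeholder for compute(); the real function is not shown in the question.
__device__ void compute(const float4& in, const float4& left,
                        const float4& right, float4& out) {
    out = make_float4(in.x + left.x + right.x, in.y + left.y + right.y,
                      in.z + left.z + right.z, in.w + left.w + right.w);
}

__global__ void exchange(const float4* source, float4* destination) {
    __shared__ float4 left_message[kWarp];
    __shared__ float4 right_message[kWarp];
    const int lane = threadIdx.x;
    // Hypothetical indices: write own slot, read the ring neighbours' slots.
    const int a = lane, b = lane;
    const int c = (lane + kWarp - 1) % kWarp, d = (lane + 1) % kWarp;
    #pragma unroll
    for (int i = 0; i < kCount; i++) {
        const int idx = (blockIdx.x * kCount + i) * kWarp + lane;
        float4 in = source[idx];
        __syncthreads();  // reads from the previous iteration are finished
        left_message[a] = in;
        right_message[b] = in;
        __syncthreads();  // writes are visible to the whole block
        float4 left = left_message[c];
        float4 right = right_message[d];
        float4 out;
        compute(in, left, right, out);
        destination[idx] = out;
    }
}

int main() {
    const int n = kBlocks * kCount * kWarp;
    float4 *source, *destination;
    cudaMallocManaged(&source, n * sizeof(float4));
    cudaMallocManaged(&destination, n * sizeof(float4));
    for (int i = 0; i < n; i++) source[i] = make_float4(i, 0.f, 0.f, 0.f);
    exchange<<<kBlocks, kWarp>>>(source, destination);
    cudaDeviceSynchronize();
    bool ok = true;
    for (int i = 0; i < n; i++) {
        const int lane = i % kWarp, base = i - lane;
        const float expected = i + (base + (lane + kWarp - 1) % kWarp)
                                 + (base + (lane + 1) % kWarp);
        ok = ok && destination[i].x == expected;
    }
    printf("%s\n", ok ? "ok" : "mismatch");
    cudaFree(source);
    cudaFree(destination);
    return ok ? 0 : 1;
}
```

The host-side check only validates the placeholder arithmetic, which is enough to catch a variant that drops or reorders the shared memory exchange.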