Prevent optimization of shared memory access, but allow optimization of global access

MrrVlad · November 6, 2016, 8:20pm

Hi,

I have a kernel that uses shared memory to exchange data between threads on every iteration. With only one warp per block, I can replace __syncthreads() with __threadfence_block() and shared access is still performed correctly. However, my kernel also reads and writes global memory in the same loop, and __threadfence_block() prevents optimization of the global accesses as well. I get about 60% device memory bandwidth utilization with aligned, uncached reads/writes (all data is read only once). So far I’ve considered and tried the following to improve this:

Use volatile for the shared variables. However, this prevents optimizations in other parts of the kernel, for a net loss. It’s also hard to work with when the data is float4.
Switch to shfl instructions. The logic becomes too complex and requires too many ifs to figure out which shuffled value to read. Net loss.

Is there a way to have __threadfence_block() apply only to shared accesses, the same way the OpenCL mem_fence can apply to local memory only? I’m OK with going down to SASS if needed. This is more of an exercise at this point.

Here is the example. The code has enough constants that it can be fully unrolled.

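#pragma unroll
for (int i = 0; i < count; i++)
{
    update_indexes(a, b, c, d, idx);
    float4 in = source[idx];          // global read

    __threadfence_block();            // keep shared accesses from moving across this point
    left_message[a] = in;             // publish to neighbours via shared memory
    right_message[b] = in;
    __threadfence_block();

    float4 left = left_message[c];    // read what the neighbours published
    float4 right = right_message[d];
    float4 out;
    compute(in, left, right, out);
    destination[idx] = out;           // global write
}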

Robert_Crovella · November 6, 2016, 8:33pm

Leave your base shared pointer as is.

Declare another volatile pointer to the shared memory.

Use the volatile pointer when you want the un-optimization.

Use the other pointer when you don’t.

You’ll have to give some thought to the transition between the two domains.
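A minimal sketch of the two-pointer idea, under several assumptions that are not from the thread: the kernel name exchange, the host helper run_exchange, the ring-neighbour exchange pattern, and plain float data instead of float4 are all hypothetical. The __syncwarp() calls (CUDA 9 and later) are also an addition: on Volta and newer GPUs a warp is not guaranteed to run in lockstep, and volatile by itself only keeps the compiler from caching or eliding the access; it does not synchronize threads.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarp = 32;  // one warp per block, as in the question

// Hypothetical kernel: each lane publishes its input in shared memory and
// reads the value published by the lane to its left.
__global__ void exchange(const float* __restrict__ source,
                         float* __restrict__ destination, int count)
{
    __shared__ float message[kWarp];
    volatile float* vmessage = message;   // volatile alias of the same storage

    const int lane = threadIdx.x;
    const int left = (lane + kWarp - 1) % kWarp;

    // Non-volatile domain: ordinary shared access the compiler may optimize.
    message[lane] = 0.0f;
    __syncthreads();                      // transition between the two domains

    for (int i = 0; i < count; i++) {
        const int idx = i * kWarp + lane;
        const float in = source[idx];     // plain global load

        vmessage[lane] = in;              // volatile store: not elided or cached
        __syncwarp();                     // assumption: required on Volta and newer
        const float neighbor = vmessage[left];  // volatile load: re-read each time
        __syncwarp();                     // all reads done before the next store

        destination[idx] = in + neighbor; // plain global store
    }
}

// Hypothetical host helper: runs a single block on source[i] = i and
// returns the contents of destination.
std::vector<float> run_exchange(int count)
{
    const int n = count * kWarp;
    std::vector<float> host(n);
    for (int i = 0; i < n; i++) host[i] = static_cast<float>(i);

    float* src = nullptr;
    float* dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemcpy(src, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    exchange<<<1, kWarp>>>(src, dst, count);
    cudaMemcpy(host.data(), dst, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(src);
    cudaFree(dst);
    return host;
}
```

Only accesses made through vmessage are pinned down; the global loads and stores go through ordinary pointers. Whether this recovers any bandwidth compared with __threadfence_block() would have to be measured on the real kernel.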

MrrVlad · November 6, 2016, 9:39pm

Thanks, I’ll play with volatile a bit more. One other thing I may try is to cache the output data in registers for a few iterations, so global syncs are less frequent.
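The register-caching idea from the last post might look roughly like the sketch below. Everything in it is an assumption for illustration: the names exchange_batched and run_batched, the batch size kBatch, the ring-neighbour exchange, and the use of float rather than float4. It uses __syncwarp() (CUDA 9 and later) for the shared-memory exchange instead of the fences from the original loop.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarp  = 32;  // one warp per block
constexpr int kBatch = 4;   // hypothetical number of iterations buffered in registers

__global__ void exchange_batched(const float* __restrict__ source,
                                 float* __restrict__ destination, int batches)
{
    __shared__ float message[kWarp];
    const int lane = threadIdx.x;
    const int left = (lane + kWarp - 1) % kWarp;

    for (int b = 0; b < batches; b++) {
        float out[kBatch];  // can stay in registers once the loops are unrolled

        #pragma unroll
        for (int j = 0; j < kBatch; j++) {
            const int idx = (b * kBatch + j) * kWarp + lane;
            const float in = source[idx];
            message[lane] = in;
            __syncwarp();                 // make the stores visible to the warp
            const float neighbor = message[left];
            __syncwarp();                 // all reads done before the next store
            out[j] = in + neighbor;       // keep the result instead of writing it
        }

        // Write the whole batch back with no synchronization in between.
        #pragma unroll
        for (int j = 0; j < kBatch; j++)
            destination[(b * kBatch + j) * kWarp + lane] = out[j];
    }
}

// Hypothetical host helper: runs a single block on source[i] = i and
// returns the contents of destination.
std::vector<float> run_batched(int batches)
{
    const int n = batches * kBatch * kWarp;
    std::vector<float> host(n);
    for (int i = 0; i < n; i++) host[i] = static_cast<float>(i);

    float* src = nullptr;
    float* dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemcpy(src, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    exchange_batched<<<1, kWarp>>>(src, dst, batches);
    cudaMemcpy(host.data(), dst, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(src);
    cudaFree(dst);
    return host;
}
```

A larger kBatch means more registers per thread, so the batch size is a trade-off against occupancy and would need tuning for the actual kernel.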