I'd better start my question with an example, since it's easier to explain and to understand that way.
OK, so I have the following kernel:
__kernel void nlj_flags(__global const T *left, int left_len,
                        __global const T *right, int right_len,
                        __global int *flags) {
    unsigned int thread_id = get_global_id(0);
    if (thread_id < left_len) {
        int local_total = 0;
        flags[thread_id] = 0;
        int i;
        for (i = 0; i < right_len; i++) {
            if (left[thread_id] == right[i]) {
                local_total += 1;
            }
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
        flags[thread_id] = local_total;
    }
}
When profiling it, I noticed that
flags[thread_id] = local_total;
takes most of the time. At first I thought it could be related to un-coalesced writes to global memory, so I added a synchronization point. However, if I write some other variable instead of local_total, it doesn't take nearly as long. Is it because local_total needs to be flushed somehow before it can be read? Or is the problem really un-coalesced writes to global memory?
How did you find which line takes the most time? If you don't write local_total, the compiler will probably eliminate the code that computes it, which is why the kernel runs faster. It's not a register dependency either; that would be hidden by other work-groups anyway. Removing the barrier doesn't change anything about coalescing: writes happen per warp, so even if other warps haven't reached the write yet, all threads within each warp reach it at the same time and, in your case, perform a fully coalesced write.
PS: I would remove the initial write of 0 to flags ;)
Your reads of right are all coalesced, but all threads read the same single value at a time, so it's a big waste of bandwidth.
Try the code below; it should do what you need, and probably much faster. I haven't tested it, so it might not compile, but you can see the idea.
local_buff_size must be less than or equal to the work-group size.
#define local_buff_size 32

__kernel void nlj_flags(__global const T *left, int left_len,
                        __global const T *right, int right_len,
                        __global int *flags) {
    unsigned int thread_id = get_global_id(0);
    unsigned int local_id = get_local_id(0);
    __local T right_local[local_buff_size];
    int local_total = 0;
    int current = 0;
    // guard the read so extra threads past left_len don't go out of bounds
    T my_left = (thread_id < left_len) ? left[thread_id] : (T)0;
    while (current < right_len) {
        // cooperatively stage the next chunk of right into local memory
        int my_read = current + local_id;
        if (local_id < local_buff_size && my_read < right_len)
            right_local[local_id] = right[my_read];
        barrier(CLK_LOCAL_MEM_FENCE);
        if (thread_id < left_len) {
            int total_right = min(local_buff_size, right_len - current);
            for (int k = 0; k < total_right; k++) {
                if (right_local[k] == my_left)
                    local_total++;
            }
        }
        // wait until everyone is done reading before the next chunk overwrites the buffer
        barrier(CLK_LOCAL_MEM_FENCE);
        current += local_buff_size;
    }
    if (thread_id < left_len)
        flags[thread_id] = local_total;
}