Register read/write takes too long

Hello all!

I better start my question with an example, as it is easier to explain and to understand.

Ok, so I have the following kernel:

__kernel void nlj_flags(__global const T *left, int left_len, __global const T *right, int right_len, __global int *flags) {

	unsigned int thread_id = get_global_id(0);

	

	if(thread_id < left_len) {

	

		int local_total = 0;

		

		flags[thread_id] = 0;

		

		int i;

		for(i = 0; i < right_len; i++) {

			if(left[thread_id] == right[i]) {

				local_total += 1;

			}

		}

		

		barrier(CLK_GLOBAL_MEM_FENCE);

		

		flags[thread_id] = local_total;

	}

}

When profiling it, I’ve noticed that

flags[thread_id] = local_total;

takes most of the time. At first I thought it could be related to un-coalesced writes to global memory, so I’ve added a synchronization point. However, if instead of writing the local_total variable I write some other variable, it doesn’t take that long. Is it because local_total needs to be flushed or something before can be read? Or the problem is it really the problem of un-coalesced writes to global memory?

How did you find which line takes the most time? if you dont write local_total than your code probably will not calculate it and this is why it will work faster, it is not register dependency which actually is hiden by other work groups. Removing barrier dont change any thing in coalescing, writes are per warp, so even if other warps didnt reached write all threads from each warp reach it in the same time and perform fully coalesced write in your case.

ps. i would remove initial write to flags with 0;)

How did you find which line takes the most time? if you dont write local_total than your code probably will not calculate it and this is why it will work faster, it is not register dependency which actually is hiden by other work groups. Removing barrier dont change any thing in coalescing, writes are per warp, so even if other warps didnt reached write all threads from each warp reach it in the same time and perform fully coalesced write in your case.

ps. i would remove initial write to flags with 0;)

if you are sure its that step, check again if you are spilling over into local memory.

That usually happens when u try to dynamically index arrays on registers…

How can I check that?

use the --ptxas-options=v while compiling, that prints out the kernel specs…

or u can also look for global load/store ops in the profiler that would also give you a fair idea whether you spilling over into local mem

This makes sense. The compiler is hopefully smart enough to realize this.

Taking in consideration the reading pattern to the right buffer, I’m probably getting coalesced reads some of the time.

Any ideas to improve this?

your reads of right are all coalesced, but you always read only one value for all threads, so it is a big waste of bandwidth.

try this code, should do what you need, and probably much faster, i didnt checked this code, so it might not compile, but you can see my idea.

local_buff_size must be less or equal to work group size.

#define local_buff_size 32

__kernel void nlj_flags(__global const T *left, int left_len, __global const T *right, int right_len, __global int *flags) {

	unsigned int thread_id = get_global_id(0);

	unsigned int local_id = get_local_id(0);

	__local T right_local[local_buff_size];

	int local_total = 0;

	int current = 0;

	T my_left = left[thread_id];

	while (current < right_len)

	{

		int my_read = current + local_id;

		if (local_id < local_buff_size)

			right_local[local_id] = right[my_read];

		barrier(CLK_LOCAL_MEM_FENCE);

		if (thread_id < left_len)

		{

			int total_right = min(local_buff_size, right_len - current);

			for (unsigned k = 0; k < total_right; k++)

			{

				if (right_local[k] == my_left)

					total_local++;

			}

		}

		current += local_buff_size;

	}

	if (thread_id < left_len)

		flags[thread_id] = local_total;

}