Problem with local memory access

I’m currently working on a project using OpenCL on a Tesla C1060 (driver version 195.17). However I’m getting some strange behaviour I can’t really explain. Here is the code which puzzles me (reduced for clarity and testing purpose):

[codebox]kernel void TestKernel(global const int* groupOffsets, global float* result,

                   local        int* tmpData,             const int    itemcount)

{

unsigned int groupid = get_group_id(0);

unsigned int globalsize = get_global_size(0);

unsigned int groupcount = get_num_groups(0);

for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)

{

  barrier(CLK_LOCAL_MEM_FENCE);

  if(get_local_id(0) == 0)

     tmpData[0] = groupOffsets[groupid]; 

  barrier(CLK_LOCAL_MEM_FENCE);

  int offset = tmpData[0];

  result[id]   = (float) offset;

}

}[/codebox]

This code should load the offset for each workgroup into local memory and then read it back and write it into the corresponding outputvector entry. For most workitems this is working, but for each workgroup the workitems with local ids 1 to 31 read an incorrect value. My output vector (for workgroupsize=128) is as following:

[codebox]

index 0: 0

index 1- 31: 470400

index 32-127: 0

index 128: 640

index 129-159: 471040

index 160-255: 640

index 256: 1280

index 257-287: 471680

index 288-511: 1280

[/codebox]

the output i expected would be

[codebox]

index 0-127: 0

index 128-255: 640

index 256-511: 1280

[/codebox]

Strange thing is: the problem only occurs when I use less then itemcount workitems (so it works as expected when globalsize>=itemcount, meaning that every workitem processes only one entry). So I’m guessing it has something to do with the loop. Does anyone know what I’m doing wrong and how to fix it?

Just a wild guess, but is it guaranteed that the loop will be repeated for the same number of times for all work-items? You are using barriers inside the loop and a barrier waits for all work-items in a work-group. So in your case it basically means that itemcount must be equal to k*globalsize (where k is 1,2,3… ). Is this condition satisfied?

Good point. It actually wasn’t guaranteed that this condition was satisfied (gotta fix this), however so far it doesn’t make any difference either (itemcount is somewhere around 100000, so if that was the problem I wouldn’t expect to see that kind of problem for the first items anyway).

however I found out that the kernel will output the correct values if I change the condition to

[codebox]

if(get_local_id(0) < 32)

tmpData[0] = groupOffsets[groupid];

[/codebox]

However while this seems to fix the problem, this solution feels kind of ugly (and would be less efficient when run on 8xxx class hardware), so the question remains.

Why do you use a loop? Hasn’t “itemcount” the same value as “globalsize”?

Hence your loop is executed once (with initial value → id = get_global_id(0)) and then it breaks up.

Here’s how i would solve your problem:

__kernel void Offset(__global int* result,

					 __global const int* groupOffsets,

					 __local int* tmpData)

{

	int group_id = get_group_id(0);

	int global_id = get_global_id(0);

	//copy group offsets from global to local memory

	tmpData[group_id] = groupOffsets[group_id];

	//sync

	barrier(CLK_LOCAL_MEM_FENCE);

	//copy data from local to global result memory

	result[global_id] = tmpData[group_id];

}

The results i getting are as you would expect them.

No itemcount has a different value then globalsize, thats the whole point of what I’m trying to do right now.
As I have stated in my first post, for itemcount==globalsize I’m getting the results I expected (the first version I had was somewhat like the one you posted, so right now I’m trying to move away from that), however I’m trying to make this working for itemcount<globalsize, to remove overhead (in the actual code there are some more operations which need only be applied once per workitem instead of once per element).

Besides even more then solving this (as stated there are several options of fixing it, they are just all ugly (having itemcount==globalworksize, copying groupoffsets from more then one workitem), I’m hoping to understand why this doesn’t work.