I’m currently working on a project using OpenCL on a Tesla C1060 (driver version 195.17). However I’m getting some strange behaviour I can’t really explain. Here is the code which puzzles me (reduced for clarity and testing purpose):
[codebox]kernel void TestKernel(global const int* groupOffsets, global float* result,
local int* tmpData, const int itemcount)
{
unsigned int groupid = get_group_id(0);
unsigned int globalsize = get_global_size(0);
unsigned int groupcount = get_num_groups(0);
for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)
This code should load the offset for each workgroup into local memory and then read it back and write it into the corresponding outputvector entry. For most workitems this is working, but for each workgroup the workitems with local ids 1 to 31 read an incorrect value. My output vector (for workgroupsize=128) is as following:
[codebox]
index 0: 0
index 1- 31: 470400
index 32-127: 0
index 128: 640
index 129-159: 471040
index 160-255: 640
index 256: 1280
index 257-287: 471680
index 288-511: 1280
…
[/codebox]
the output i expected would be
[codebox]
index 0-127: 0
index 128-255: 640
index 256-511: 1280
…
[/codebox]
Strange thing is: the problem only occurs when I use less then itemcount workitems (so it works as expected when globalsize>=itemcount, meaning that every workitem processes only one entry). So I’m guessing it has something to do with the loop. Does anyone know what I’m doing wrong and how to fix it?
Just a wild guess, but is it guaranteed that the loop will be repeated for the same number of times for all work-items? You are using barriers inside the loop and a barrier waits for all work-items in a work-group. So in your case it basically means that itemcount must be equal to k*globalsize (where k is 1,2,3… ). Is this condition satisfied?
Good point. It actually wasn’t guaranteed that this condition was satisfied (gotta fix this), however so far it doesn’t make any difference either (itemcount is somewhere around 100000, so if that was the problem I wouldn’t expect to see that kind of problem for the first items anyway).
however I found out that the kernel will output the correct values if I change the condition to
[codebox]
if(get_local_id(0) < 32)
tmpData[0] = groupOffsets[groupid];
[/codebox]
However while this seems to fix the problem, this solution feels kind of ugly (and would be less efficient when run on 8xxx class hardware), so the question remains.
Why do you use a loop? Hasn’t “itemcount” the same value as “globalsize”?
Hence your loop is executed once (with initial value → id = get_global_id(0)) and then it breaks up.
Here’s how i would solve your problem:
__kernel void Offset(__global int* result,
__global const int* groupOffsets,
__local int* tmpData)
{
int group_id = get_group_id(0);
int global_id = get_global_id(0);
//copy group offsets from global to local memory
tmpData[group_id] = groupOffsets[group_id];
//sync
barrier(CLK_LOCAL_MEM_FENCE);
//copy data from local to global result memory
result[global_id] = tmpData[group_id];
}
The results i getting are as you would expect them.
No itemcount has a different value then globalsize, thats the whole point of what I’m trying to do right now.
As I have stated in my first post, for itemcount==globalsize I’m getting the results I expected (the first version I had was somewhat like the one you posted, so right now I’m trying to move away from that), however I’m trying to make this working for itemcount<globalsize, to remove overhead (in the actual code there are some more operations which need only be applied once per workitem instead of once per element).
Besides even more then solving this (as stated there are several options of fixing it, they are just all ugly (having itemcount==globalworksize, copying groupoffsets from more then one workitem), I’m hoping to understand why this doesn’t work.