CUDA=>OpenCL : work-groups questions

I am having trouble translating the following CUDA code to OpenCL:

int tasksCount = 489984;
int desiredWarps = (tasksCount + 31) / 32;
Vec2i blockSize(32, 4);
int blockWarps = (blockSize.x * blockSize.y + 31) / 32;
Vec2i gridSize((desiredWarps + blockWarps - 1) / blockWarps, 1);
return module->launchKernelTimed(kernel, blockSize, gridSize);

Here is my OpenCL version of the host code:

//int tasksCount = 489984;
int blockWidth = 32; // One warp per row.
int blockHeight = 4; // 4*32 = 128 threads, optimal for GTX480

localWork = cl::NDRange(blockWidth, blockHeight);
int globalWidth = tasksCount / blockHeight;
if ( globalWidth % blockWidth != 0 )
	globalWidth = (globalWidth / blockWidth + 1) * blockWidth;
globalWork = cl::NDRange(globalWidth, blockHeight);

At the kernel level, I translated this:

rayidx = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * (blockIdx.x + gridDim.x * blockIdx.y));

into

rayidx  = get_global_id(1) * get_global_size(0) + get_global_id(0);

But honestly I'm not convinced that it is correct. Do you have any advice?

Thanks

Your local size looks good, but I think your globalWork dimensions are off.
Pseudo:

globalWork = 0
while (globalWork < tasks)
    globalWork += localWork

Make sure globalWork is a multiple of localWork in all dimensions, and that all dimensions of both obey the limits of your device.
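
The loop above can also be written in closed form. The following is only a minimal sketch in plain C++ (no OpenCL headers needed); the names roundUp and globalWidthFor are hypothetical helpers, not part of any API. It assumes the layout from the question, where the global height equals the local height so that each work-group covers blockWidth * blockHeight tasks. One thing it tries to avoid is the truncation in globalWidth = tasksCount / blockHeight: with tasksCount = 129 and a 32x4 group, that division gives 32, so only 128 work-items would be launched.

```cpp
#include <cassert>
#include <cstddef>

// Round `count` up to the next multiple of `multiple`.
std::size_t roundUp(std::size_t count, std::size_t multiple) {
    assert(multiple > 0);
    return ((count + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: global width (dimension 0) for a 2D launch whose
// global height equals localHeight. The number of work-groups is rounded
// up first, so no task is lost to integer division.
std::size_t globalWidthFor(std::size_t tasks,
                           std::size_t localWidth,
                           std::size_t localHeight) {
    assert(localWidth > 0 && localHeight > 0);
    std::size_t groupSize = localWidth * localHeight;
    std::size_t groups = (tasks + groupSize - 1) / groupSize;
    return groups * localWidth;
}
```

With tasksCount = 489984, blockWidth = 32 and blockHeight = 4, globalWidthFor returns 122496, which matches the 3828 blocks of the CUDA launch. When the task count is not a multiple of 128, the global size is padded, so the kernel would presumably need a guard such as if (rayidx >= tasksCount) return; to skip the extra work-items.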

For the limits, see:

CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS
CL_DEVICE_MAX_WORK_ITEM_SIZES

Quick snippet:
maxWorkGroupSize = getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();
maxWorkItemDimensions = getInfo<CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS>();
// resize (not reserve) so the elements exist before clGetDeviceInfo writes into them
maxWorkItemSizes.resize(maxWorkItemDimensions, 0);
clGetDeviceInfo((*this)(), CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(size_t) * maxWorkItemDimensions, &maxWorkItemSizes[0], NULL);
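
On the kernel index question: the usual correspondence is threadIdx -> get_local_id, blockDim -> get_local_size, blockIdx -> get_group_id and gridDim -> get_num_groups. The sketch below models both index formulas on the host in plain C++ so they can be compared; the WorkItem struct and the function names are hypothetical, and a zero global offset is assumed. Both formulas visit every index exactly once when the global size is exact, but they assign rays to work-items differently: the direct translation gives each work-group 128 consecutive rays, while the row-major form from the question spreads the four rows of a group across rays that are get_global_size(0) apart.

```cpp
#include <cassert>
#include <cstddef>

// Host-side model of one work-item's coordinates in a 2D NDRange
// (hypothetical type, zero global offset assumed).
struct WorkItem {
    std::size_t localId[2];    // get_local_id(d)   / threadIdx
    std::size_t groupId[2];    // get_group_id(d)   / blockIdx
    std::size_t localSize[2];  // get_local_size(d) / blockDim
    std::size_t numGroups[2];  // get_num_groups(d) / gridDim
};

// Direct translation of the CUDA expression from the question.
std::size_t blockLinearIndex(const WorkItem& w) {
    std::size_t group = w.groupId[0] + w.numGroups[0] * w.groupId[1];
    return w.localId[0]
         + w.localSize[0] * (w.localId[1] + w.localSize[1] * group);
}

// The form used in the question:
// get_global_id(1) * get_global_size(0) + get_global_id(0).
std::size_t rowMajorIndex(const WorkItem& w) {
    std::size_t gx = w.groupId[0] * w.localSize[0] + w.localId[0];
    std::size_t gy = w.groupId[1] * w.localSize[1] + w.localId[1];
    std::size_t globalSize0 = w.numGroups[0] * w.localSize[0];
    return gy * globalSize0 + gx;
}
```

If the ray order matters (for example, rays sorted so that neighbours are coherent), the direct translation is probably the safer choice, since it reproduces the CUDA assignment exactly.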