Is CL_DEVICE_LOCAL_MEM_SIZE per compute unit? (newbie question)

I hope this is the right forum to post a newbie question.

I’m using an Nvidia ION platform and I get the following:

CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_COMPUTE_UNITS: 2

After compiling the kernel successfully, I run:

err = clGetKernelWorkGroupInfo(ckKernel[0], cdDevice, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(size_t), &szMaxLocalWorkSize, NULL);

and it returns szMaxLocalWorkSize = 512 work-items.

Since each work-item of my kernel requires 49 bytes of local memory, 16 KB / 49 bytes gives 334 work-items. Based on that I launched 334 work-items, and that worked really well.
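For reference, here is roughly how I derive the 334 figure in code (a sketch assuming the same cdDevice / ckKernel[0] setup as the snippet above; the 49-byte count is specific to my kernel):

/* Sketch: compute the largest work-group that fits in local memory. */
cl_ulong localMemSize = 0;    /* CL_DEVICE_LOCAL_MEM_SIZE -> 16384 on ION */
size_t   maxWorkGroup = 0;    /* CL_KERNEL_WORK_GROUP_SIZE -> 512 here    */

clGetDeviceInfo(cdDevice, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(localMemSize), &localMemSize, NULL);
clGetKernelWorkGroupInfo(ckKernel[0], cdDevice, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxWorkGroup), &maxWorkGroup, NULL);

size_t byLocalMem = (size_t)(localMemSize / 49);  /* 16384 / 49 = 334 */
size_t szLocalWorkSize = (byLocalMem < maxWorkGroup) ? byLocalMem : maxWorkGroup;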

However, since the device has 2 compute units, each (I assumed) with its own 16 KB of local memory and 512 work-items, I figured I should be able to launch 2 sets of 334 work-items. When I did that, my code bombed.

I wonder … is there something wrong with my thinking? When it says CL_DEVICE_LOCAL_MEM_SIZE is 16 KB, does that mean 16 KB of local memory for each compute unit, or 16 KB combined across all compute units? Can I have 2 different work-groups running, one on each compute unit? (i.e., do the 2 compute units have to be instruction-synchronized?)

In my host code I have 2 threads, one to control each of the 2 compute units. Each thread has its own kernel object (compiled from the same code), its own queue, and its own global buffers. I can't find anything in the API for fine-grained control over which kernel runs on which compute unit.

Thanks
Bill Huang

By "compute unit" I think you mean what is usually called a multiprocessor ("MP").

Each multiprocessor has its own 16 KB of "shared" memory (usually used for passing data between the threads of a block).

This 16 KB is split up between the blocks resident on the MP, so if your kernel requires 5 KB of shared memory per block, then at most 3 such blocks can run on that MP at once (16 KB / 5 KB). NB: threads in a block can NOT see the shared memory belonging to a different block.
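For example (a toy CUDA kernel of my own, not your code), statically allocating 5 KB of shared memory per block caps how many blocks one MP can hold:

// Each block statically allocates 5 KB of shared memory, so at most
// 3 such blocks (15 KB) fit in one MP's 16 KB at a time.
// Assumes blockDim.x <= 1280.
__global__ void copyThroughShared(const float *in, float *out, int n)
{
    __shared__ float buf[1280];           // 1280 * 4 bytes = 5 KB per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[threadIdx.x] = in[i];  // only this block's threads see buf
    __syncthreads();                      // every thread in the block reaches this
    if (i < n) out[i] = buf[threadIdx.x];
}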

A block can have from 1 to 512 threads (which matches your 512 work-items per work-group).

Since your code bombed, I would first check that you haven't gone out of bounds on an array.

Also, always call cudaGetLastError() after you launch a kernel so you know what the error is.

Details are in the "Error Handling" section of the CUDA C Programming Guide.
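A minimal check (my own sketch following that section; cudaThreadSynchronize() is the call from this CUDA generation) looks like:

/* After any kernel launch (needs <stdio.h>): */
cudaError_t err = cudaGetLastError();  // reports launch-configuration errors
if (err == cudaSuccess)
    err = cudaThreadSynchronize();     // waits, reporting errors from the kernel itself
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));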

In a typical kernel call, e.g.

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

blocksPerGrid might be in the thousands; the GPU you have will handle that even though it only has 2 MPs.
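The usual round-up idiom for sizing the grid (standard CUDA practice, reusing the names from the VecAdd line above):

int threadsPerBlock = 256;                                          // up to 512 on your GPU
int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);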

You don't need a separate host thread per MP (compute unit); the GPU itself handles assigning blocks to multiprocessors.
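Translated back to your OpenCL code: one queue and a single enqueue is enough, and the runtime spreads the work-groups over the compute units. A sketch (cqQueue is a placeholder for your queue; ckKernel[0] and the 334 figure are from your post):

/* One NDRange launch: 2 work-groups x 334 work-items = 668 total.
   The OpenCL runtime assigns the work-groups to compute units itself;
   no per-compute-unit host thread or queue is needed. */
size_t szLocalWorkSize  = 334;                 /* fits the 16 KB local memory */
size_t szGlobalWorkSize = 2 * szLocalWorkSize; /* 668 work-items in 2 groups  */

err = clEnqueueNDRangeKernel(cqQueue, ckKernel[0], 1, NULL,
                             &szGlobalWorkSize, &szLocalWorkSize,
                             0, NULL, NULL);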

All the Best

kbam