Register and local memory problems: is my 3D optical flow algorithm too large ?

Hello !

I am working on an optical flow algorithm. I developed a working 2D version and now I am having trouble with the 3D version… I work with blocks of 8x8x8 pixels since I have CL_DEVICE_MAX_WORK_GROUP_SIZE: 512.

My kernel is skipped without any warning or error when I try to allocate too much local memory or when the code requires too many registers.

  1. Local memory:

I have CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte. I suppose this means my local memory can hold at most 4 blocks of floats (4 bytes each ?) with 8+2 pixels per edge ? (8 is the local size, +2 for the overlapping edges.)

4 * 4 * 10 * 10 * 10 = 16 000 bytes: OK ?
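To make the arithmetic above easy to re-check for other tile sizes, here is a minimal sketch in plain C. The function names are made up for illustration, and it assumes the whole CL_DEVICE_LOCAL_MEM_SIZE is available to the kernel, which may not be exactly true on every device.

```c
#include <assert.h>
#include <stddef.h>

/* An OpenCL float is 32 bits wide. */
#define BYTES_PER_FLOAT 4u

/* Bytes of local memory needed for `arrays` float arrays, each covering a
 * cubic tile of `tile` work-items per edge plus a `halo`-wide border on
 * every side. Names are illustrative, not taken from the original code. */
size_t local_bytes_needed(size_t arrays, size_t tile, size_t halo)
{
    assert(tile > 0);
    size_t edge = tile + 2 * halo;            /* e.g. 8 + 2*1 = 10 */
    return arrays * BYTES_PER_FLOAT * edge * edge * edge;
}

/* Returns 1 if the request fits in the reported local memory size. */
int fits_in_local(size_t needed, size_t local_mem_size)
{
    return needed <= local_mem_size;
}
```

For 4 arrays, an 8-wide tile and a 1-pixel halo this gives 16 000 bytes, which is below 16 * 1024 = 16 384; a fifth array (20 000 bytes) would no longer fit.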

  2. Number of registers

This is my maximum number of registers per block:

CL_DEVICE_REGISTERS_PER_BLOCK_NV: 8192

Does it mean: number_of_register * number_of_thread_per_block <= 8192 ?

In this case I can use at most 16 registers per thread (8192 / 512), which is really limited.

I guess this limit applies per kernel and not to the program as a whole?

If so, one solution would be to split the kernels into smaller ones. Unfortunately that is hardly possible with my code.

If I compile with the option “-cl-nv-maxrregcount=16” then the results are not right (in some cases I get undefined numbers), probably because I go below the minimum number of registers this particular algorithm needs. http://forums.nvidia.com/index.php?showtopic=193492

With no restriction (no -cl-nv-maxrregcount) the algorithm requires 23 registers.
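The register budget can be worked out in both directions. This is a rough sketch in plain C with made-up function names; it assumes the simple rule registers_per_thread * threads_per_block <= registers_per_block, and the hardware may round allocations up, so the real limit can be lower.

```c
#include <assert.h>

/* Largest per-work-item register count that still lets a work-group of
 * `group_size` items fit in `regs_per_block` registers. */
unsigned max_regs_per_item(unsigned regs_per_block, unsigned group_size)
{
    assert(group_size > 0);
    return regs_per_block / group_size;
}

/* Largest work-group size for a kernel that uses `regs_per_item`
 * registers, under the same simple product rule. */
unsigned max_group_size(unsigned regs_per_block, unsigned regs_per_item)
{
    assert(regs_per_item > 0);
    return regs_per_block / regs_per_item;
}
```

With 8192 registers, a group of 512 leaves 16 registers per work-item, while a kernel needing 23 registers allows at most 356 work-items under this rule. A 7x7x7 block (343) or an 8x8x4 block (256) would stay under that, at the cost of a different tile shape. For the limit the driver actually applies to a given kernel, clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE can be queried, and checking the return code of clEnqueueNDRangeKernel should show whether a launch was rejected.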

So what can I do really ?

One thing that requires some registers is the initialization of the local memory to 0.

For now I have this code :

// Flatten (i, j, k) into a linear index for a cube with `size` cells per edge.
inline int idx(int i, int j, int k, int size)
{
    return ((k*size*size)+(i*size)+j);
}

inline void initP(__local float* pLocal, int li, int lj, int lk, int lSize)
{
    // Zero the 3 components of this work-item's own cell.
    pLocal[3*idx(li,lj,lk,lSize)+0] =
    pLocal[3*idx(li,lj,lk,lSize)+1] =
    pLocal[3*idx(li,lj,lk,lSize)+2] = 0;

    //if(li-1 == 0 || lj-1 == 0 || lk-1 == 0 || li+1 == lSize-1 || lj+1 == lSize-1 || lk+1 == lSize-1)
    //{
    // Zero the 6 face neighbours.
    pLocal[3*idx(li-1,lj,lk,lSize)+0] = pLocal[3*idx(li-1,lj,lk,lSize)+1] = pLocal[3*idx(li-1,lj,lk,lSize)+2] = 0;
    pLocal[3*idx(li+1,lj,lk,lSize)+0] = pLocal[3*idx(li+1,lj,lk,lSize)+1] = pLocal[3*idx(li+1,lj,lk,lSize)+2] = 0;
    pLocal[3*idx(li,lj-1,lk,lSize)+0] = pLocal[3*idx(li,lj-1,lk,lSize)+1] = pLocal[3*idx(li,lj-1,lk,lSize)+2] = 0;
    pLocal[3*idx(li,lj+1,lk,lSize)+0] = pLocal[3*idx(li,lj+1,lk,lSize)+1] = pLocal[3*idx(li,lj+1,lk,lSize)+2] = 0;
    pLocal[3*idx(li,lj,lk-1,lSize)+0] = pLocal[3*idx(li,lj,lk-1,lSize)+1] = pLocal[3*idx(li,lj,lk-1,lSize)+2] = 0;
    pLocal[3*idx(li,lj,lk+1,lSize)+0] = pLocal[3*idx(li,lj,lk+1,lSize)+1] = pLocal[3*idx(li,lj,lk+1,lSize)+2] = 0;

    // Zero 6 of the 12 edge (diagonal) neighbours.
    pLocal[3*idx(li+1,lj-1,lk,lSize)+0] = pLocal[3*idx(li+1,lj-1,lk,lSize)+1] = pLocal[3*idx(li+1,lj-1,lk,lSize)+2] = 0;
    pLocal[3*idx(li-1,lj+1,lk,lSize)+0] = pLocal[3*idx(li-1,lj+1,lk,lSize)+1] = pLocal[3*idx(li-1,lj+1,lk,lSize)+2] = 0;
    pLocal[3*idx(li+1,lj,lk-1,lSize)+0] = pLocal[3*idx(li+1,lj,lk-1,lSize)+1] = pLocal[3*idx(li+1,lj,lk-1,lSize)+2] = 0;
    pLocal[3*idx(li-1,lj,lk+1,lSize)+0] = pLocal[3*idx(li-1,lj,lk+1,lSize)+1] = pLocal[3*idx(li-1,lj,lk+1,lSize)+2] = 0;
    pLocal[3*idx(li,lj-1,lk+1,lSize)+0] = pLocal[3*idx(li,lj-1,lk+1,lSize)+1] = pLocal[3*idx(li,lj-1,lk+1,lSize)+2] = 0;
    pLocal[3*idx(li,lj+1,lk-1,lSize)+0] = pLocal[3*idx(li,lj+1,lk-1,lSize)+1] = pLocal[3*idx(li,lj+1,lk-1,lSize)+2] = 0;
    //}
}

Would the commented-out if statement change anything ? I could also guard each line with its own if() statement; would that change anything (other than increasing the number of registers) ?

Is there a way to initialize this local mem to 0 automatically ?
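As far as I know, OpenCL C does not zero local memory automatically, so it has to be cleared by the work-items themselves. One common alternative to clearing neighbour by neighbour is a strided loop over the flat buffer. The sketch below is plain C that emulates the idea on the host; zero_strided and zero_as_group are hypothetical names. In a kernel, buf would be a __local float pointer, lid the linearised local id, and group_size the number of work-items in the group.

```c
#include <assert.h>
#include <stddef.h>

/* The slice of the clear that ONE work-item performs: it zeroes elements
 * lid, lid + group_size, lid + 2*group_size, ... of a buffer of n floats. */
void zero_strided(float *buf, size_t n, size_t lid, size_t group_size)
{
    assert(group_size > 0 && lid < group_size);
    for (size_t i = lid; i < n; i += group_size)
        buf[i] = 0.0f;
}

/* Host-side stand-in for a whole work-group: runs every work-item's slice
 * in turn. On the device the slices run in parallel and would need a
 * barrier(CLK_LOCAL_MEM_FENCE) before any work-item reads the buffer. */
void zero_as_group(float *buf, size_t n, size_t group_size)
{
    for (size_t lid = 0; lid < group_size; ++lid)
        zero_strided(buf, n, lid, group_size);
}
```

For a 10x10x10 tile with 3 floats per cell (3000 floats) and 512 work-items, each work-item writes at most 6 elements, compared with 39 stores per work-item in initP, and needs only a single loop index. It also touches every element, including corner cells that the neighbour-by-neighbour version does not appear to reach.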

  3. Build log

I would also like to know what the information returned by clGetProgramBuildInfo(cpProgram, device, CL_PROGRAM_BUILD_LOG, 4096, logTxt, NULL); means.

I got something like this:

Build Log:

: Considering profile ‘compute_11’ for gpu=‘sm_11’ in ‘cuModuleLoadDataEx_4’

: Retrieving binary for ‘cuModuleLoadDataEx_4’, for gpu=‘sm_11’, usage mode=’ --verbose --maxrregcount 30 ’

: Considering profile ‘compute_11’ for gpu=‘sm_11’ in ‘cuModuleLoadDataEx_4’

: Control flags for ‘cuModuleLoadDataEx_4’ disable search path

: Ptx binary found for ‘cuModuleLoadDataEx_4’, architecture=‘compute_11’

: Ptx compilation for ‘cuModuleLoadDataEx_4’, for gpu=‘sm_11’, ocg options=’ --verbose --maxrregcount 30 ’

ptxas info : Compiling entry function ‘opticalFlow’ for ‘sm_11’

ptxas info : Used 15 registers, 40+16 bytes smem, 199 bytes cmem[0], 20 bytes cmem[1]

ptxas info : Compiling entry function ‘rof’ for ‘sm_11’

ptxas info : Used 23 registers, 12+16 bytes smem, 199 bytes cmem[0], 16 bytes cmem[1]

ptxas info : Compiling entry function ‘warp’ for ‘sm_11’

ptxas info : Used 8 registers, 16+16 bytes smem, 199 bytes cmem[0], 4 bytes cmem[1]

What is sm_11 ?

And what are smem, cmem[0] and cmem[1] ? What are the limits I should not exceed for each mem ?
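For context, sm_11 is the compiler target for NVIDIA compute capability 1.1, smem is shared memory (what OpenCL calls local memory), and cmem[n] are constant memory banks. The "Used ..." lines can be read mechanically and compared against the register limit. Below is a small sketch in plain C; the struct and function names are invented for illustration, and since the log does not explain the split between the two smem figures, the struct simply keeps both.

```c
#include <assert.h>
#include <stdio.h>

/* Resource figures taken from one "ptxas info : Used ..." line. */
struct ptxas_usage {
    int registers;
    int smem_a;   /* first smem figure, e.g. 12 in "12+16"  */
    int smem_b;   /* second smem figure, e.g. 16 in "12+16" */
};

/* Parses a "Used N registers, A+B bytes smem" line.
 * Returns 1 on success, 0 if the line has a different shape. */
int parse_ptxas_usage(const char *line, struct ptxas_usage *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "ptxas info : Used %d registers, %d+%d bytes smem",
                  &out->registers, &out->smem_a, &out->smem_b) == 3;
}

/* Returns 1 if registers * group_size exceeds the per-block register
 * limit (simple product rule, ignoring allocation granularity). */
int exceeds_register_budget(const struct ptxas_usage *u,
                            int group_size, int regs_per_block)
{
    assert(u != NULL);
    return u->registers * group_size > regs_per_block;
}
```

Fed the three "Used" lines from the log above with a group of 512 and a limit of 8192, only rof (23 registers) comes out over budget; opticalFlow (15) and warp (8) fit.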

Thanks a lot !

Arthur

For those who have problems with the number of registers:

“The available number of registers is always a huge problem, most times a simple splitting of the algorithm into multiple kernels is the fastest approach (depending on the CC of the GPU).”