__private memory questions

Which is better for speed?

Option 1:

__private int2 buffer[60];

...

buffer[0] = (int2)(0,1);

...

or

Option 2:

__private int bufferA[60];

__private int bufferB[60];

...

bufferA[0] = 0;

bufferB[0] = 1;

...

or int4 perhaps?
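
Something like this, I guess (just a sketch off the top of my head; packing two of the int2 pairs per element is my assumption, one possible layout among others):

__private int4 buffer[30]; /* 60 int2 values, two (x,y) pairs per int4 */

...

buffer[0] = (int4)(0, 1, 0, 1); /* .xy = old element 0, .zw = old element 1 */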

and… should I use an array with a power-of-two size (64, 32, 16, 8, …), or can I use a size like 60, 40, 20? I bet power-of-two array sizes will align the buffer better internally.

And… before you ask, I CANNOT use shared memory; it must be private local memory.

And… what’s the limit of private local memory for G80, GT200 and Fermi cards, pls? 512 bytes per thread? 256 KB per thread/block? I need to know the limit, pls.

thx

To my understanding, private memory is “spread out” across work items (probably by warp), so you don’t have to worry much about alignment problems; it is likely “auto-aligned” according to the number of threads.

Each multiprocessor has 16384 registers, I think; Fermi should have double that. These registers are shared among all warps assigned to an SM, so how much is available per thread depends on workgroup size and the like. The Compute Profiler can dump information about register usage.
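
To illustrate the arithmetic with those numbers (which I haven’t verified per chip): 16384 registers / 128 threads = 128 registers per thread if only one block is resident on the SM; with 4 resident blocks that drops to 16384 / (128 * 4) = 32 registers per thread.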

I’d be careful with __private memory; it’s normally not a good idea to use it for big arrays. Use shared memory (__local in OpenCL) or __global memory instead. Why do you think that isn’t possible? If you run out of registers, it’s going to be very slow!

For __private memory that ends up in registers, the access pattern shouldn’t matter, though.

Because I use 128 threads/block and I need 2 arrays of 64 ints per thread, that’s 2 * sizeof(int) * 64 = 512 bytes per thread. At 128 threads/block that’s 64 KB of shared memory per work group, and with a max of 8 active blocks that’s 64 KB * 8 = 512 KB per compute unit… but Fermi has at most 48 KB of shared memory for the whole compute unit. My only options are __private or __global.

Curiously, passing the -cl-nv-verbose option I see “512 lmem, 512 cmem” in the build log… so it seems the Cg OpenCL JIT is using constant memory to do the trick, though I’m not sure how.
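
In case anyone wants to reproduce that log, here is a minimal host-side sketch of passing -cl-nv-verbose and reading the build log back (error checking omitted; the program and device are assumed to be created already):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

void dump_build_log(cl_program program, cl_device_id device)
{
    size_t size;

    /* build with NVIDIA's verbose flag so the log should contain
       the lmem/cmem/register usage lines */
    clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);

    /* query the log size first, then fetch and print it */
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &size);
    char *log = (char *)malloc(size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, size, log, NULL);
    printf("%s\n", log);
    free(log);
}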

Global memory with proper coalescing should give you the best performance then.
AFAIK __private always means either registers or CUDA-style local memory (which is just as slow as global memory).
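
A sketch of what “proper coalescing” could look like for per-thread scratch arrays in __global memory: interleave them so that element i of every work item is contiguous. (The kernel name, buffer name and sizes below are made up for illustration, not from this thread.)

__kernel void example(__global int *scratch) /* holds 64 ints per work item */
{
    size_t gid = get_global_id(0);
    size_t gsz = get_global_size(0);

    /* element i of this work item lives at scratch[i * gsz + gid],
       so neighbouring threads access neighbouring addresses */
    scratch[0 * gsz + gid] = 0;
    scratch[1 * gsz + gid] = 1;
}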
