Performance Issue

I’m working on an algorithm that makes extensive use of bitwise operations; here is a sample:

#define UINT64 unsigned long long

#define index(x, y) (((x)%5)+5*((y)%5))

// NVCC Bug
//#define ROL64(a, offset) ((offset != 0) ? ((((UINT64)a) << offset) ^ (((UINT64)a) >> (64-offset))) : a)

__device__ inline UINT64 ROL64(UINT64 a, unsigned int offset)
{
   const int _offset = offset;

   return ((offset != 0) ? ((a << _offset) ^ (a >> (64-offset))) : a);
}

__global__ void kernel()
{
   // ...

   for (int i = 0; i < 23; i++)
   {
      // ...

      A[x] = A[x] ^ C[x];

      C[x] = A[index(x, y)] ^ ((~A[index(x+1, y)]) & A[index(x+2, y)]);

      D[x] = ROL64(C[x], 1);

      // Constant Memory access
      // ...
   }
}

The performance I obtain is extremely poor, even though I don’t use global memory or any synchronization primitives (the threads are independent).

I also tried removing the modulo operations, but the situation remained almost the same.

I’m benchmarking on a device with compute capability 1.2. Could it be a problem with the UINT64 (64-bit) type?

What are A, C and D? If they are not in shared memory, they will be placed in local memory, which is just as slow as global memory.

“A” is a local variable.

I thought local variables were stored in registers, and that accessing them was therefore extremely fast…

Local arrays will only be stored in registers if all indices can be resolved to constants at compile time.

Loop unrolling can help to achieve this. Otherwise you’ll have to use shared memory (or contend with a slower kernel).

I didn’t know that… thank you.

However, in my case that condition should hold. I tried a completely unrolled kernel and the performance was actually better, but not enough.

How can I verify where the local variables are stored?

What do you think of the UINT64 issue?

Thank you very much.


I forgot to mention that A, B, C are arrays of UINT64…

Compile with --ptxas-options=-v and check that no local memory use is reported.

It might be that the compiler does not optimize as well for 64-bit variables as for 32-bit ones (apart from the obvious doubling of register and instruction count). But memory issues should definitely be sorted out before looking at that.

This is the output…

ptxas info    : Used 12 registers, 480+0 bytes lmem, 16+16 bytes smem, 292 bytes cmem[0], 12 bytes cmem[1]

I removed all the array index calculations but some local memory is still used.

I cannot figure out the reason.

If the arrays are using 480 bytes of local memory, that would be equivalent to 120 registers (480 bytes ÷ 4 bytes per 32-bit register). I doubt the compiler would ever be able to accommodate an array of that size in registers; the theoretical PTX limit is 128 registers per thread, in practice a few less than that.

I agree, but then why does this simple code

typedef unsigned long long UINT64;

__global__ void kernel(UINT64 *messages_d, UINT64 *state_d)
{
   UINT64 A[5];

   // Absorbing
   A[0] = state_d[0] ^ messages_d[0];

   state_d[0] = A[0];
}

give me this output?

ptxas info    : Compiling entry function '_Z6kernelPyS_' for 'sm_11'

ptxas info    : Used 6 registers, 40+0 bytes lmem, 16+16 bytes smem, 292 bytes cmem[0]