Performance Issue

I’m working on an algorithm that makes extensive use of bitwise operations; here is a sample:

#define UINT64 unsigned long long

#define index(x, y) (((x)%5)+5*((y)%5))

// NVCC Bug
//#define ROL64(a, offset) ((offset != 0) ? ((((UINT64)a) << offset) ^ (((UINT64)a) >> (64-offset))) : a)

__device__ inline UINT64 ROL64(UINT64 a, unsigned int offset)
{
   const int _offset = offset;

   return ((offset != 0) ? ((a << _offset) ^ (a >> (64-offset))) : a);
}

__global__ void kernel()
{
   // ...

   for (int i = 0; i < 23; i++)
   {
      // ...

      A[x] = A[x] ^ C[x];

      C[x] = A[index(x, y)] ^ ((~A[index(x+1, y)]) & A[index(x+2, y)]);

      D[x] = ROL64(C[x], 1);

      // Constant Memory access
      // ...
   }
}

The performance I obtain is extremely poor, even though I don’t use global memory or any synchronization primitives (the threads are independent).

I also tried removing the modulo operations, but the situation remained almost the same.

I’m benchmarking on a device with compute capability 1.2. Could it be a problem with the UINT64 (64-bit) type?

What are A, C and D? If they are not in shared memory, they will be placed in local memory, which is just as slow as global memory.

“A” is a local variable.

I thought local variables were stored in registers, and that accessing them was therefore extremely fast…

Local arrays will only be stored in registers if all indices can be resolved to constants at compile time.

Loop unrolling can help to achieve this. Otherwise you’ll have to use shared memory (or contend with a slower kernel).

I didn’t know that… thank you.

However, in my case that condition should hold. I tried a completely unrolled kernel and the performance was actually better, but not enough.

How can I verify where the local variables are stored?

What do you think of the UINT64 issue?

Thank you very much.


I forgot to mention that A, B, C are arrays of UINT64…

Compile with --ptxas-options=-v and check that no local memory use is reported.

It might be that the compiler does not optimize as well for 64-bit variables as for 32-bit ones (apart from the obvious doubling of register and instruction count). But memory issues should definitely be sorted out before looking at that.

This is the output…

ptxas info    : Used 12 registers, 480+0 bytes lmem, 16+16 bytes smem, 292 bytes cmem[0], 12 bytes cmem[1]

I removed all the array index calculations but some local memory is still used.

I cannot figure out the reason.

If the arrays are using 480 bytes of local memory, that would be equivalent to 120 registers (480 bytes ÷ 4 bytes per 32-bit register). I doubt the compiler would ever be able to accommodate an array of that size in registers; the theoretical PTX limit is 128 registers per thread, in practice a few less than that.

I agree, but then why does this simple code

typedef unsigned long long UINT64;

__global__ void kernel(UINT64 *messages_d, UINT64 *state_d)
{
   UINT64 A[5];

   // Absorbing
   A[0] = state_d[0] ^ messages_d[0];

   state_d[0] = A[0];
}

give me this output?

ptxas info    : Compiling entry function '_Z6kernelPyS_' for 'sm_11'

ptxas info    : Used 6 registers, 40+0 bytes lmem, 16+16 bytes smem, 292 bytes cmem[0]