64 bit integer and bank conflicts

Is there a clear way to work with linear array (or matrices) of unsigned long long without incurring in bank conflicts?There is a solution on the programming guide which deals with double type but it doesn’t seem a good method and moreover I can’t find the equivalent of __hiloint2double for integer.
Thanks for help.

int64s don’t really give bank conflicts. Technically “yes”, but in reality, it’s what you’d expect anyway. If you have a 16-thread half-warp read 16x64-bit, it’s going to need 2 cycles vs a half-warp that reads 16x32-bit and needs 1. That’s it.

EDIT: Oh, no, I see, it will take 4 cycles because all 16 threads try to read the low word with a stride of 8, then the high word. Hmm. Well… a __hiloint2double for integers is simple. It can be defined as:

define __hiloint2longlong(hi,lo) (((long long)(hi)<<32) | (lo))

EDIT 2: But, the extra two cycles this will take will negate the performance improvement. (Perhaps there’s a way to do it efficiently from PTX, but that’s a lost cause.)

If you check headers, you will find in “sm_13_double_functions.h” definition:

__device_func__(double __hiloint2double( int a, int b ))

{

  volatile union {

	double	 d;

	signed int i[2];

  } cvt;

  cvt.i[0] = b;

  cvt.i[1] = a;

  return cvt.d;

}

From that it should be easy to make something similar, for example:

__device_func__(long long __hiloint2longlong(int a, int b))

{

  volatile union {

	long long	 d;

	signed int i[2];

  } cvt;

  cvt.i[0] = b;

  cvt.i[1] = a;

  return cvt.d;

}

or even better, use it inline :

// split representation of  __shared__ long long shared[32];

__shared__ long shared_lo[32];

__shared__ long shared_hi[32];

//I didnt test this, but its possible that this could save on indexed cvt.i[0], IF struct is permitted in volatile

// otherwise just use as above  "long i[2];"

struct Tmy64{

  long iLo;

  long iHi;

}

// in your kernel

//...

long long sum=0;

volatile union {

  long long	 d;

  struct Tmy64 i64;

} myL;

for (int idx=0; idx<32; idx++){

  // fill your long long

  myL.i64.iLo=shared_lo[idx];

  myL.i64.iHi=shared_hi[idx];

  // now directly use that long long in your computation, saving in return and another assigment (also saving on f-on call)

  sum+= myL.d;

}

EDIT:

BTW, I wonder how they ensure that compiler will put integer array in registers in their __hiloint2double function? Because if cvt.i ends up in local memory, there is no point in doing this, since it will be slower than just reading 8-byte values . Even if it put that volatile union variable into shared memory (which so far I didnt see compiler doing on its own), it would not be faster since it would again need to read 8 bytes from shared memory (especially considering function call/return overhead, since I dont see any “inline” directive).

And if it DOES put integer array cvt.i[2] into registers, than maybe we can use that trick for regular short arrays, since that is something I miss mosly. So instead of (not-existing):

__register__ int a[8]; // this __register__ does not exist now

  for ( int i=0; i<8; i++) a[i]=i;

maybe it could be used

volatile union { int a[8]; } ua;

  for ( int i=0; i<8; i++) ua.a[i]=i;

… but I doubt it will be so easy - my guess would be that (if it works with registers at all) compiler allows only limited array lengths and static indexing. Or even more probably, it does not work with registers, and using __hiloint2double will not be faster than just reading 8-byte values. But this is all guesswork at my side, since I still didnt have time to check if above things work in code.

You can fit arrays into registers if you do static accesses, as you say, and also by setting -maxrregcount parameter to nvcc to a high number. Works fairly consistently (well… as consistently as #pragma unroll). But it would be very nice to have register that would make the compiler more aggressive in giving you what you want.

I too am very curious if __hiloint2double() is optimized. I was looking at the PTX manual very briefly, and saw no way to access the high or low word of a 64bit register individually. (Oh, actually, I saw that unions are fundamental types on a PTX level so might be used to split a register, but they aren’t implemented in the current version.)

Btw, someone else here ran into a bug accessing the individual words of a ‘long long’. The compiler was mishandling taking the address of a long long variable, not realizing it was in a register, and producing garbage when you dereferenced the pointer by any non-zero offset (the user wanted to offset by 4 and dereference it as an int). Long longs have not been thoroughly tested in CUDA, so watch out for that.