Problem using LFSR random number generator in CUDA FORTRAN

I’ve been trying to port this random number generator into CUDA. It’s a linear feedback shift register RNG, which satisfies the park-miller minimum standard. The source code can be found here: http://www.physics.udel.edu/~bnikolic/teaching/phys660/F90/rlfsr113.f90

After seeding, the random numbers are generated like so:

      b  = ishft(ieor(ishft(z1,6),z1),-13)
      z1 = ieor(ishft(iand(z1,-2),18),b)

      b  = ishft(ieor(ishft(z2,2),z2),-27)
      z2 = ieor(ishft(iand(z2,-8),2),b)

      b  = ishft(ieor(ishft(z3,13),z3),-21)
      z3 = ieor(ishft(iand(z3,-16),7),b)

      b  = ishft(ieor(ishft(z4,3),z4),-12)
      z4 = ieor(ishft(iand(z4,-128),13),b)

      rand_num=ishft( ieor(ieor(ieor(z1,z2),z3),z4) , -1)*4.656612873077d-10

When this works correctly, I get a uniform distribution of random reals between 0 and 1. The problem is, when I port this random number generator into CUDA, I don’t get the same numbers, and I get negative numbers as well. I figured out that the problem was to do with the fact that in FORTRAN, when you do a negative bit shift to a negative 32-bit integer, it cycles back to the highest positive integer (e.g, ISHFT(-4,-1) = 2147483646
). But when I do the same on CUDA, the number stays negative (e.g, ISHFT(-4,-1) = -2).

Is there a way to set it so that CUDA gets the same results as the CPU? Is it something to do with unsigned integers perhaps?

Hi Tom,

But when I do the same on CUDA, the number stays negative (e.g, ISHFT(-4,-1) = -2).

Do you have an example I could try? When I do an ISHFT(-4,-1) I get the same answers on both the CPU and GPU.

  • Mat
$ cat testishft.f90
	module testshift
	  
        integer, allocatable, dimension(:) :: arr 
        integer, allocatable, dimension(:), device :: arrD

	contains
	
	attributes(global) subroutine testshft ()
        integer ix
        ix = (blockidx%x-1)*blockdim%x + threadidx%x
        arrD(ix) = ISHFT(-4,-1)
	end subroutine testshft

	end module testshift

        program foo
        use testshift

        integer i
        allocate(arr(32), arrD(32))
        call testshft<<<1,32>>>()
        arr=arrD 
        i = ISHFT(-4,-1)
        print *, i, arr(1)
	end
$ pgf90 testishft.f90 -Mcuda; a.out
   2147483646   2147483646

The problem occurs when I use a variable inside the ISHFT function. For example, if instead of ISHFT(-4,-1) I had ISHFT(d_num,-1), where d_num is a device integer set equal to 4, then I get the wrong answer. I’ve written I piece of code exactly like yours, except using ‘d_num’ in the kernel and ‘num’ in the host:

module testshift 

  integer, allocatable, dimension(:) :: arr 
  integer, allocatable, dimension(:), device :: arrD 
  integer, device :: d_num = -4
  
contains 

  attributes(global) subroutine testshft () 
    integer ix 
    ix = (blockidx%x-1)*blockdim%x + threadidx%x 
    arrD(ix) = ISHFT(d_num,-1)
  end subroutine testshft

end module testshift

program foo 
  use testshift 

  integer :: i, num = -4
  allocate(arr(32), arrD(32)) 
  call testshft<<<1,32>>>() 
  arr=arrD
  i = ISHFT(num,-1) 
  print *, i, arr(1)
end program foo

This is what I get after making those changes:

./a.out 
   2147483646           -2

Thanks for the help btw. Do you know how I could get around this problem?

Basically the problem can be boiled down to this simple problem on the GPU:


ISHFT(-4,-1) = 2147483646

integer :: var = -4
ISHFT(var,-1) = -2

The second one should be the same as the first one. So if there’s a way of solving this disparity, that would be great.
Thanks

What your program prints out seems to be undefined; the kernel launch is asynchronous and the program just goes full steam ahead and prints garbage. Or am I missing something? Try inserting:

istat=cudathreadsynchronize()

before the print statement.

What your program prints out seems to be undefined; the kernel launch is asynchronous and the program just goes full steam ahead and prints garbage. Or am I missing something?

That doesn’t happen to me. Anyway, I’ve tried running it as a single thread with <<1,1>> instead of <<<1,32>>>. This is the simpler version of the code which illustrates the problem:


module testshift 

  integer :: test            ! host 
  integer, device :: d_test  ! device
  
contains 

  attributes(global) subroutine testshft () 
    integer :: var
    var = -4
    d_test = ISHFT(var,-1)
  end subroutine testshft

end module testshift

program foo 
  use testshift 

  call testshft<<<1,1>>>() ! carry out ishft on gpu
  test = d_test            ! copy device result to host
  print *, test            ! print result
end program foo

Try replacing ISHFT(var,-1) with ISHFT(-4,-1) and you’ll get a different answer

I also found that if you declare ‘var’ as a parameter, then it works fine. But I want to use ‘var’ as a variable

Thanks Tom, I was able to recreate the issue. The generated CUDA C code is as follows:

extern "C" __global__ void testshft()
{
int var;
var = -4;
_testshift_16.m0 = var>>(1);
}

So, yes, the difference is due to signed versus unsigned int. I’ve sent a report to our engineers (TPR#17689).

I’ve been looking for a work around, but since Fortran doesn’t have a concept of unsigned integers, it difficult to give the compiler a clue that var should be unsigned. We’ll most likely need to wait for a fix from our compiler engineers.


What your program prints out seems to be undefined; the kernel launch is asynchronous and the program just goes full steam ahead and prints garbage. Or am I missing something?

Hi Peter,

While kernel calls are asynchronous, data copies are not. Hence, the code is blocking on the data transfer so the print is fine. Your comment would be correct if the data transfer was not being used.

  • Mat

I’m not sure I understand. I’ve replaced var (set to -4) by -4 and I get the same result.

module testshift

  integer :: test(2)            ! host
  integer, device :: d_test(2)  ! device

contains
attributes(global) subroutine testshft ()
    integer :: var
    var = -4
    d_test(1) = ISHFT(var,-1)
    d_test(2) = ISHFT(-4,-1)
  end subroutine testshft

end module testshift

program foo
  use testshift

  call testshft<<<1,1>>>() ! carry out ishft on gpu
  test = d_test                    ! copy device result to host
  print '(z32)', test              ! print result
end program foo

I am printing in hex format, which makes more sense if we are talking bits. In both cases, I get the same, which is precisely what you would expect when you use 2s-complement arithmetic. My impression is that if you print out the integers in FORTRAN in i-format, you get the C unsigned integer result with 2^16 subtracted; for 64 bit integers, you have the analogous situation, except that the shift is by 2^32. For both C and FORTRAN, the bits do what they are supposed to do, which is the same in both cases.

In other words, as far as the random number generator is concerned, the interpretation of the integers as signed or unsigned does not matter until the very end when you convert you result to a floating point number between 0 and 1. I think it should all work, as proposed by L’Ecuyer (sp?).

I would really like to know if I’m wrong, which is not unlikely, because I seem to be contradicting Mat: I do not understand the issue he says he has been able to reproduce, because I fail to see the issue.

A related problem is that nvcc seems to do 64 bit bit manipluation, but cudafortran chokes ungracefully on ishift etc. applied to 64 bit integers. I wonder when that will be fixed.

I'm not sure I understand. I've replaced var (set to -4) by -4 and I get the same result.

I’ve just tried the code you wrote, and I still get different results.
When printing in hex format, I get this:

./a.out 
                        FFFFFFFE
                        7FFFFFFE

Which translates to -2 and 2147483646 when printing in standard format.

This appears to be a machine/compiler specific problem. Just so you know, I’m using pgi 10.8 and the latest cuda toolkit (v3.2). This machine is 64-bit and running openSUSE 11.3, and the gpu is a GeForce 9300 GE. I’ve also tested this on a newer gpu but I can’t remember what it’s called, I’ll post it as soon as I can find out.

It gets confusinger and confusinger. Same code compiled differently produces different results (shift.CUF is the program I posted this morning)

pgfortran -Mcuda shift.CUF -o shift
nigh@la 20:46tsang/sandbox$ ./shift
                        FFFFFFFE
                        7FFFFFFE

The first one is wrong; there is 1 where a 0 should have been shifted in from the left. The second one is correct.

pgfortran -ta=nvidia shift.CUF -o shift
nigh@la 20:47tsang/sandbox$ ./shift
                        7FFFFFFE
                        7FFFFFFE

Both are correct in this case. This may be the work around.

pgfortran -ta=nvidia shift.CUF -o shift

Thanks a lot, this fixed the problem! All I had to do was add the ‘-ta=nvidia’ compile flag.

Ack, actually the problem still occurs if you use an integer stored in the device memory, as I intend to do. I’ve added ‘integer, device :: d_var = -4’ to the code and when using d_var, I get the wrong result like before. Here’s the modified code:

module testshift 

  integer :: test(3)            ! host 
  integer, device :: d_test(3)  ! device 
  integer, device :: d_var = -4
  
contains 
attributes(global) subroutine testshft () 
    integer :: var 
    var = -4 
    d_test(1) = ISHFT(var,-1) 
    d_test(2) = ISHFT(-4,-1)
    d_test(3) = ISHFT(d_var,-1)
  end subroutine testshft 

end module testshift 

program foo 
  use testshift 

  call testshft<<<1,1>>>() ! carry out ishft on gpu 
  test = d_test                    ! copy device result to host 
  print '(z32)', test              ! print result 
  print *, test
end program foo

Now when I compile this problem like so:

pgfortran -ta=nvidia -Mcuda testishft4.f90

I get this output:

phrkaj@larch:~> ./a.out 
                        7FFFFFFE
                        7FFFFFFE
                        FFFFFFFE
   2147483646   2147483646           -2

The last number is the result of ISHFT(d_var,-1) where d_var is a device integer. Any thoughts on how to fix this problem (i.e, get 2147483646 instead of -2)?

Thanks

I stumbled upon the very same bug implementing a similar (possibly the same…) RNG. For me, the following workaround works:

VMdebug=ISHFT(VM1,-8)
!!!! Work around to patch difference between -Mcuda=cc13 and -emu
if(VMdebug < 0) VMdebug=VMdebug+2**24

I hope it will work for you too.

Cheers,

Benoit.

I’m a bit worried about doing that, isn’t it going to significantly slow down the RNG? I need to be able to generate good random numbers really quickly

In the two cases in which you get the correct result this simply happens because the compiler evaluates the shift at compile time. You can see the C code generated by the compiler by compiling with -Mcuda=keepgpu; then look for a file with extension gpu.

What seems to work is to define a local variable var1 in subroutine testshift, while you change d_test(3) = ISHFT(var,-1) to d_test(3) = ISHFT(var1,-1). That is probably also faster when you perform a couple of combined bit manipulation operations.

Hi Tom, Peter,

The reason why the example ‘works’ with “-ta=nvidia” is because this flag forces optimization level -O2. Using -O2 directly would have the same effect.

As Peter points out, for the construct “ISHFT(-4,-1)” will get evaluated on the host during compilation so results in a constant value 2147483646. For “ISHFT(var,-1)” at “-O2”, the compiler is able to evaluate that var is constant, hence can replace the expression with a constant at compile time.

Note that our compiler engineers do have a fix in place that should be available in the next release (11.3).

% cat ishift.f90 
module testshift

integer :: test ! host
integer, device :: d_test ! device

contains

attributes(global) subroutine testshft ()
integer :: var
var = -4
d_test = ISHFT(var,-1)
end subroutine testshft

end module testshift

program foo
use testshift

call testshft<<<1,1>>>() ! carry out ishft on gpu
test = d_test ! copy device result to host
print *, test ! print result
end program foo

% pgf90 -Mcuda=keepgpu ishift.f90   ! Using Pre-release 11.3 compiler
% a.out
   2147483646
% cat ishift.001.gpu
#include "cuda_runtime.h"
#include "pgi_cuda_runtime.h"
#include "ishift.001.h"
__device__ struct{
int m0;
}__align__(16) _testshift_16;
extern "C" __global__ void testshft()
{
int var;
var = -4;
_testshift_16.m0 = (unsigned int)var>>(1);    <<< Fix is to cast var to unsigned int
}

Thanks,
Mat

I found a workaround.

I fixed the problem by replacing ISHFT with IBITS whenever a negative integer is shifted to the right (negative bit shift). Specifically, I Replace ISHFT(var,-x) with IBITS(var,x,32-x)

This is the old random number generator:

      b  = ishft(ieor(ishft(z1,6),z1),-13) 
      z1 = ieor(ishft(iand(z1,-2),18),b) 

      b  = ishft(ieor(ishft(z2,2),z2),-27) 
      z2 = ieor(ishft(iand(z2,-8),2),b) 

      b  = ishft(ieor(ishft(z3,13),z3),-21) 
      z3 = ieor(ishft(iand(z3,-16),7),b) 

      b  = ishft(ieor(ishft(z4,3),z4),-12) 
      z4 = ieor(ishft(iand(z4,-128),13),b) 

      rand=ishft( ieor(ieor(ieor(z1,z2),z3),z4) , -1)*4.656612873077d-10

This is the new random number generator:

    b  = ibits(ieor(ishft(z1,6),z1),13,19)
    z1 = ieor(ishft(iand(z1,-2),18),b)

    b  = ibits(ieor(ishft(z2,2),z2),27,5)
    z2 = ieor(ishft(iand(z2,-8),2),b)

    b  = ibits(ieor(ishft(z3,13),z3),21,11)
    z3 = ieor(ishft(iand(z3,-16),7),b)

    b  = ibits(ieor(ishft(z4,3),z4),12,20)
    z4 = ieor(ishft(iand(z4,-128),13),b)
    
    rand=ibits(ieor(ieor(ieor(z1,z2),z3),z4),1,31)*AM

The definition of IBITS can be found here: IBITS (The GNU Fortran Compiler)

Thanks for the help

According to the release notes, the shift problem has been solved in version 11.3; see Release 2011 Bug Fixes and Enhancements | PGI (TPR 17689).