Problem using LFSR random number generator in CUDA FORTRAN

tom.rb.edwards1 · February 15, 2011, 3:14pm

I’ve been trying to port this random number generator into CUDA. It’s a linear feedback shift register RNG, which satisfies the park-miller minimum standard. The source code can be found here: http://www.physics.udel.edu/~bnikolic/teaching/phys660/F90/rlfsr113.f90

After seeding, the random numbers are generated like so:

      b  = ishft(ieor(ishft(z1,6),z1),-13)
      z1 = ieor(ishft(iand(z1,-2),18),b)

      b  = ishft(ieor(ishft(z2,2),z2),-27)
      z2 = ieor(ishft(iand(z2,-8),2),b)

      b  = ishft(ieor(ishft(z3,13),z3),-21)
      z3 = ieor(ishft(iand(z3,-16),7),b)

      b  = ishft(ieor(ishft(z4,3),z4),-12)
      z4 = ieor(ishft(iand(z4,-128),13),b)

      rand_num=ishft( ieor(ieor(ieor(z1,z2),z3),z4) , -1)*4.656612873077d-10

When this works correctly, I get a uniform distribution of random reals between 0 and 1. The problem is, when I port this random number generator into CUDA, I don’t get the same numbers, and I get negative numbers as well. I figured out that the problem was to do with the fact that in FORTRAN, when you do a negative bit shift to a negative 32-bit integer, it cycles back to the highest positive integer (e.g, ISHFT(-4,-1) = 2147483646
). But when I do the same on CUDA, the number stays negative (e.g, ISHFT(-4,-1) = -2).

Is there a way to set it so that CUDA gets the same results as the CPU? Is it something to do with unsigned integers perhaps?

MatColgrove · February 16, 2011, 7:31pm

Hi Tom,

But when I do the same on CUDA, the number stays negative (e.g, ISHFT(-4,-1) = -2).

Do you have an example I could try? When I do an ISHFT(-4,-1) I get the same answers on both the CPU and GPU.

Mat

$ cat testishft.f90
	module testshift
	  
        integer, allocatable, dimension(:) :: arr 
        integer, allocatable, dimension(:), device :: arrD

	contains
	
	attributes(global) subroutine testshft ()
        integer ix
        ix = (blockidx%x-1)*blockdim%x + threadidx%x
        arrD(ix) = ISHFT(-4,-1)
	end subroutine testshft

	end module testshift

        program foo
        use testshift

        integer i
        allocate(arr(32), arrD(32))
        call testshft<<<1,32>>>()
        arr=arrD 
        i = ISHFT(-4,-1)
        print *, i, arr(1)
	end
$ pgf90 testishft.f90 -Mcuda; a.out
   2147483646   2147483646

tom.rb.edwards1 · February 17, 2011, 11:54am

The problem occurs when I use a variable inside the ISHFT function. For example, if instead of ISHFT(-4,-1) I had ISHFT(d_num,-1), where d_num is a device integer set equal to 4, then I get the wrong answer. I’ve written I piece of code exactly like yours, except using ‘d_num’ in the kernel and ‘num’ in the host:

module testshift 

  integer, allocatable, dimension(:) :: arr 
  integer, allocatable, dimension(:), device :: arrD 
  integer, device :: d_num = -4
  
contains 

  attributes(global) subroutine testshft () 
    integer ix 
    ix = (blockidx%x-1)*blockdim%x + threadidx%x 
    arrD(ix) = ISHFT(d_num,-1)
  end subroutine testshft

end module testshift

program foo 
  use testshift 

  integer :: i, num = -4
  allocate(arr(32), arrD(32)) 
  call testshft<<<1,32>>>() 
  arr=arrD
  i = ISHFT(num,-1) 
  print *, i, arr(1)
end program foo

This is what I get after making those changes:

./a.out 
   2147483646           -2

Thanks for the help btw. Do you know how I could get around this problem?

tom.rb.edwards1 · February 17, 2011, 3:06pm

Basically the problem can be boiled down to this simple problem on the GPU:

ISHFT(-4,-1) = 2147483646

integer :: var = -4
ISHFT(var,-1) = -2

The second one should be the same as the first one. So if there’s a way of solving this disparity, that would be great.
Thanks

Peter_Nightingale · February 17, 2011, 3:16pm

What your program prints out seems to be undefined; the kernel launch is asynchronous and the program just goes full steam ahead and prints garbage. Or am I missing something? Try inserting:

istat=cudathreadsynchronize()

before the print statement.

tom.rb.edwards1 · February 17, 2011, 3:27pm

What your program prints out seems to be undefined; the kernel launch is asynchronous and the program just goes full steam ahead and prints garbage. Or am I missing something?

That doesn’t happen to me. Anyway, I’ve tried running it as a single thread with <<1,1>> instead of <<<1,32>>>. This is the simpler version of the code which illustrates the problem:

module testshift 

  integer :: test            ! host 
  integer, device :: d_test  ! device
  
contains 

  attributes(global) subroutine testshft () 
    integer :: var
    var = -4
    d_test = ISHFT(var,-1)
  end subroutine testshft

end module testshift

program foo 
  use testshift 

  call testshft<<<1,1>>>() ! carry out ishft on gpu
  test = d_test            ! copy device result to host
  print *, test            ! print result
end program foo

Try replacing ISHFT(var,-1) with ISHFT(-4,-1) and you’ll get a different answer

tom.rb.edwards1 · February 17, 2011, 4:38pm

I also found that if you declare ‘var’ as a parameter, then it works fine. But I want to use ‘var’ as a variable

MatColgrove · February 17, 2011, 5:54pm

Thanks Tom, I was able to recreate the issue. The generated CUDA C code is as follows:

extern "C" __global__ void testshft()
{
int var;
var = -4;
_testshift_16.m0 = var>>(1);
}

So, yes, the difference is due to signed versus unsigned int. I’ve sent a report to our engineers (TPR#17689).

I’ve been looking for a work around, but since Fortran doesn’t have a concept of unsigned integers, it difficult to give the compiler a clue that var should be unsigned. We’ll most likely need to wait for a fix from our compiler engineers.

What your program prints out seems to be undefined; the kernel launch is asynchronous and the program just goes full steam ahead and prints garbage. Or am I missing something?

Hi Peter,

While kernel calls are asynchronous, data copies are not. Hence, the code is blocking on the data transfer so the print is fine. Your comment would be correct if the data transfer was not being used.

Mat

Peter_Nightingale · February 18, 2011, 4:19pm

I’m not sure I understand. I’ve replaced var (set to -4) by -4 and I get the same result.

module testshift

  integer :: test(2)            ! host
  integer, device :: d_test(2)  ! device

contains
attributes(global) subroutine testshft ()
    integer :: var
    var = -4
    d_test(1) = ISHFT(var,-1)
    d_test(2) = ISHFT(-4,-1)
  end subroutine testshft

end module testshift

program foo
  use testshift

  call testshft<<<1,1>>>() ! carry out ishft on gpu
  test = d_test                    ! copy device result to host
  print '(z32)', test              ! print result
end program foo

I am printing in hex format, which makes more sense if we are talking bits. In both cases, I get the same, which is precisely what you would expect when you use 2s-complement arithmetic. My impression is that if you print out the integers in FORTRAN in i-format, you get the C unsigned integer result with 2^16 subtracted; for 64 bit integers, you have the analogous situation, except that the shift is by 2^32. For both C and FORTRAN, the bits do what they are supposed to do, which is the same in both cases.

In other words, as far as the random number generator is concerned, the interpretation of the integers as signed or unsigned does not matter until the very end when you convert you result to a floating point number between 0 and 1. I think it should all work, as proposed by L’Ecuyer (sp?).

I would really like to know if I’m wrong, which is not unlikely, because I seem to be contradicting Mat: I do not understand the issue he says he has been able to reproduce, because I fail to see the issue.

A related problem is that nvcc seems to do 64 bit bit manipluation, but cudafortran chokes ungracefully on ishift etc. applied to 64 bit integers. I wonder when that will be fixed.

tom.rb.edwards1 · February 18, 2011, 7:11pm

I'm not sure I understand. I've replaced var (set to -4) by -4 and I get the same result.

I’ve just tried the code you wrote, and I still get different results.
When printing in hex format, I get this:

./a.out 
                        FFFFFFFE
                        7FFFFFFE

Which translates to -2 and 2147483646 when printing in standard format.

This appears to be a machine/compiler specific problem. Just so you know, I’m using pgi 10.8 and the latest cuda toolkit (v3.2). This machine is 64-bit and running openSUSE 11.3, and the gpu is a GeForce 9300 GE. I’ve also tested this on a newer gpu but I can’t remember what it’s called, I’ll post it as soon as I can find out.

Peter_Nightingale · February 19, 2011, 2:26am

It gets confusinger and confusinger. Same code compiled differently produces different results (shift.CUF is the program I posted this morning)

pgfortran -Mcuda shift.CUF -o shift
nigh@la 20:46tsang/sandbox$ ./shift
                        FFFFFFFE
                        7FFFFFFE

The first one is wrong; there is 1 where a 0 should have been shifted in from the left. The second one is correct.

pgfortran -ta=nvidia shift.CUF -o shift
nigh@la 20:47tsang/sandbox$ ./shift
                        7FFFFFFE
                        7FFFFFFE

Both are correct in this case. This may be the work around.

tom.rb.edwards1 · February 20, 2011, 10:35am

pgfortran -ta=nvidia shift.CUF -o shift

Thanks a lot, this fixed the problem! All I had to do was add the ‘-ta=nvidia’ compile flag.

tom.rb.edwards1 · February 21, 2011, 10:52am

Ack, actually the problem still occurs if you use an integer stored in the device memory, as I intend to do. I’ve added ‘integer, device :: d_var = -4’ to the code and when using d_var, I get the wrong result like before. Here’s the modified code:

module testshift 

  integer :: test(3)            ! host 
  integer, device :: d_test(3)  ! device 
  integer, device :: d_var = -4
  
contains 
attributes(global) subroutine testshft () 
    integer :: var 
    var = -4 
    d_test(1) = ISHFT(var,-1) 
    d_test(2) = ISHFT(-4,-1)
    d_test(3) = ISHFT(d_var,-1)
  end subroutine testshft 

end module testshift 

program foo 
  use testshift 

  call testshft<<<1,1>>>() ! carry out ishft on gpu 
  test = d_test                    ! copy device result to host 
  print '(z32)', test              ! print result 
  print *, test
end program foo

Now when I compile this problem like so:

pgfortran -ta=nvidia -Mcuda testishft4.f90

I get this output:

phrkaj@larch:~> ./a.out 
                        7FFFFFFE
                        7FFFFFFE
                        FFFFFFFE
   2147483646   2147483646           -2

The last number is the result of ISHFT(d_var,-1) where d_var is a device integer. Any thoughts on how to fix this problem (i.e, get 2147483646 instead of -2)?

Thanks

benoit2 · February 21, 2011, 6:44pm

I stumbled upon the very same bug implementing a similar (possibly the same…) RNG. For me, the following workaround works:

VMdebug=ISHFT(VM1,-8)
!!!! Work around to patch difference between -Mcuda=cc13 and -emu
if(VMdebug < 0) VMdebug=VMdebug+2**24

I hope it will work for you too.

Cheers,

Benoit.

tom.rb.edwards1 · February 22, 2011, 10:40am

I’m a bit worried about doing that, isn’t it going to significantly slow down the RNG? I need to be able to generate good random numbers really quickly

Peter_Nightingale · February 22, 2011, 1:47pm

In the two cases in which you get the correct result this simply happens because the compiler evaluates the shift at compile time. You can see the C code generated by the compiler by compiling with -Mcuda=keepgpu; then look for a file with extension gpu.

What seems to work is to define a local variable var1 in subroutine testshift, while you change d_test(3) = ISHFT(var,-1) to d_test(3) = ISHFT(var1,-1). That is probably also faster when you perform a couple of combined bit manipulation operations.

MatColgrove · February 22, 2011, 5:05pm

Hi Tom, Peter,

The reason why the example ‘works’ with “-ta=nvidia” is because this flag forces optimization level -O2. Using -O2 directly would have the same effect.

As Peter points out, for the construct “ISHFT(-4,-1)” will get evaluated on the host during compilation so results in a constant value 2147483646. For “ISHFT(var,-1)” at “-O2”, the compiler is able to evaluate that var is constant, hence can replace the expression with a constant at compile time.

Note that our compiler engineers do have a fix in place that should be available in the next release (11.3).

% cat ishift.f90 
module testshift

integer :: test ! host
integer, device :: d_test ! device

contains

attributes(global) subroutine testshft ()
integer :: var
var = -4
d_test = ISHFT(var,-1)
end subroutine testshft

end module testshift

program foo
use testshift

call testshft<<<1,1>>>() ! carry out ishft on gpu
test = d_test ! copy device result to host
print *, test ! print result
end program foo

% pgf90 -Mcuda=keepgpu ishift.f90   ! Using Pre-release 11.3 compiler
% a.out
   2147483646
% cat ishift.001.gpu
#include "cuda_runtime.h"
#include "pgi_cuda_runtime.h"
#include "ishift.001.h"
__device__ struct{
int m0;
}__align__(16) _testshift_16;
extern "C" __global__ void testshft()
{
int var;
var = -4;
_testshift_16.m0 = (unsigned int)var>>(1);    <<< Fix is to cast var to unsigned int
}

Thanks,
Mat

tom.rb.edwards1 · March 21, 2011, 10:52am

I found a workaround.

I fixed the problem by replacing ISHFT with IBITS whenever a negative integer is shifted to the right (negative bit shift). Specifically, I Replace ISHFT(var,-x) with IBITS(var,x,32-x)

This is the old random number generator:

      b  = ishft(ieor(ishft(z1,6),z1),-13) 
      z1 = ieor(ishft(iand(z1,-2),18),b) 

      b  = ishft(ieor(ishft(z2,2),z2),-27) 
      z2 = ieor(ishft(iand(z2,-8),2),b) 

      b  = ishft(ieor(ishft(z3,13),z3),-21) 
      z3 = ieor(ishft(iand(z3,-16),7),b) 

      b  = ishft(ieor(ishft(z4,3),z4),-12) 
      z4 = ieor(ishft(iand(z4,-128),13),b) 

      rand=ishft( ieor(ieor(ieor(z1,z2),z3),z4) , -1)*4.656612873077d-10

This is the new random number generator:

    b  = ibits(ieor(ishft(z1,6),z1),13,19)
    z1 = ieor(ishft(iand(z1,-2),18),b)

    b  = ibits(ieor(ishft(z2,2),z2),27,5)
    z2 = ieor(ishft(iand(z2,-8),2),b)

    b  = ibits(ieor(ishft(z3,13),z3),21,11)
    z3 = ieor(ishft(iand(z3,-16),7),b)

    b  = ibits(ieor(ishft(z4,3),z4),12,20)
    z4 = ieor(ishft(iand(z4,-128),13),b)
    
    rand=ibits(ieor(ieor(ieor(z1,z2),z3),z4),1,31)*AM

The definition of IBITS can be found here: IBITS (The GNU Fortran Compiler)

Thanks for the help

Peter_Nightingale · March 21, 2011, 6:02pm

According to the release notes, the shift problem has been solved in version 11.3; see Release 2011 Bug Fixes and Enhancements | PGI (TPR 17689).

Topic		Replies	Views
Different results with -Mcuda=emu / -Mcuda with simple code Legacy PGI Compilers	17	15311	December 10, 2009
compiler bug? bit shift CUDA Programming and Performance	7	4553	November 28, 2008
Erronous ptx generated CUDA Programming and Performance	0	1572	June 13, 2007
atomicadd for double precision in CUDA Fortran Legacy PGI Compilers	20	21662	November 15, 2013
random numbers inside the Kernel CUDA Programming and Performance	31	60921	November 26, 2011
Another Random Number Thread CUDA Programming and Performance	15	3391	August 13, 2009
Using -i8 flag with CUDA Fortran code Legacy PGI Compilers	4	7028	November 15, 2013
What's a good random number generator? CUDA Programming and Performance	21	13203	May 6, 2009
random number generation generating random numbers in CUDA CUDA Programming and Performance	64	109779	January 25, 2011
Error running simple CUDA Fortran program Legacy PGI Compilers	9	21335	February 26, 2010

Problem using LFSR random number generator in CUDA FORTRAN

Related topics