Error when calling curand_init() on the even-numbered elements of a Philox-type array in NVHPC 22.3

The error is below:

FAILED: 719(unspecified launch failure)
0: DEALLOCATE: misaligned address

I ran the program under cuda-memcheck and found that the errors occur when the thread x index it reports is odd (so threadidx%x is even in Fortran). I then tried the code below to check this.

module mphilox
  use curand_device
  integer, parameter :: n = 6
  type(curandStatePhilox4_32_10), device :: h(n)
  real, device :: a(n), b(n)
contains
attributes(global) subroutine curandinit()
  integer(8) :: seed, seq, offset
  integer :: iam
  iam = (blockidx%x-1)*blockdim%x+threadidx%x
  seed = 12345
  if (mod(iam,2) == 0) return
!  call curand_init(seed, iam-1, offset, h(iam-1))
  call curand_init(seed, iam, offset, h(iam))
!  call curand_init(seed, iam-1, offset, h(iam-1))
  h(iam-1) = h(iam)
end subroutine
attributes(global) subroutine curandphilox()
  integer :: iam
  iam = (blockidx%x-1)*blockdim%x+threadidx%x
  a(iam) = curand_uniform(h(iam))
  b(iam) = curand_normal(h(iam))
  !b(iam:iam+1) = curand_norm2(h(iam))
  return
end subroutine

end module mphilox

program philox
  use mphilox
  a = 0.0; b = 0.0
  call curandinit<<<2,3>>> ()
  call curandphilox<<<2,3>>> ()
end

I changed lines 12-15 to:

!  if (mod(iam,2) == 0) return
  call curand_init(seed, iam, offset, h(iam-1))
  call curand_init(seed, iam, offset, h(iam))
!  call curand_init(seed, iam, offset, h(iam-1))

Errors occur in “by thread (0,0,0) in block (0,0,0); by thread (2,0,0) in block (0,0,0); by thread (1,0,0) in block (1,0,0)”

And if lines 12-15 are:

!  if (mod(iam,2) == 0) return
!  call curand_init(seed, iam-1, offset, h(iam-1))
  call curand_init(seed, iam, offset, h(iam))
  call curand_init(seed, iam-1, offset, h(iam-1))

Errors occur in “by thread (1,0,0) in block (0,0,0); by thread (0,0,0) in block (1,0,0); by thread (2,0,0) in block (1,0,0)”
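
cuda-memcheck prints zero-based thread and block coordinates, while iam in the kernels is built from Fortran's one-based threadidx%x and blockidx%x. A minimal C sketch of the conversion, assuming the <<<2,3>>> launch from the listing (global_index is a hypothetical helper name, not a CUDA API):

```c
#include <assert.h>

/* Mirrors the Fortran expression
 *   iam = (blockidx%x-1)*blockdim%x + threadidx%x
 * but takes the zero-based coordinates that cuda-memcheck prints.
 * global_index is a hypothetical helper, not part of any CUDA API. */
static int global_index(int block0, int thread0, int blockdim)
{
    assert(blockdim > 0);
    int blockidx = block0 + 1;   /* Fortran blockidx%x is one-based  */
    int threadidx = thread0 + 1; /* Fortran threadidx%x is one-based */
    return (blockidx - 1) * blockdim + threadidx;
}
```

With a block size of 3, thread (0,0,0) in block (0,0,0) gives iam = 1 and thread (1,0,0) in block (1,0,0) gives iam = 5, so the first set of reported threads all have odd iam (1, 3, 5), while the second set all have even iam (2, 4, 6).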

So I guess curand_init(seed, seq, offset, h(iam)) fails when iam is even.
One of the cuda-memcheck messages is below:

========= Invalid __global__ write of size 16
=========     at 0x000000c0 in mphilox_curandinit_
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f9616c0fcd8 is misaligned
=========     Device Frame:mphilox_curandinit_ (mphilox_curandinit_ : 0xc0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2e46de]
=========     Host Frame:/home/zirui/soft/NVHPC/Linux_x86_64/22.3/cuda/11.0/lib64/libcudart.so.11.0 [0xf62b]
=========     Host Frame:/home/zirui/soft/NVHPC/Linux_x86_64/22.3/cuda/11.0/lib64/libcudart.so.11.0 (cudaLaunchKernel + 0x1c1) [0x4f5b1]
=========     Host Frame:/home/zirui/soft/NVHPC/Linux_x86_64/22.3/compilers/lib/libcudafor.so (__pgiLaunchKernel + 0x1a6) [0x1202c]
=========     Host Frame:./philox.out [0x1511]
=========     Host Frame:./philox.out [0x11b3]
=========     Host Frame:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./philox.out [0x1029]
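
The report above describes a 16-byte write to a misaligned address. CUDA requires memory accesses to be naturally aligned, so a 16-byte store needs an address that is a multiple of 16. A small C sketch of that check (misalignment is a hypothetical helper name, and it assumes the access size is a power of two):

```c
#include <assert.h>
#include <stdint.h>

/* Returns addr modulo access_size; 0 means the access is naturally aligned.
 * misalignment is a hypothetical helper, not a CUDA or cuda-memcheck API.
 * Assumes access_size is a nonzero power of two. */
static unsigned misalignment(uint64_t addr, unsigned access_size)
{
    assert(access_size != 0 && (access_size & (access_size - 1)) == 0);
    return (unsigned)(addr & (uint64_t)(access_size - 1));
}
```

Applied to the address in the report, misalignment(0x7f9616c0fcd8, 16) is 8 while misalignment(0x7f9616c0fcd8, 8) is 0, so the address is 8-byte aligned but not 16-byte aligned.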

The NVHPC version and the compile command are below:

nvfortran --version
nvfortran 22.3-0 64-bit target on x86-64 Linux -tp haswell

nvfortran -cuda -cudalib=curand -o philox.out curand_philox_test.f90 && cuda-memcheck ./philox.out

By the way, I’d like to use curand_norm2(), but that also gives an error:

NVFORTRAN-S-0155-Calls from device code to a host function are allowed only in emulation mode - curand_norm2 (curand_philox_test.f90: 23)

But the cuRAND documentation lists it as a device API.
How should I call this function? Thanks.

This problem does not occur in NVHPC 22.2, but I need 22.3 because of the issue in my other topic: When seq and offset are not zero, the distributions curand given are error - Accelerated Computing / GPU-Accelerated Libraries - NVIDIA Developer Forums

I am by no means a Fortran expert, but are you mixing 1-based and 0-based indexing here?

iam = (blockidx%x-1)*blockdim%x+threadidx%x

For thread (0,0,0) in block (0,0,0), isn’t iam == 1?

Then here you are trying to write to index 0 for your RNG state?

call curand_init(seed, iam, offset, h(iam-1))

Lastly, adding --lineinfo to your compile command will tell you the exact line where your error is occurring.

Thanks for your suggestions.
About --lineinfo: nvfortran does not accept that flag:
nvfortran-Error-Unknown switch: --lineinfo

About 1-based versus 0-based indexing: Fortran is 1-based, and the original code did not use iam-1.
I only used iam-1 here to check whether the error occurs for the odd or the even elements. Changing it to iam+1 and declaring the arrays with n+1 elements would be correct.

Use -gpu=lineinfo for nvfortran.
Here’s my version, which I think has the indexing correct:

module mphilox
  use curand_device
  integer, parameter :: n = 6
  type(curandStatePhilox4_32_10), device :: h(n)
  real, device :: a(n), b(n)
contains
  attributes(global) subroutine curandinit()
    integer(8) :: seed, seq, offset
    integer :: iam
    ! 1-based global thread index
    iam = (blockidx%x-1)*blockdim%x+threadidx%x
    seed = 12345_8
    offset = 0_8
    ! Only odd-numbered threads initialize a state ...
    if (mod(iam,2) == 0) return
    call curand_init(seed, iam, offset, h(iam))
    ! ... and copy it into the following even-numbered element
    h(iam+1) = h(iam)
  end subroutine
  attributes(global) subroutine curandphilox()
    integer :: iam
    iam = (blockidx%x-1)*blockdim%x+threadidx%x
    a(iam) = curand_uniform(h(iam))
    b(iam) = curand_normal(h(iam))
    return
  end subroutine
end module mphilox

But, I think you have a race condition if more than one thread tries to use the same random state variable.

Actually, I guess you make a copy of it, so it is not shared between threads…

Thanks for the code. But it still has some problems, such as h(1) not being initialized and h(6+1) being out of bounds.

I only make the copy because, if I don’t use if (mod(iam,2) == 0) return, I get the error.
So I’d like to know why there is an error when I call curand_init(seed, iam, offset, h(even number)).

Oh, I see. call curand_init(seed, iam, offset, h(iam))

Does it only occur with curandStatePhilox4_32_10? I wonder if the size of that struct we are using in Fortran is wrong, or if it has changed recently.

At first I guessed that it occurs when threadidx%x is even, but the iam-1 test helped me determine that the error occurs for h(even number).

What happens if you make n=8?

curandStateMRG32k3a and curandStateXORWOW are OK.
It really did change recently, in response to my other topic:
When seq and offset are not zero, the distributions curand given are error - Accelerated Computing / GPU-Accelerated Libraries - NVIDIA Developer Forums

I first used 10000*10000 elements to check the random numbers.
Here I have just reduced the size.

Looks like we fixed one error but introduced another. I think it has to do with the alignment within the struct. In C, it is like this:

struct curandStatePhilox4_32_10 {
    uint4 ctr;
    uint4 output;
    uint2 key;
    unsigned int STATE;
    int boxmuller_flag;
    int boxmuller_flag_double;
    float boxmuller_extra;
    double boxmuller_extra_double;
};

I will dig into it, but my guess is the new way we are building the Fortran is not handling the alignment of uint2 correctly.
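
To make the alignment argument concrete, here is a host-side C11 sketch of the layout of the struct quoted above. It is not the cuRAND header: uint4_host and uint2_host are hypothetical stand-ins that assume CUDA's uint4 is 16-byte aligned and uint2 is 8-byte aligned, and the size checks assume 4-byte int and float and 8-byte double. The helper element_misalignment is likewise only illustrative.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical host stand-ins for CUDA's vector types
 * (assumption: uint4 is 16-byte aligned, uint2 is 8-byte aligned). */
typedef struct { alignas(16) unsigned int x; unsigned int y, z, w; } uint4_host;
typedef struct { alignas(8) unsigned int x; unsigned int y; } uint2_host;

/* Same member order as the struct quoted in the thread. */
typedef struct {
    uint4_host ctr;
    uint4_host output;
    uint2_host key;
    unsigned int STATE;
    int boxmuller_flag;
    int boxmuller_flag_double;
    float boxmuller_extra;
    double boxmuller_extra_double;
} philox_state_host;

/* Compile-time checks of the layout under the assumptions above. */
static_assert(alignof(philox_state_host) == 16, "struct needs 16-byte alignment");
static_assert(sizeof(philox_state_host) == 64, "struct occupies 64 bytes");
static_assert(offsetof(philox_state_host, key) == 32, "key follows two uint4");

/* Distance of element i (zero-based) from a 16-byte boundary, for an array
 * whose base is 16-byte aligned and whose elements are `stride` bytes apart. */
static size_t element_misalignment(size_t stride, size_t i)
{
    return (stride * i) % 16;
}
```

Under these assumptions the C struct is 64 bytes with 16-byte alignment, so every element of a 16-byte-aligned array stays aligned: element_misalignment(64, i) is 0 for all i. If a Fortran derived type were laid out with a stride that is an odd multiple of 8, for instance a purely hypothetical 72, every second element would start 8 bytes off a 16-byte boundary, which is the same pattern as the failures on even-numbered elements reported here. Whether that is what actually happens in 22.3 is not established in this thread.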

Thanks. I have another question: why can’t I call curand_normal2() in device code? The error is:

NVFORTRAN-S-0155-Calls from device code to a host function are allowed only in emulation mode - curand_normal2 (curand_philox_test.f90: 23)
The documentation lists it as:
__device__ float2 curand_normal2 ( curandStatePhilox4_32_10_t* state )

Yes, I am looking into that too.