atomicadd for double precision in CUDA Fortran

atomicadd support integer type only in cuda pgfortran.
But it is important to me in coding on a Monte Carlo code.

How to realize it? Anyone has tried?

  1. I try to use CUDA C (which support atomicadd for float values now) as extern function, but “PGF90-S-0155-Calls from device code to a host function are allowed only in emulation mode - atomicdadd”

  2. Also atomicexch in pgfortran do not support float values, which are used in old version of CUDA C.

I almost finish the code implantation for GPU (1000+ lines), but it block me.

Thanks in advance. Hope pgfortran work fast for cuda!

gfwang

Hi gfwang,

I added a feature request (TPR#17778). Hopefully we can add them soon.

  • Mat

… …

I hope I could wait …

NOTE: your trial license will expire in 9 days, 6.72 hours.

The trail license could be refilled?

My original plan is to advise the lab to buy it with this demonstration code,
though we really have so many compilers – pathscale, ifort …

Is there anyway to mimic this function?

The trail license could be refilled?

Yes. Send a note to Sales (sales@pgroup.com), they can reset trial licenses or extend a longer term Demo license.

we really have so many compilers

Every compiler has it’s pluses and minus. Having multiple vendors ensures you have the best option for the particular need at hand. PGI is both strong in CPU performance and has taken the lead in GPU programing. Being independent from the architectures we target allows us to work on the current state of the art HPC architectures, whether that be AMD, Intel, NVIDIA, or what ever the future holds.

Is there anyway to mimic this function?

I don’t think so, but I’ll double check.

  • Mat

Thanks a lot!
I’m considering to change the algorithm, let CPU do this job. Atomic operation may be expensive.

Every compiler has it’s pluses and minus. Having multiple vendors ensures you have the best option for the particular need at hand. PGI is both strong in CPU performance and has taken the lead in GPU programing. Being independent from the architectures we target allows us to work on the current state of the art HPC architectures, whether that be AMD, Intel, NVIDIA, or what ever the future holds.

Yes. PGFORTRAN is the only compiler supporting cuda nowaday. And lots scientific computation codes are written in fortran.


I hope PGFORTRAN could develop supportings for CUDA faster.
For example:

  1. support multi Modules for global & device subroutines

  2. minor request – support character strings in emu mode.
    It would be interesting for me to do handy debug.

In emu mode (-Mcuda=emu), I/O are supported by pgfortran:
PGF90-W-0155-I/O statements allowed in device routines only in emulation mode
But it do not support character strings also:
PGF90-S-0155-CUDA device routines do not support character strings with length > 1

  1. support multi Modules for global & device subroutines

This is our most requested feature. The problem is that there isn’t a linker for device code so no way to statically associate symbols from multiple objects. Secondly, CUDA doesn’t support true function calling on the device. Currently, all called device routine must be inlined at compile time.

These are difficult challenges but we are working on them. Dr. Michael Wolfe in this article (Account Login | PGI) discusses these challenges, some of our solutions and future directions.

  1. minor request – support character strings in emu mode.

I added a feature request (TPR#17781).

  • Mat

Secondly, CUDA doesn’t support true function calling on the device. Currently, all called device routine must be inlined at compile time.

Ah, no stacks for function calling in CUDA? it’s really hard jobs …
It’s really nice to work together with a compiler expert.


My questions go on.

In emu mode (-Mcuda=emu -Mbounds), My code works very well.
But in cuda , failed with prompts:
copyout Memcpy (host=0x7c67e0, dev=0x1e620000, size=131072) FAILED: 4(unspecified launch failure)

What is the most possible reason for this error?
What’s difference between emu and real CUDA?

How to stop the compiler optimize on kernel code ? -O0 can’t stop all.

Hi tlstar,

Your kernel is most likely seg faulting due to an out-of-bounds error when accessing energy_inter. I’m guessing that you have more threads than elements in the array.

To fix, add a guard before accessing the array. i.e. “IF (i .LE. ATOMIC_RAYS) then”

Also, your launch configuration could e better. I would suggest having the number of blocks being variable and threads being fixed. The number of blocks is maxed at 64k while the number threads is 512 or 1024 depending on your device. Your current config will break as ATOMIC_RAYS becomes large. Something like the following would be better:

CALL raycast<<<(ATOMIC_RAYS+GPU_CORES-1)/GPU_CORES, GPU_CORES>>>(point_dev, cell_dev, simul_dev,energy_inter_dev)

Hope this helps,
Mat

Thanks for the most prompting reply.

The difference I found for comment the results output line, may due to the “-fast” option to optimization by compiler.

How to stop the compiler optimize on kernel code ? -O0 can’t stop all.

And what’s difference between emu and real CUDA?

Hi tlstar,

The difference I found for comment the results output line, may due to the “-fast” option to optimization by compiler.

Possible, but given the code you had posted earlier, it’s more likely a programing error. I would need a reproducing example to be sure. Please feel free to send the code to PGI Customer Service (trs@pgroup.com) and ask them to send it to me.

And what’s difference between emu and real CUDA?

Emulation mode (-Mcuda=emu) generates a CPU version of the code that uses OpenMP Tasks to simulate a NVIDIA device. It’s best used for debugging since the PGI debugger (pgdbg) is OpenMP capable.

Though, since it is running on the CPU, there still can be differences than running on the GPU. For example, on a CPU if you write beyond the end of an array, the code most likely wont seg fault. You may stomp over another variable’s data and cause other problems, but not seg fault. On the GPU, accessing memory even one element beyond the end of an array will trigger a seg fault. Adding array bounds checking (-Mbounds) in emulation mode should help find these errors.

  • Mat

Hi Mat,

Thanks a lot.

Now, I manage to make it works. several efforts are made:

  1. change some expressions in kernel code, like
  
        point_in(:) = 0.99D0 * A_dev(:, point) &
          + 0.01D0/4.0D0 * (A_dev(:,B_dev(1,cell)) + A_dev(:,B_dev(2,cell)) + A_dev(:,B_dev(3,cell)) + A_dev(:,B_dev(4,cell)))

Change into

        point1 = B_dev(1,cell)
        point2 = B_dev(2,cell)
        point3 = B_dev(3,cell)
        point4 = B_dev(4,cell)
        point_in(:) = 0.99D0 * A_dev(:, point) &
          + 0.01D0/4.0D0 * (A_dev(:,point1) + A_dev(:,point2) + A_dev(:,point3) + A_dev(:,point4))

A_dev are DOUBLE PRECISION, device arrays; B_dev are INTEGER, device arrays.

  1. stop the optimization of pgfortran link in makefile
F90     = pgfortran -module $(MODULE) -Mmpi=mpich1 -Mcuda -Mbyteswapio

 OPT     = -O3 -tp nehalem-64
 GPUOPT  = -O3 -ta=nvidia:cuda3.2 -Mcuda=keepbin -Mcuda=keepptx -Mcuda=ptxinfo -v
 LN_OPT  = -O0
#
# compilation rules
#
.SUFFIXES : .CUF .c .f .f90 .mod .F90
.mod.o :
    $(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.f90
.CUF.o :
    $(F90) $(GPUOPT) $(INCLUDES) -o $*.o -c $*.CUF
.f90.o :
    $(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.f90
.F90.o :
    $(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.F90
.c.o :
    $(CC)  $(INCLUDES) -o $*.o -c $*.c
#   $(CC)  -O3 $(INCLUDES) -ffast-math -o $*.o -c $*.c
#  compilation
#
$(TARGET) : $(OBJECTS_F90) $(OBJECTS_C)
    $(F90) $(LN_OPT) $(OBJECTS_C) $(OBJECTS_F90) -o $@
#

If I change it with LN_OPT = -O3 (linking parameter), the run is stopped with the same error. Maybe there is a bug in the pgfortran link (strange). I try to provide my codes to you for test.

  1. change the block & grid parameters to run kernel
CALL raycast<<<ATOMIC_RAYS/BLOCK_SIZE,BLOCK_SIZE>>>(point_dev, cell_dev, simul_dev,energy_inter_dev)

defined as

INTEGER, PARAMETER            ::  BLOCK_SIZE=128
INTEGER, parameter              :: ATOMIC_RAYS = 14*BLOCK_SIZE

But if I increase ATOMIC_RAYS into a larger value, for example 214BLOCKSIZE, the code stopped with the same error.
I do not quite know about these two parameters.
BLOCK_SIZE should be 32, 64, 96, 128, 256 … 1024? But also limited by registers?

In the compiling, I see the registers occupation following. 122 registers & 63 registers, which is the exact value? And what is the lmem, smem, do they limit the total thread number?

....
ptxas info    : Compiling entry function 'raycast' for 'sm_13'
ptxas info    : Used 122 registers, 168+0 bytes lmem, 40+16 bytes smem, 2768 bytes cmem[0], 120 bytes cmem[1], 4 bytes cmem[14]
....
ptxas info    : Compiling entry function 'raycast' for 'sm_20'
ptxas info    : Used 63 registers, 8+0 bytes lmem, 72 bytes cmem[0], 2768 bytes cmem[2], 4 bytes cmem[14], 40 bytes cmem[16]
  0 inform,   0 warnings,   0 severes, 0 fatal for ..cuda_fortran_constructor_1
PGF90/x86-64 Linux 11.3-0: compilation successful

And my GPU:

 One CUDA device found

 Device Number: 0
   Device name: Tesla M2050
   Compute Capability: 2.0
   Number of Multiprocessors: 14
   Number of Cores: 448
   Max Clock Rate (kHz): 1147000
   Warpsize: 32

    Execution Configuration Limits
      Maximum Grid Dimensions: 65535 x 65535 x 1
      Maximum Block Dimensions: 1024 x 1024 x 64
      Maximum Threads per Block: 1024

    Off-Chip Memory
      Total Global Memory (B): 2817982464
      Total Constant Memory (B): 65536
      Maximum Memory Pitch for Copies (B): 2147483647
      Integrated: No 

    On-Chip Memory
      Shared Memory per Multiprocessor (B): 49152
      Number of Registers per Multiprocessor: 32768

How to setup best blocksize and grid parameter to call kernel? If we have sufficient loads to execute.


Possible, but given the code you had posted earlier, it’s more likely a programing error. I would need a reproducing example to be sure. Please feel free to send the code to PGI Customer Service (> trs@pgroup.com> ) and ask them to send it to me.

Thanks a lot. It’s great to me, if you can help me to review the code.
I would try, though the program has a big input files (200M) to run.



Though, since it is running on the CPU, there still can be differences than running on the GPU. For example, on a CPU if you write beyond the end of an array, the code most likely wont seg fault. You may stomp over another variable’s data and cause other problems, but not seg fault. On the GPU, accessing memory even one element beyond the end of an array will trigger a seg fault. Adding array bounds checking (-Mbounds) in emulation mode should help find these errors.

I have used -Mbounds, since I read your posts (long-time ago) in the forum. It seems to be not a array bounds problem, at least not a simple bounds checking issue.

In emu mode, it works and gets the exact results as the CPU version. With the efforts mentioned above, it works in CUDA. But the results is not right.
Though we have random number part in the code. I store random seeds (6 double precision number) for each threads in device memory: load in threads start and save in the end of threads. I do not know whether this is a wise strategy.

Thanks again for your help. And thanks in advance for your comments.

Hi tlstar,

Maybe there is a bug in the pgfortran link (strange). I try to provide my codes to you for test.

Possible, but without a reproducible example I can’t tell. Though, you just sent in the code to PGI Customer Service, so I’ll take a look.

But if I increase ATOMIC_RAYS into a larger value, for example 214BLOCKSIZE, the code stopped with the same error.

You can have 64K blocks so 28 should be fine. If anything it’s too small.

BLOCK_SIZE should be 32, 64, 96, 128, 256 … 1024? But also limited by registers?

The block size will depend upon many factors. Registers are one factor, shared memory use, your algorithm, and the device being used are others.

In the compiling, I see the registers occupation following. 122 registers & 63 registers, which is the exact value?

The compiler is generating two separate device binaries. One targeting devices with compute capability 1.3 and the other targeting 2.0 devices. The 122 registers is for CC1.3 devices and 63 is for CC2.0. -Mcuda=ccXX flags allow you specify the CC binary to generate is you don’t need both targets.

Note that currently your code can use up to 16384 registers per block. So having a block size of 128 threads and 122 registers being used per thread, you are using 15616 registers per block. Hence, you most likely don’t want to use use a larger block size, unless you’re using a Fermi (CC2.0) card.

And what is the lmem, smem, do they limit the total thread number?

Local memory and Shared Memory per thread. ‘cmem’ is constant memory. The output from ‘pgaccelinfo’ will list the maximums.

How to setup best blocksize and grid parameter to call kernel?

Dr. Michael Wolfe likes to joke that there is PHd topic here if you can find a generic solution to finding the optimal schedule. Right now the state of the art solution is to try all possible combinations, which of course, is impractical. Instead it’s more of an art that comes from experience and trial and error. You need base it upon what’s best for your algorithm, how your memory is laid out, available device resources, etc.

Your kernel is fairly simple, thread-wise, so should be able to just try a few BLOCK_SIZE to find what gives you the best performance.

have used -Mbounds, since I read your posts (long-time ago) in the forum. It seems to be not a array bounds problem, at least not a simple bounds checking issue.

It the first thing to check since it’s a common error, especially when it “works” in emulation but fails on the GPU.

Though we have random number part in the code. I store random seeds (6 double precision number) for each threads in device memory: load in threads start and save in the end of threads. I do not know whether this is a wise strategy.

I’ll need to look at the code, but yes it could be a problem. It sounds like you may be creating a dependency.

  • Mat

Done!

Please help me to have a test on ATOMIC_RAYS=214BLOCKSIZE. It will lead to a error :
0: copyout Memcpy (host=0x7c9a70, dev=0x1e720000, size=28672) FAILED: 4(unspecified launch failure)
Why? While it works on ATOMIC_RAYS=14*BLOCKSIZE.

It works while BLOCK_SIZE=64 & ATOMIC_RAYS = 214BLOCK_SIZE; while fails with BLOCK_SIZE=64 & ATOMIC_RAYS = 414BLOCK_SIZE. Seems the total threads number is limited.

In the compiling, I see the registers occupation following. 122 registers & 63 registers, which is the exact value?The compiler is generating two separate device binaries. One targeting devices with compute capability 1.3 and the other targeting 2.0 devices. The 122 registers is for CC1.3 devices and 63 is for CC2.0. -Mcuda=ccXX flags allow you specify the CC binary to generate is you don’t need both targets.

Note that currently your code can use up to 16384 registers per block. So having a block size of 128 threads and 122 registers being used per thread, you are using 15616 registers per block. Hence, you most likely don’t want to use use a larger block size, unless you’re using a Fermi (CC2.0) card.

I’m using Tesla M2050. So CC2.0 is loaded automatically?
Device Number: 0
Device name: Tesla M2050
Compute Capability: 2.0
Number of Multiprocessors: 14
Number of Cores: 448
Max Clock Rate (kHz): 1147000
Warpsize: 32

And what is the lmem, smem, do they limit the total thread number?Local memory and Shared Memory per thread. ‘cmem’ is constant memory. The output from ‘pgaccelinfo’ will list the maximums.

It would be better, if the pgfortran can allow “INTEGER, CONSTANT, ALLOCATABLE, DIMENSION(:) ::” type. In my code, lots of data are only readable for the kernel codes and written by host. But the length is depending on the input files. CONSTANT variables will be visited faster?

Your kernel is fairly simple, thread-wise, so should be able to just try a few BLOCK_SIZE to find what gives you the best performance.

Thanks. I hope I could get a good speedup. However, quite a little bit more point-visiting on the device memory are involved. Could you comment on the optimization?

Though we have random number part in the code. I store random seeds (6 double precision number) for each threads in device memory: load in threads start and save in the end of threads. I do not know whether this is a wise strategy.I’ll need to look at the code, but yes it could be a problem. It sounds like you may be creating a dependency.

Maybe. Notice I store ATOMIC_RAYS number of random seeds, generate by the fortran RAND_NUMBER(). I think this instrics function are based on bit-shifting method, not the same as the function I implemented. If you have a better way for RNG (low cost and less seeds), please recommend.

Thanks!

Done!

I think that the most possible reason for different results from emulation (run in series of threads) and CUDA release may due to the conflicting on the write device memory.

But I only have two write operations:

  1.   energy_inter(id_thread) = energy_per_realisation
    
  2.   CALL aleatoire_save_GPU(id_thread,randseed)
    

And id_thread = (blockIdx%x - 1) * blockDim%x + threadIdx%x

It should avoid the conflicting already.



Also, the total threads number can’t increase, really strange to me.

It’s because I have a loop to call kernel subroutine. The vector optimization by pgfortran -O3, seems to be reorganize the loop. Then several kernel subroutines may be launched together by the unproper optimization of the compiler, leads to the conflicts.

Temporary solution: change to the pgfortran -O0 to compile the code to call kernel.

Done!

The rounding error may not be treated as the same in GPU & CPU.
This error is then propagating in the Random Number Generation algorithm.

So even for Fermi, TESLA C2050 is not full IEEE754 standard?


Ref. from CUDA wiki:
For double precision (for GPUs supporting CUDA compute capability 1.3 and above[12]) there are some deviations from the IEEE 754 standard: round-to-nearest-even is the only supported rounding mode for reciprocal, division, and square root. In single precision, denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest even), and those are specified on a per-instruction basis rather than in a control word; and the precision of division/square root is slightly lower than single precision.


The RNG Algorithm used here is very classical, something like the following.

Algorithm 1: Combined Multiple Recursive Generator

For i = 3 to n
Xi = (1403580Xi-2 - 810728Xi-3)(mod 4294967087)
Yi = (527612Yi-1 - 1370589Xi-3)(mod 4294944443)
Zi = (Xi - Yi)(mod 4294967087)
If (Zi > 0)
Ui = Zi/4294967088
If (Zi = 0)
Ui = 4294967087/4294967088
i = i + 1
End i

Hi Mat,

I have made two updates in my code:

  1. compile cell_loo_GPU.F90 in “-o0” model
    To avoid vector optimization in “do-looping” of calling kernel.
    settled the 414BLOCK_SIZE problem

  2. Round off random seeds in aleatoire_init_GPU
    This makes the exact same random number gotten by emu & GPU
    !===============================================================================

SUBROUTINE aleatoire_init_GPU(nblock)


! normalize the seeds
random(1:3,:) = AINT(random(1:3,:)m1,8)
random(4:6,:) = AINT(random(4:6,:)m2,8)
! write(0,
) “random seeds inited”
! write(0,
) random(:,1:10)

END SUBROUTINE aleatoire_init_GPU

!===============================================================================

But the results are still not the same between GPU & emu.
As we do not have a debug tool for GPU fortran , it’s really awful to investigate.

I hope the GPU fortran debug could be available soon. Otherwise the costs on debugging codes would be more than the translate the codes into C.

Could you tell me how to setup the “be” tool (.gpu to .ptx) to “-o0”?

gfwang

Bug report:

In kernel Fortran source code file:

cos_theta = 1.0d0 - 2.d0 * nb_aleatoire(randseed)

Compiled by pgfortran into low level c:

cos_theta = (1.00000000000000000E+0)-((nb_aleatoire((signed char*)(_prandseed)))+(nb_aleatoire((signed char*)(_prandseed))));

Notice nb_aleatoire is a function, the value is depending on the INTENT(INOUT) argument randseed. The translation is not equal.
Furthermore, I do not understand why the compiler would like to optimize this. As we all known, multiplication operation (*) is no slower then the addition in GPU or modern CPU, and much faster than getting a function value.

NOTE: your trial license will expire in 3 days, 12.7 hours.

I think I should be awarded with a longer term trail license of pgfortran compiler for my bug digging work in compiler itself.

But report (or future request) 2:

All initial GPU constant variables are set to 0 or 0.000 after pgfortran compiling into low-level C code (.gpu), without respection to user-defined values.

DOUBLE PRECISION, constant :: epsilon_paroi = 0.5

into

__constant__ struct{
int* m0;long long m8; ............................ ;double m2600;double m2608;
}__align__(16) _raycast_gpukernel_17 = { 0,0,......,-0.000000 };

Hi gfwang,

FYI, support for floating point atomics (TPR#17778) was added a few releases ago (sorry for the late update). The only caveat is that you need a device that supports CC2.0 to use them.

  • Mat