Hi Mat,
Thanks a lot.
Now I have managed to make it work. Several changes were needed:
- Rewrite some expressions in the kernel code. For example,
point_in(:) = 0.99D0 * A_dev(:, point) &
+ 0.01D0/4.0D0 * (A_dev(:,B_dev(1,cell)) + A_dev(:,B_dev(2,cell)) + A_dev(:,B_dev(3,cell)) + A_dev(:,B_dev(4,cell)))
becomes
point1 = B_dev(1,cell)
point2 = B_dev(2,cell)
point3 = B_dev(3,cell)
point4 = B_dev(4,cell)
point_in(:) = 0.99D0 * A_dev(:, point) &
+ 0.01D0/4.0D0 * (A_dev(:,point1) + A_dev(:,point2) + A_dev(:,point3) + A_dev(:,point4))
Here A_dev is a DOUBLE PRECISION device array and B_dev is an INTEGER device array.
- Turn off optimization for the pgfortran link step in the makefile:
F90 = pgfortran -module $(MODULE) -Mmpi=mpich1 -Mcuda -Mbyteswapio
OPT = -O3 -tp nehalem-64
GPUOPT = -O3 -ta=nvidia:cuda3.2 -Mcuda=keepbin -Mcuda=keepptx -Mcuda=ptxinfo -v
LN_OPT = -O0
#
# compilation rules
#
.SUFFIXES : .CUF .c .f .f90 .mod .F90
.mod.o :
$(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.f90
.CUF.o :
$(F90) $(GPUOPT) $(INCLUDES) -o $*.o -c $*.CUF
.f90.o :
$(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.f90
.F90.o :
$(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.F90
.c.o :
$(CC) $(INCLUDES) -o $*.o -c $*.c
# $(CC) -O3 $(INCLUDES) -ffast-math -o $*.o -c $*.c
# compilation
#
$(TARGET) : $(OBJECTS_F90) $(OBJECTS_C)
$(F90) $(LN_OPT) $(OBJECTS_C) $(OBJECTS_F90) -o $@
#
If I instead set LN_OPT = -O3 (the link-step flags), the run stops with the same error. Maybe there is a bug in the pgfortran link step (strange). I will try to provide my code to you for testing.
- Change the block and grid parameters used to launch the kernel:
CALL raycast<<<ATOMIC_RAYS/BLOCK_SIZE,BLOCK_SIZE>>>(point_dev, cell_dev, simul_dev,energy_inter_dev)
with the parameters defined as
INTEGER, PARAMETER :: BLOCK_SIZE=128
INTEGER, parameter :: ATOMIC_RAYS = 14*BLOCK_SIZE
But if I increase ATOMIC_RAYS to a larger value, for example 2*14*BLOCK_SIZE, the code stops with the same error.
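If I understand the execution model correctly, a safer pattern would be to compute the number of blocks by rounding up and to guard the global thread index inside the kernel, so that the total work does not have to be an exact multiple of BLOCK_SIZE. A minimal sketch of what I mean (raycast_sketch, n_rays and the empty kernel body are placeholders, not my real code):

MODULE sketch_mod
  USE cudafor
  IMPLICIT NONE
CONTAINS
  ATTRIBUTES(GLOBAL) SUBROUTINE raycast_sketch(n_rays)
    INTEGER, VALUE :: n_rays
    INTEGER :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    IF (i > n_rays) RETURN          ! extra threads in the last block do nothing
    ! ... work for ray i would go here ...
  END SUBROUTINE raycast_sketch
END MODULE sketch_mod

PROGRAM launch_sketch
  USE cudafor
  USE sketch_mod
  IMPLICIT NONE
  INTEGER, PARAMETER :: BLOCK_SIZE = 128
  INTEGER :: n_rays, n_blocks, istat
  n_rays   = 2*14*BLOCK_SIZE + 1                       ! deliberately not a multiple of BLOCK_SIZE
  n_blocks = (n_rays + BLOCK_SIZE - 1) / BLOCK_SIZE    ! round up
  CALL raycast_sketch<<<n_blocks, BLOCK_SIZE>>>(n_rays)
  istat = cudaThreadSynchronize()
END PROGRAM launch_sketch

Is this the recommended way to size the grid?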
I do not fully understand these two parameters.
Should BLOCK_SIZE be 32, 64, 96, 128, 256, ... up to 1024? And is it also limited by the register usage?
During compilation I see the register usage reported below: 122 registers and 63 registers. Which one is the real value for my device? And what are lmem and smem? Do they limit the total number of threads?
....
ptxas info : Compiling entry function 'raycast' for 'sm_13'
ptxas info : Used 122 registers, 168+0 bytes lmem, 40+16 bytes smem, 2768 bytes cmem[0], 120 bytes cmem[1], 4 bytes cmem[14]
....
ptxas info : Compiling entry function 'raycast' for 'sm_20'
ptxas info : Used 63 registers, 8+0 bytes lmem, 72 bytes cmem[0], 2768 bytes cmem[2], 4 bytes cmem[14], 40 bytes cmem[16]
0 inform, 0 warnings, 0 severes, 0 fatal for ..cuda_fortran_constructor_1
PGF90/x86-64 Linux 11.3-0: compilation successful
And my GPU:
One CUDA device found
Device Number: 0
Device name: Tesla M2050
Compute Capability: 2.0
Number of Multiprocessors: 14
Number of Cores: 448
Max Clock Rate (kHz): 1147000
Warpsize: 32
Execution Configuration Limits
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Block Dimensions: 1024 x 1024 x 64
Maximum Threads per Block: 1024
Off-Chip Memory
Total Global Memory (B): 2817982464
Total Constant Memory (B): 65536
Maximum Memory Pitch for Copies (B): 2147483647
Integrated: No
On-Chip Memory
Shared Memory per Multiprocessor (B): 49152
Number of Registers per Multiprocessor: 32768
How should I choose the best block size and grid parameters for the kernel launch, assuming there is enough work to keep the device busy?
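My rough understanding is that the register usage limits how many threads can be resident on each multiprocessor at the same time. Below is a back-of-envelope estimate using the numbers above (63 registers per thread from the sm_20 line, 32768 registers per multiprocessor from the device query); the limit of 1536 resident threads per multiprocessor for compute capability 2.0 is something I read elsewhere, so please correct me if this sketch is wrong:

PROGRAM occupancy_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: regs_per_sm     = 32768   ! registers per multiprocessor (device query above)
  INTEGER, PARAMETER :: regs_per_thread = 63      ! ptxas figure for sm_20 above
  INTEGER, PARAMETER :: block_size      = 128
  INTEGER, PARAMETER :: max_threads_sm  = 1536    ! assumed limit for compute capability 2.0
  INTEGER :: blocks_by_regs, resident_threads
  blocks_by_regs   = regs_per_sm / (regs_per_thread * block_size)     ! blocks that fit by register count
  resident_threads = MIN(blocks_by_regs * block_size, max_threads_sm)
  PRINT *, 'Blocks per multiprocessor limited by registers:', blocks_by_regs
  PRINT *, 'Resident threads per multiprocessor           :', resident_threads
END PROGRAM occupancy_sketch

If this is right, 63 registers per thread with BLOCK_SIZE = 128 allows only 4 resident blocks (512 threads) per multiprocessor. Is that enough to hide the memory latency, or should I reduce the register usage?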
> Possible, but given the code you had posted earlier, it's more likely a programming error. I would need a reproducing example to be sure. Please feel free to send the code to PGI Customer Service (trs@pgroup.com) and ask them to send it to me.
Thanks a lot. It would be great if you could help me review the code.
I will try, though the program needs a large input file (200 MB) to run.
> Though, since it is running on the CPU, there can still be differences from running on the GPU. For example, on a CPU, if you write beyond the end of an array, the code most likely won't seg fault. You may stomp over another variable's data and cause other problems, but not seg fault. On the GPU, accessing memory even one element beyond the end of an array will trigger a seg fault. Adding array bounds checking (-Mbounds) in emulation mode should help find these errors.
I have been using -Mbounds since I read your posts in the forum a long time ago. It does not seem to be an array-bounds problem, at least not a simple bounds-checking issue.
In emulation mode, the code works and gives exactly the same results as the CPU version. With the changes described above, it now runs in CUDA, but the results are not right.
The code does contain a random-number part. I store the random seeds (6 double-precision numbers) for each thread in device memory: each thread loads its seeds when it starts and saves them back when it finishes. I do not know whether this is a wise strategy.
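What I mean is roughly the pattern below (a simplified sketch; the names raycast_rng_sketch, seeds_dev and n_threads are illustrative, not my real code):

MODULE rng_sketch_mod
  USE cudafor
  IMPLICIT NONE
CONTAINS
  ATTRIBUTES(GLOBAL) SUBROUTINE raycast_rng_sketch(seeds_dev, n_threads)
    INTEGER, VALUE :: n_threads
    DOUBLE PRECISION :: seeds_dev(6, n_threads)   ! kernel dummy arrays live in device memory
    DOUBLE PRECISION :: seed(6)                   ! private copy for this thread
    INTEGER :: tid
    tid = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    IF (tid > n_threads) RETURN
    seed(:) = seeds_dev(:, tid)     ! load this thread's 6 seeds at entry
    ! ... generate random numbers here, updating seed(:) as they are used ...
    seeds_dev(:, tid) = seed(:)     ! write the updated seeds back at the end
  END SUBROUTINE raycast_rng_sketch
END MODULE rng_sketch_mod

Each thread only touches its own column of seeds_dev, so I hope there are no race conditions, but I am not sure whether keeping the seeds in global memory like this is efficient.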
Thanks again for your help. And thanks in advance for your comments.