Cactus BenchADM crashes

I downloaded the Cactus BenchADM benchmark and followed its tutorial.txt (as well as the article “Building Cactus BenchADM with PGI accelerator compilers” by Mathew Colgrove) to build and run the code. The CPU version compiles and runs correctly. The CUDA version (StaggeredLeapfrog2_acc1.F, which came with the package) compiled correctly but crashed at run time. I then tried the other steps, acc2 and acc3, and they all behaved the same way.

I noticed that the compiler messages show
" 367, !$acc do parallel, vector(2)
371, !$acc do parallel, vector(3)"
while the tutorial documents show “vector(8)” for the same lines. I don’t know why they differ.

pgaccelinfo runs fine and the code compiles, so I assume both CUDA and the compiler are installed correctly.
I would appreciate any suggestions on what I need to do to get the code to run.

My system is RedHat 5.1, kernel 2.6.18-128.el5 x86_64 SMP
PGI 9.0.4
Tesla C1060
CUDA 2.3

The error messages are:
[tester@bra-tesladev1 PGI_Acc_benchADM]$ make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" build_acc1 run_acc1
pgfortran -fast -ta=nvidia,time -Minfo=accel -c -o objdir/StaggeredLeapfrog2_acc1.o ./src/StaggeredLeapfrog2_acc1.F
NOTE: your trial license will expire in 12 days, 11.2 hours.
NOTE: your trial license will expire in 12 days, 11.2 hours.
366, Generating copyout(adm_kzz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kyz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(lalp(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyout(adm_kyy_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kxz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kxy_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kxx_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(lgzz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgyz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgyy(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgxz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgxy(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgxx(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(adm_kzz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kzz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyy_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyy_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxy_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxy_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxx_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxx_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
367, Loop is parallelizable
371, Loop is parallelizable
375, Loop is parallelizable
Accelerator kernel generated
367, !$acc do parallel, vector(2)
371, !$acc do parallel, vector(3)
375, !$acc do vector(16)
Using register for ‘adm_kxx_stag_p’
Using register for ‘adm_kxy_stag_p’
Using register for ‘adm_kxz_stag_p’
Using register for ‘adm_kyy_stag_p’
Using register for ‘adm_kyz_stag_p’
Using register for ‘adm_kzz_stag_p’
Non-stride-1 accesses for array ‘lgxx’
Non-stride-1 accesses for array ‘lgxy’
Cached references to size [18x5x4] block of ‘lgxz’
Cached references to size [18x5x4] block of ‘lgyy’
Cached references to size [18x5x4] block of ‘lgyz’
Cached references to size [18x5x4] block of ‘lgzz’
Cached references to size [18x5x4] block of ‘lalp’
pgfortran objdir/PreLoop.o objdir/StaggeredLeapfrog1a.o objdir/StaggeredLeapfrog1a_TS.o objdir/planewaves.o objdir/teukwaves.o /cctk_ThornBindings.o objdir/StaggeredLeapfrog2_acc1.o objdir/Cactus…

/InitialiseCactus_acc.o -fast -ta=nvidia,time -Minfo=accel -Mnomain -o bin/benchADM_acc1
time bin/benchADM_acc1 BenchADM_40l_120.par

1 0101 ************************
01 1010 10 The Cactus Code V4.0
1010 1101 011
1001 100101 ************************
100011 © Copyright The Authors
0100 GNU Licensed. No Warranty

Cactus version: 4.0.b11
Parameter file: BenchADM_40l_120.par

Activating thorn Cactus…Success -> active implementation Cactus
Activation requested for
—>einstein time benchadm pugh pughreduce cartgrid3d ioutil iobasic<—
Activating thorn benchadm…Success -> active implementation benchadm
Activating thorn cartgrid3d…Success -> active implementation grid
Activating thorn einstein…Success -> active implementation einstein
Activating thorn iobasic…Success -> active implementation IOBasic
Activating thorn ioutil…Success -> active implementation IO
Activating thorn pugh…Success -> active implementation driver
Activating thorn pughreduce…Success -> active implementation reduce
Activating thorn time…Success -> active implementation time

if (recover)
Recover parameters

Startup routines
BenchADM: Register slicings
CartGrid3D: Register GH Extension for GridSymmetry
CartGrid3D: Register coordinates for the Cartesian grid
PUGH: Startup routine
IOUtil: Startup routine
IOBasic: Startup routine
PUGHReduce: Startup routine.

Parameter checking routines
BenchADM: Check parameters
CartGrid3D: Check coordinates for CartGrid3D

CartGrid3D: Set up spatial 3D Cartesian coordinates on the GH
Einstein: Set up GF symmetries
Einstein: Initialize slicing, setup priorities for mixed slicings
PUGH: Report on PUGH set up
Time: Initialise Time variables
Time: Set timestep based on Courant condition
Einstein: Initialisation for Einstein methods
Einstein: Flat initial data
BenchADM: Setup for ADM
Einstein: Set initial lapse to one
BenchADM: Time symmetric initial data for staggered leapfrog
if (recover)
if (checkpoint initial data)
if (analysis)
Einstein: Compute the trace of the extrinsic curvature
Einstein: Calculate the spherical metric in r,theta(q), phi(p)
Einstein: Calculate the spherical ex. curvature in r, theta(q), phi(p)

do loop over timesteps
Rotate timelevels
iteration = iteration + 1
t = t+dt
Einstein: Identify the slicing for the next iteration
BenchADM: Evolve using Staggered Leapfrog
if (checkpoint)
if (analysis)
Einstein: Compute the trace of the extrinsic curvature
Einstein: Calculate the spherical metric in r,theta(q), phi(p)
Einstein: Calculate the spherical ex. curvature in r, theta(q), phi(p)
Termination routines
PUGH: Termination routine
Shutdown routines

Driver provided by PUGH

INFO (IOBasic): I/O Method ‘Scalar’ registered
INFO (IOBasic): Scalar: Output of scalar quantities (grid scalars, reductions) to ASCII files
INFO (IOBasic): I/O Method ‘Info’ registered
INFO (IOBasic): Info: Output of scalar quantities (grid scalars, reductions) to screen
INFO (BenchADM): Evolve using the ADM system
INFO (BenchADM): with staggered leapfrog
INFO (CartGrid3D): Grid Spacings:
INFO (CartGrid3D): dx=>8.4033613e-03 dy=>8.4033613e-03 dz=>8.4033613e-03
INFO (CartGrid3D): Computational Coordinates:
INFO (CartGrid3D): x=>[-0.500, 0.500] y=>[-0.500, 0.500] z=>[-0.500, 0.500]
INFO (CartGrid3D): Indices of Physical Coordinates:
INFO (CartGrid3D): x=>[0,119] y=>[0,119] z=>[0,119]
INFO (PUGH): Single processor evolution
INFO (PUGH): 3-dimensional grid functions
INFO (PUGH): Size: 120 120 120
INFO (Einstein): Setting flat Minkowski space in Einstein
INFO (IOBasic): Info: Output every 10 iterations
INFO (IOBasic): Info: Output requested for EINSTEIN::gxx EINSTEIN::alp

it | | EINSTEIN::gxx | EINSTEIN::alp |
| t | minimum | maximum | minimum | maximum |

0 | 0.000 | 1.00000000 | 1.00000000 | 1.00000000 | 1.00000000 |
call to ctxSynchronize returned error 700: Launch failed

Accelerator Kernel Timing data
366: region entered 1 time
time(us): init=1
375: kernel launched 1 times
grid: [59x40] block: [16x3x2]
time(us): total=0 max=0 min=0 avg=0
1: region entered 1 time
time(us): init=51061
Command exited with non-zero status 1
1.12user 0.66system 0:01.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+183167minor)pagefaults 0swaps
make: *** [run_acc1] Error 1

Hi Skimmed,

A “ctxSynchronize returned error 700” error typically means that there was an access violation when copying the data over to the device. I’m not sure exactly why this is occurring. Your -Minfo output looks correct (the vector message is just a difference between 9.0-4 and 9.0-3, which is what I used to write the tutorial).
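
For what it’s worth, the vector sizes in the -Minfo output are consistent with the launch geometry printed in the timing data (grid: [59x40], block: [16x3x2]). Here is a small Python sketch of that arithmetic. It assumes the loops at lines 367, 371 and 375 each run over 2..n-1 with n=120; that is inferred from the copyout bounds and SIZE=120, not taken from the source file, and the variable names are only illustrative.

```python
import math

n = 120                     # SIZE=120 from the make command
trip = (n - 1) - 2 + 1      # ASSUMED loop bounds 2..n-1, i.e. 118 iterations

# Vector sizes reported by -Minfo for source lines 367, 371 and 375.
vec_367, vec_371, vec_375 = 2, 3, 16

# The block shape is the three vector sizes, innermost loop first.
block = (vec_375, vec_371, vec_367)
threads_per_block = vec_375 * vec_371 * vec_367

# The two "parallel" loops are split across the grid in chunks of their vector size.
grid = (math.ceil(trip / vec_367), math.ceil(trip / vec_371))

print("block:", block)
print("grid:", grid)
print("threads per block:", threads_per_block)
```

If the assumption about the loop bounds holds, this reproduces the [59x40] grid and [16x3x2] block from the log, so the different vector sizes by themselves do not point to a problem.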

The first thing I’d try is to reboot your system. I’ve seen a few cases where the device driver gets messed up and starts giving odd errors like this.

Next, set “NVDEBUG=1” in your environment. This prints a lot of debug information, but it should show exactly which variable is causing the crash.

Also, try one of the smaller examples found in “$PGI/linux86-64/9.0-4/etc/samples”. If these fail as well, then I’m leaning towards a system issue rather than a compiler issue.

  • Mat

Thanks, Mat.

A reboot eventually sorted things out and now the code runs. However, I noticed that compared with your results, my data transfer time (data=27132909 us vs. 7112575 us) is almost four times as long. Is there a way to improve on this by tuning compiler options, or is it limited by the hardware?

Accelerator Kernel Timing data
369: region entered 100 times
time(us): total=35310202 init=99 region=35310103
kernels=8177194 data=27132909
w/o init: total=35310103 max=382721 min=351194 avg=353101
410: kernel launched 100 times
grid: [118x15] block: [8x32]
time(us): total=8177194 max=82600 min=81376 avg=81771
1: region entered 1 time
time(us): init=51528
54.93user 8.19system 1:03.30elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (17major+2205915minor)pagefaults 0swaps
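
The figures in the timing output above can be broken down with a few lines of Python. This is only a sketch: the numbers are copied from this thread, and the variable names are made up for illustration.

```python
# Values copied from the "Accelerator Kernel Timing data" above (microseconds).
region_us = 35310103        # region time without init
kernels_us = 8177194        # time spent in the kernel
data_us = 27132909          # time spent moving data
reference_data_us = 7112575 # data time quoted from the reference run
entries = 100               # the region was entered 100 times

data_share = data_us / region_us
kernel_share = kernels_us / region_us
slowdown = data_us / reference_data_us
data_per_entry_ms = data_us / entries / 1000.0

print(f"data transfer: {data_share:.1%} of region time")
print(f"kernel:        {kernel_share:.1%} of region time")
print(f"data time vs. reference run: {slowdown:.2f}x")
print(f"data time per region entry:  {data_per_entry_ms:.1f} ms")
```

Roughly three quarters of the region time goes to data movement rather than to the kernel, so the gap to the reference run sits on the transfer side.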

Check which PCIe slot your card is plugged into. I had a similar issue when I had a card in a slot with an x4 link instead of an x16 link. You might need to check your motherboard documentation to determine which PCIe slot is which. Most likely the PCIe slots closest to the CPU are the x16 ones.

  • Mat

Thanks very much for your help.
The machine (a Dell Precision 690) has two PCIe x16 slots, which are occupied by a Tesla C1060 and a Quadro FX 1400. No matter which slot the Tesla was in, I got exactly the same results.
The CUDA bandwidth test showed 1300 MB/s for uploads and 988 MB/s for downloads, which is very slow.
It appears to be a configuration issue, but at the moment I have no clue how to solve it.
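
As a rough sanity check, the measured figures can be compared with theoretical PCIe link peaks. The sketch below assumes a first-generation PCIe link at about 250 MB/s per lane per direction. Whether that matches the slots in this machine is an assumption, not something confirmed in this thread, and the helper name `peak` is just for illustration.

```python
# Theoretical PCIe 1.x peak: about 250 MB/s per lane, per direction.
# ASSUMPTION: the slots are PCIe 1.x; this is not confirmed in the thread.
MB_PER_LANE = 250

# MB/s, as reported by the bandwidth test above.
measured = {"upload": 1300, "download": 988}

def peak(lanes):
    """Theoretical one-direction peak in MB/s for a link with this many lanes."""
    return lanes * MB_PER_LANE

for name, mbps in measured.items():
    for lanes in (4, 8, 16):
        share = mbps / peak(lanes)
        print(f"{name}: {mbps} MB/s is {share:.1%} of the x{lanes} peak ({peak(lanes)} MB/s)")
```

Under that assumption, the 1300 MB/s upload figure is above what an x4 link could carry but only about a third of the x16 peak, so the numbers look more like a link that is running narrower than x16.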

Hey Skimmed,

Can you check if your machine has the ‘optional graphics riser card’ installed?

If the riser card is installed, I think the next step to try would be to remove both graphics cards, remove the riser and install only the Tesla card in the first (x16) slot. I would also remove any other PCI or PCIe expansion cards you may have installed. This probably means you will have to remotely log into the machine since it doesn’t look like it has an onboard GPU.

Hi Dholt

The machine did indeed have the riser card installed, and it does not have an onboard graphics card. Since it is meant to be a development workstation, I will probably have to live with the less-than-ideal bandwidth until a new machine can be found. It’s a shame that such a beefy machine supports only one x16 channel, and that Dell’s setup leads us to believe there are two.

Thanks again for all the help.