Monte Carlo Example on Fermis Not Working?

I’m hoping someone can help me with an oddity I’m seeing with the Monte Carlo example. I’ve gotten access to a Fermi system so I’m learning how to use them carefully and methodically.

If I run the CUF1 example on a Tesla T10 system:

> make DFLAG=-DUSE_SMALL run_CUF1 
pgfortran -fast -c -Iinc ./src/mcUtils.F90 -o ./obj/mcUtils.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 ./src/mcCUF_1.F90 -o ./obj/mcCUF_1.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 -DMCTYPE=11 ./src/monte_drv.F90 -o ./obj/monte_drv_cuf1.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf1.o ./obj/mcUtils.o ./obj/mcCUF_1.o  -o mcCUF_1.out
time  mcCUF_1.out
 ----- CUF1 ----- 
 Result =     3.142020    
 Standard deviation =    1.0021195E-04
 Difference from real PI value =    4.2748451E-04
 Time in Seconds 
    Total :    5.07815
      RNG :    3.07867
  Compute :    0.09659
Data Xfer :    0.59191
3.49user 1.43system 0:05.16elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+197764minor)pagefaults 0swaps

That looks good. Now we run on a Fermi system:

> make DFLAG=-DUSE_SMALL run_CUF1
pgfortran -fast -c -Iinc ./src/mcUtils.F90 -o ./obj/mcUtils.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 ./src/mcCUF_1.F90 -o ./obj/mcCUF_1.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 -DMCTYPE=11 ./src/monte_drv.F90 -o ./obj/monte_drv_cuf1.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf1.o ./obj/mcUtils.o ./obj/mcCUF_1.o  -o mcCUF_1.out
time  mcCUF_1.out
 ----- CUF1 ----- 
 Result =   -1.2149596E+14
 Standard deviation =              Inf
 Difference from real PI value =    1.2149596E+14
 Time in Seconds 
    Total :    9.68955
      RNG :    3.14654
  Compute :    0.08587
Data Xfer :    0.69943
3.59user 6.02system 0:09.92elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+197882minor)pagefaults 0swaps

Any ideas why the result is so bad? And, I suppose, why the Total time has increased so much?

It is the same compiler (11.1) and same example, so the only difference is Tesla to Fermi. If it helps, examples CUF4 and CUF5 do seem to work.

Thanks,
Matt

Hi Matt,

I’m assuming this is a bug in my original code, where I wasn’t setting all values of dtemp. I found this error in July 2010, though, and updated the source package on our website soon after. Can you check whether you have my latest code?

Thanks,
Mat

I grabbed this tarball: http://www.pgroup.com/lit/samples/pginsider/pgi_mc_example.tar.gz

When I wget it, it has a date stamp of 2010-02-24 and most of the files within are around that date as well.

Well, shoot. I guess I never verified that the source really did get updated. I’ll work on getting this fixed.

The quick fix is to use sizes that are divisible by 256. So change monte_drv.F90 to use new values for N:

#if defined(USE_SMALL)
!  PARAMETER(N=16777215_4, PI=3.1415926535_4)
   PARAMETER(N=16776960_4, PI=3.1415926535_4)
#else
!  PARAMETER(N=67108860_4, PI=3.1415926535_4)
   PARAMETER(N=67108608_4, PI=3.1415926535_4)
#endif
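
To see why the size matters, here is a small host-only sketch of the arithmetic. It assumes 256 threads per block (inferred from the "divisible by 256" advice; the actual dimBlock setting in mcCUF_1.F90 is not shown in this thread) and a grid computed with truncating integer division, N/dimBlock%x. The program name and variable names are made up for illustration.

```fortran
! Host-only illustration: how many elements a grid of N/block_size
! blocks leaves without a thread. block_size = 256 is an assumption.
program grid_coverage
  implicit none
  integer, parameter :: block_size = 256
  integer, parameter :: sizes(4) = (/ 16777215, 16776960, 67108860, 67108608 /)
  integer :: k, nblocks, leftover

  do k = 1, size(sizes)
    nblocks  = sizes(k) / block_size      ! integer division truncates
    leftover = mod(sizes(k), block_size)  ! elements no thread touches
    print *, 'N =', sizes(k), ' blocks =', nblocks, ' uncovered =', leftover
  end do
end program grid_coverage
```

Under that assumption, the old small size gives 65535 full blocks with 255 elements left over, which would match the symptom of dtemp not being fully set, while the new sizes leave nothing uncovered.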

A better fix would be to launch more threads than needed and then check that the ‘i’ index in the kernel is not greater than N. That is, change “dimGrid = dim3(N/dimBlock%x,1,1)” to “dimGrid = dim3((N+dimBlock%x-1)/dimBlock%x,1,1)” and then put an if statement in the kernel to make sure i is not greater than N.
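
A minimal sketch of that pattern in CUDA Fortran follows. It is not the code from mcCUF_1.F90: the module, the kernel fill_kernel, its body, and the size n are hypothetical and chosen only to show the rounded-up grid and the bounds check.

```fortran
module guarded_kernel_sketch
  use cudafor
  implicit none
contains
  ! Hypothetical kernel: each thread sets one element, if it has one.
  attributes(global) subroutine fill_kernel(dtemp, n)
    integer, value :: n
    real, device   :: dtemp(n)
    integer :: i
    ! 1-based global index, so the last valid value is n
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) dtemp(i) = 1.0   ! surplus threads in the last block do nothing
  end subroutine fill_kernel
end module guarded_kernel_sketch

program guarded_launch
  use cudafor
  use guarded_kernel_sketch
  implicit none
  integer, parameter :: n = 1000003   ! deliberately not a multiple of 256
  real, allocatable         :: dtemp(:)
  real, device, allocatable :: dtemp_d(:)
  type(dim3) :: dimGrid, dimBlock

  allocate(dtemp(n))
  allocate(dtemp_d(n))

  dimBlock = dim3(256, 1, 1)
  ! Round the block count up so every element gets a thread
  dimGrid  = dim3((n + dimBlock%x - 1) / dimBlock%x, 1, 1)

  call fill_kernel<<<dimGrid, dimBlock>>>(dtemp_d, n)
  dtemp = dtemp_d                     ! copy the result back to the host

  print *, 'elements set:', count(dtemp == 1.0), ' of ', n
  deallocate(dtemp)
  deallocate(dtemp_d)
end program guarded_launch
```

Built with pgfortran -Mcuda, this should report every element as set even though n is not a multiple of the block size. One caveat, assuming 256-thread blocks: rounding up for N=16777215 gives 65536 blocks, one more than the 65535-block limit per grid dimension on devices of compute capability below 3.0, so the original sizes would also need a larger block or a 2D grid.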

Since the first few examples are intentionally poor implementations, though, I’ll just update N in the driver code.

Thanks,
Mat