I’m running this on Fedora 20. With your update, the error is reported on line 396 of mcx_utils.c:
MCX_ASSERT(fscanf(in,"%f %f %f", &(cfg->tstart),&(cfg->tend),&(cfg->tstep) )==3,__FILE__,__LINE__);
which appears to be reading this line from qtest.inp:
0.e+00 5.e-09 5.e-9 # time-gates(s): start, end, step
To work around that, I made the following changes to mcx_utils.c:
printf("fscanf: %d\n", fscanf(in,"%f %f %f", &(cfg->tstart),&(cfg->tend),&(cfg->tstep) ));
printf("p1: %f, p2: %f, p3: %f\n", cfg->tstart, cfg->tend, cfg->tstep);
cfg->tstart = 0.e+00;
cfg->tend = 5.e-09;
cfg->tstep = 5.e-9;
// below is the original line 396, above code is added immediately prior to it
// MCX_ASSERT(fscanf(in,"%f %f %f", &(cfg->tstart),&(cfg->tend),&(cfg->tstep) )==3,__FILE__,__LINE__);
With that, the extra output from the added printf statements above looks like this:
fscanf: 1
p1: 0.000000, p2: 0.000000, p3: 0.000000
I didn’t try to debug that any further. fscanf returned 1 rather than 3, so only the first of the three fields was converted; something appears to be going wrong in the formatted input.
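One way to narrow this down would be to read the whole line first and parse it separately, so that a failure shows exactly what text the parser saw. The following is only a sketch: `parse_time_gates` and `read_time_gates` are hypothetical helpers, not part of MCX, and it assumes the stream is positioned at the start of the time-gate line (any leftover comment text from the previous line has already been consumed).

```c
#include <stdio.h>

/* Hypothetical helper (not part of MCX): parse "start end step" from one
 * line of text. Returns the number of fields converted (3 on success).
 * Trailing text such as a "# comment" is ignored by sscanf. */
int parse_time_gates(const char *line, float *tstart, float *tend, float *tstep)
{
    return sscanf(line, "%f %f %f", tstart, tend, tstep);
}

/* Hypothetical helper: read the next line from `in` and parse it.
 * On failure, echo the raw line so the text the parser saw is visible.
 * Assumes `in` is positioned at the start of the time-gate line. */
int read_time_gates(FILE *in, float *tstart, float *tend, float *tstep)
{
    char line[256];
    int n;

    if (fgets(line, sizeof line, in) == NULL) {
        fprintf(stderr, "time gates: unexpected end of input\n");
        return -1;
    }
    n = parse_time_gates(line, tstart, tend, tstep);
    if (n != 3)
        fprintf(stderr, "time gates: converted %d of 3 fields from: %s", n, line);
    return n;
}
```

Calling `read_time_gates(in, &(cfg->tstart), &(cfg->tend), &(cfg->tstep))` in place of the fscanf call would then report the offending text if the stream is not where the code expects it to be.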
Anyway, the changes above also hard-code the “correct” values for those parameters. With that, I get the following results when compiled with CUDA 7.5 on a GTX 960 (note that, as already indicated in entry 10 above, your Makefile defaults to -arch=sm_20):
$ ./run_qtest.sh
fscanf: 1
p1: 0.000000, p2: 0.000000, p3: 0.000000
autopilot mode: setting thread number to 16384, block size to 64 and time gates to 1
###############################################################################
# Monte Carlo eXtreme (MCX) -- CUDA #
# Copyright (c) 2009-2015 Qianqian Fang <q.fang at neu.edu> #
# #
# Computational Imaging Laboratory (CIL) #
# Department of Bioengineering, Northeastern University #
###############################################################################
$MCX $Rev:: $ Last Commit $Date:: $ by $Author:: fangq$
###############################################################################
- variant name: [Fermi] compiled for GPU Capability [100] with CUDA [7050]
- compiled with: RNG [Logistic-Lattice] with Seed Length [5]
- this version CAN save photons at the detectors
GPU=1 threadph=610 oddphotons=5760 np=10000000 nthread=16384 maxgate=1 repetition=1
initializing streams ... init complete : 0 ms
requesting 2560 bytes of shared memory
lauching MCX simulation for time window [0.00e+00ns 5.00e+00ns] ...
simulation run# 1 ... kernel complete: 19294 ms
retrieving fields ... detected 30045 photons, total: 30045 transfer complete: 19313 ms
data normalization complete : 19313 ms
normalizing raw data ... normalization factor alpha=20.000000
saving data to file ... 216000 1 saving data complete : 19324 ms
simulated 10000000 photons (10000000) with 16384 threads (repeat x1)
MCX simulation speed: 518.48 photon/ms
total simulated energy: 10000000.00 absorbed: 17.69411%
(loss due to initial specular reflection is excluded in the total)
real 0m20.536s
user 0m13.647s
sys 0m5.997s
$
And with CUDA 6.5 I see this:
$ ./run_qtest.sh
fscanf: 1
p1: 0.000000, p2: 0.000000, p3: 0.000000
autopilot mode: setting thread number to 16384, block size to 64 and time gates to 1
###############################################################################
# Monte Carlo eXtreme (MCX) -- CUDA #
# Copyright (c) 2009-2015 Qianqian Fang <q.fang at neu.edu> #
# #
# Computational Imaging Laboratory (CIL) #
# Department of Bioengineering, Northeastern University #
###############################################################################
$MCX $Rev:: $ Last Commit $Date:: $ by $Author:: fangq$
###############################################################################
- variant name: [Fermi] compiled for GPU Capability [100] with CUDA [6050]
- compiled with: RNG [Logistic-Lattice] with Seed Length [5]
- this version CAN save photons at the detectors
GPU=1 threadph=610 oddphotons=5760 np=10000000 nthread=16384 maxgate=1 repetition=1
initializing streams ... init complete : 0 ms
requesting 2560 bytes of shared memory
lauching MCX simulation for time window [0.00e+00ns 5.00e+00ns] ...
simulation run# 1 ... kernel complete: 16116 ms
retrieving fields ... detected 30051 photons, total: 30051 transfer complete: 16135 ms
data normalization complete : 16136 ms
normalizing raw data ... normalization factor alpha=20.000000
saving data to file ... 216000 1 saving data complete : 16147 ms
simulated 10000000 photons (10000000) with 16384 threads (repeat x1)
MCX simulation speed: 620.77 photon/ms
total simulated energy: 10000000.00 absorbed: 17.69432%
(loss due to initial specular reflection is excluded in the total)
real 0m18.357s
user 0m12.568s
sys 0m4.860s
$
(I happen to be using GPU driver 361.28)
In any event, the difference in performance is about 20% (518.48 photon/ms with CUDA 7.5 vs. 620.77 photon/ms with CUDA 6.5), not 10x. The reported absorption is approximately the same in both cases, at ~17.7%. So at the moment I’m unable to reproduce the 10x claim.
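For reference, the comparison can be checked directly from the two "MCX simulation speed" figures in the logs above. This is just a small arithmetic sketch; `percent_faster` and `print_speed_comparison` are names made up for illustration.

```c
#include <stdio.h>

/* Percentage by which `fast` exceeds `slow` (both in the same units). */
double percent_faster(double fast, double slow)
{
    return (fast / slow - 1.0) * 100.0;
}

/* Print the comparison for the two runs quoted in this post. */
void print_speed_comparison(void)
{
    const double cuda75 = 518.48; /* photon/ms, CUDA 7.5 run */
    const double cuda65 = 620.77; /* photon/ms, CUDA 6.5 run */

    printf("CUDA 6.5 is %.1f%% faster than CUDA 7.5\n",
           percent_faster(cuda65, cuda75));
}
```

The same function applied to the kernel times (19294 ms vs. 16116 ms) gives essentially the same ratio, so the gap is in the kernel itself rather than in setup or file I/O.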