I discovered that the CPU usage (cpu = stime + utime) returned by qacct (accounting under Rocks/SGE) is off for PGI/MPI jobs linked and run w/ IB (InfiniBand, mvapich).
I built and ran the same code and test case on our Rocks/SGE cluster w/ mpich and mvapich, respectively. I then ran qacct -j on the two jobs and got utime of 0.622 vs 59.9, which does not make sense (both versions ran OK).
Does anybody have a clue what is going on?
S.
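For anyone wanting to compare two jobs the same way, here is a minimal sketch of pulling the timing fields out of `qacct -j` output. It assumes the usual one-field-per-line layout with the field names ru_wallclock, ru_utime, ru_stime and cpu; the job names in the excerpts are hypothetical, and only the two utime values are taken from the post above.

```python
# Accounting fields printed by `qacct -j <job_id>` that matter here
# (assumed layout: field name, whitespace, value, one field per line).
FIELDS = ("ru_wallclock", "ru_utime", "ru_stime", "cpu")


def parse_qacct(text):
    """Extract the fields of interest from `qacct -j` output as floats (seconds)."""
    record = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in FIELDS:
            # Defensive: drop a trailing unit suffix such as "s" if present.
            record[parts[0]] = float(parts[1].rstrip("s"))
    return record


# Hypothetical excerpts: only the ru_utime values (0.622 and 59.9) come from
# the post above; the job names are made up for illustration.
job_a = "jobname      test_a\nru_utime     0.622\n"
job_b = "jobname      test_b\nru_utime     59.9\n"

utime_a = parse_qacct(job_a)["ru_utime"]
utime_b = parse_qacct(job_b)["ru_utime"]
print(f"utime ratio between the two builds: {utime_b / utime_a:.1f}x")
```

On a real cluster the text could come from something like subprocess.run(["qacct", "-j", job_id], capture_output=True, text=True).stdout. If a job has several accounting records, this sketch keeps only the last one.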
Hi S.
Sorry, but I have no idea. Though, this might be a question better addressed by your local IT support? Maybe the times are correct and the mvapich build is not configured correctly? Are the times repeatable? Do other programs show similar behavior?
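One rough way to check repeatability independently of qacct is to time a command a few times from a wrapper and look at the user/system CPU time charged to its child processes. The sketch below uses Python's resource module (Unix only) and a stand-in workload; the command and the number of repeats are placeholders, not anything from the original jobs.

```python
import resource
import subprocess
import sys


def child_cpu_times(cmd):
    """Run cmd and return the (utime, stime) seconds it added to our waited-for children."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return after.ru_utime - before.ru_utime, after.ru_stime - before.ru_stime


# Stand-in workload (assumption): replace with the real launcher command.
cmd = [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"]

runs = [child_cpu_times(cmd) for _ in range(3)]
for utime, stime in runs:
    print(f"utime={utime:.3f}s stime={stime:.3f}s cpu={utime + stime:.3f}s")
```

Note that these numbers only cover processes that are descendants of the wrapper and have been waited for. Ranks started on other nodes are not children of the local process and would not show up here.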
The times are dead wrong, and the problem is observed with more than one code: I’ve built and run the High Performance Computing Linpack Benchmark (HPL) ver 2.0.
I’ve tested several versions: compiled w/ gnu, intel and pgi compilers, w/out and w/ IB support (mpich or mvapich). In all cases, I used the vendor-provided libs (Intel Cluster Studio, PGI Cluster Dev Kit).
What I get for the two test cases is:
job type       compiler  nCPUs  wallclock      utime      stime         cpu
1x16384-2x2    gnu           4    182.467    705.975     23.394     729.369
1x16384-2x2    gnu+ib        4    183.783      0.004      0.002       0.006
1x16384-2x2    gnu-v143      4    183.083    706.586     25.238     731.825
1x16384-2x2    intel         4    217.317    861.706      6.939     868.645
1x16384-2x2    intel+ib      4    216.783    859.793      7.002     866.795
1x16384-2x2    pgi           4    391.117    200.263     62.671     262.934
1x16384-2x2    pgi+ib        4    339.917      0.009      0.006       0.016
1x16384-16x16  gnu         256    255.483   1049.666   5070.615    6120.280
1x16384-16x16  gnu+ib      256    233.233      0.006      0.018       0.024
1x16384-16x16  gnu-v143    256     11.450    175.701     88.706     264.407
1x16384-16x16  intel       256    170.717  22349.205  21290.297   43639.502
1x16384-16x16  intel+ib    256    555.383  99525.704  40159.900  139685.604
1x16384-16x16  pgi         256    176.467     11.895     42.559      54.454
1x16384-16x16  pgi+ib      256     14.783      0.016      0.045       0.060
The pgi+ib stime, utime and cpu can’t be right, and appear as off as the gnu+ib cases. The utime cannot change whether using IB or not. W/ wacko qacct results, it is almost impossible to measure the efficiency of using the IB.
Any idea what may cause this?
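To make the mismatch concrete, here is a small sketch that takes the rows from the table above and computes what fraction of the available CPU time (nCPUs x wallclock) the reported cpu value accounts for. The cutoff used to flag a row is an arbitrary assumption for illustration, not an SGE rule.

```python
# (job type, compiler, nCPUs, wallclock, utime, stime, cpu), copied from the table.
ROWS = [
    ("1x16384-2x2", "gnu", 4, 182.467, 705.975, 23.394, 729.369),
    ("1x16384-2x2", "gnu+ib", 4, 183.783, 0.004, 0.002, 0.006),
    ("1x16384-2x2", "gnu-v143", 4, 183.083, 706.586, 25.238, 731.825),
    ("1x16384-2x2", "intel", 4, 217.317, 861.706, 6.939, 868.645),
    ("1x16384-2x2", "intel+ib", 4, 216.783, 859.793, 7.002, 866.795),
    ("1x16384-2x2", "pgi", 4, 391.117, 200.263, 62.671, 262.934),
    ("1x16384-2x2", "pgi+ib", 4, 339.917, 0.009, 0.006, 0.016),
    ("1x16384-16x16", "gnu", 256, 255.483, 1049.666, 5070.615, 6120.280),
    ("1x16384-16x16", "gnu+ib", 256, 233.233, 0.006, 0.018, 0.024),
    ("1x16384-16x16", "gnu-v143", 256, 11.450, 175.701, 88.706, 264.407),
    ("1x16384-16x16", "intel", 256, 170.717, 22349.205, 21290.297, 43639.502),
    ("1x16384-16x16", "intel+ib", 256, 555.383, 99525.704, 40159.900, 139685.604),
    ("1x16384-16x16", "pgi", 256, 176.467, 11.895, 42.559, 54.454),
    ("1x16384-16x16", "pgi+ib", 256, 14.783, 0.016, 0.045, 0.060),
]

# Arbitrary cutoff (assumption): flag rows where the reported cpu is less than
# 0.01% of the CPU time that was available to the job.
THRESHOLD = 1e-4


def cpu_fraction(ncpus, wallclock, cpu):
    """Reported cpu time as a fraction of nCPUs * wallclock."""
    return cpu / (ncpus * wallclock)


suspicious = []
for job, compiler, ncpus, wall, utime, stime, cpu in ROWS:
    frac = cpu_fraction(ncpus, wall, cpu)
    flagged = frac < THRESHOLD
    if flagged:
        suspicious.append((job, compiler))
    mark = "  <-- suspicious" if flagged else ""
    print(f"{job:<13}  {compiler:<8}  {frac:10.6f}{mark}")
```

With this cutoff, the rows flagged are the gnu+ib and pgi+ib runs in both test cases; the intel+ib runs report cpu close to nCPUs x wallclock.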