qacct and IB (mvapich)

I discovered that the CPU usage (cpu = stime + utime) returned by qacct (accounting under Rocks/SGE) is off for PGI/MPI jobs linked and run w/ IB (InfiniBand, mvapich).

I built and ran the same code and test case on our Rocks/SGE cluster w/ mpich and mvapich, respectively. I then ran qacct -j on the two jobs and got utime of 0.622 vs 59.9, which does not make sense (both versions ran OK).
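For reference, this is roughly how I pull the numbers out of the accounting records (a minimal sketch in Python; the field names assume the usual SGE record printed by qacct -j, i.e. ru_wallclock, ru_utime, ru_stime and cpu, and the job ids here are hypothetical):

    #!/usr/bin/env python
    # Minimal sketch: pull ru_wallclock / ru_utime / ru_stime / cpu out of
    # "qacct -j <jobid>" and print them side by side for two jobs.
    import subprocess

    FIELDS = ("ru_wallclock", "ru_utime", "ru_stime", "cpu")

    def qacct_times(jobid):
        out = subprocess.check_output(["qacct", "-j", str(jobid)]).decode()
        times = {}
        for line in out.splitlines():
            parts = line.split(None, 1)
            # if qacct prints several records, this keeps the last one seen
            if len(parts) == 2 and parts[0] in FIELDS:
                try:
                    times[parts[0]] = float(parts[1])
                except ValueError:
                    pass
        return times

    # hypothetical job ids for the mpich and mvapich runs
    for jobid in ("1234", "1235"):
        print(jobid, qacct_times(jobid))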

Does anybody have a clue what is going on?

S.

Hi S.

Sorry, but I have no idea. This might be a question better addressed by your local IT support. Maybe the times are correct and the mvapich build is not configured correctly? Are the times repeatable? Do other programs show similar behavior?

  • Mat

The times are dead wrong, and the problem is observed with more than one code - I've built and run the High Performance Computing Linpack Benchmark (HPL) ver 2.0.

I’ve tested several versions: compiled w/ the gnu, intel and pgi compilers, without and with IB support (mpich or mvapich, respectively). In all cases, I used the vendor-provided libs (Intel Cluster Studio, PGI Cluster Dev Kit).

What I get for the two test cases is:

job type        compiler  nCPUs  wallclock    utime      stime         cpu

1x16384-2x2     gnu          4   182.467    705.975     23.394     729.369  
1x16384-2x2     gnu+ib       4   183.783      0.004      0.002       0.006  
1x16384-2x2     gnu-v143     4   183.083    706.586     25.238     731.825  
1x16384-2x2     intel        4   217.317    861.706      6.939     868.645  
1x16384-2x2     intel+ib     4   216.783    859.793      7.002     866.795  
1x16384-2x2     pgi          4   391.117    200.263     62.671     262.934  
1x16384-2x2     pgi+ib       4   339.917      0.009      0.006       0.016  

1x16384-16x16   gnu        256   255.483   1049.666   5070.615    6120.280  
1x16384-16x16   gnu+ib     256   233.233      0.006      0.018       0.024  
1x16384-16x16   gnu-v143   256    11.450    175.701     88.706     264.407  
1x16384-16x16   intel      256   170.717  22349.205  21290.297   43639.502  
1x16384-16x16   intel+ib   256   555.383  99525.704  40159.900  139685.604  
1x16384-16x16   pgi        256   176.467     11.895     42.559      54.454  
1x16384-16x16   pgi+ib     256    14.783      0.016      0.045       0.060

The pgi+ib stime, utime and cpu can’t be right, and they appear as far off as the gnu+ib cases… The utime should not change much whether using IB or not. With wacko qacct results, it is almost impossible to measure the efficiency of using IB.
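What I was hoping to do is a simple sanity check along these lines (a rough sketch; it assumes the qacct times are all in seconds and just uses the 4-CPU gnu rows from the table above):

    # With sane accounting, cpu =~ utime + stime, and a rough "CPU efficiency"
    # is cpu / (nCPUs * wallclock). The +ib row makes that number meaningless.
    rows = [
        # (label, nCPUs, wallclock, utime, stime, cpu)
        ("gnu",    4, 182.467, 705.975, 23.394, 729.369),
        ("gnu+ib", 4, 183.783,   0.004,  0.002,   0.006),
    ]
    for label, ncpus, wall, utime, stime, cpu in rows:
        eff = cpu / (ncpus * wall)
        print("%-8s cpu=%8.3f  utime+stime=%8.3f  efficiency=%.4f"
              % (label, cpu, utime + stime, eff))

The non-IB run comes out at an efficiency of about 1.0, while the +ib run comes out at essentially 0, which is why I say the accounting, not the run, is what's broken.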

Any idea what may cause this?