While the benchmark is running, the nvidia-smi command shows barely any usage: ~45 W / 300 W power draw, 0% GPU-Util, ~2400 / ~16000 MiB memory. Could it be that the benchmark isn’t even using the GPUs? Or is nvidia-smi not the right way to check that?
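For reference, this is roughly the kind of sampling I mean, run in a second terminal while xhpl is going (the query fields are just examples; a reasonably recent nvidia-smi is assumed):

# sample power, utilization and memory once per second
nvidia-smi --query-gpu=index,power.draw,utilization.gpu,memory.used,memory.total --format=csv -l 1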
Since I’m not getting any kind of warning or error message (every result says PASSED, too), I don’t know what to do next or where to change settings (HPL.dat? run_linpack?).
If you need more information to assist me, I will gladly provide it.
You would need to make sure your MPI is configured to use InfiniBand.
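If it is Open MPI (an assumption on my part; adjust for whatever MPI you actually built against), something along these lines restricts the transports so the job fails instead of silently falling back to TCP, and lets you check that InfiniBand support is built in at all:

# restrict Open MPI to the InfiniBand (openib), shared-memory (vader) and self transports;
# add your usual hostfile/scheduler options (e.g. -np 28 for 7 nodes x 4 ranks)
mpirun --mca btl openib,self,vader -np 28 ./run_linpack

# verify the openib component is available in your Open MPI build
ompi_info | grep -i openib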
Also, 0.4TF is low for the single node score. Honestly, I can’t figure out how you got 3.5TF across 7 nodes if you only got 0.4TF on a single node.
You’ll need to use a much larger problem size to get full perf out of 4 P100 GPUs in a single node, and it’s very likely that the 64GB main memory per node will be a limiting factor here.
I would start by focusing on getting the most performance from a single node. What is the largest problem size you can run with a 2x2 grid?
Can you get higher performance (than 0.4TF) if you just run on a single GPU? (so P=Q=1 in that case)
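For reference, those knobs live in HPL.dat; a single-GPU (P=Q=1) trial only needs to touch roughly these lines (the N and NB values below are placeholders, keep whatever your distribution’s README suggests):

1            # of problems sizes (N)
40000        Ns           <- raise this until you run out of system memory
1            # of NBs
768          NBs          <- placeholder block size
1            # of process grids (P x Q)
1            Ps           <- P=Q=1: one MPI rank driving one GPU
1            Qs

The number of MPI ranks then has to match the grid, i.e. mpirun -np 1 ./run_linpack.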
HPL (High-Performance Linpack) is outside my area of expertise.
However, the measured 0.4 TF per node in Linpack combined with the fact that nvidia-smi reports 0% GPU utilization and 45 W power consumption for each GPU strongly suggests that the GPUs are not being used.
While it is not clear what system memory requirements your day-to-day workloads have, the system memory seems undersized for a general-purpose HPC node, as txbob says. You would want 4-8 GB per CPU core, and your system has 20 CPU cores per node. Also, the system-memory-to-GPU-memory ratio should be 2:1 to 4:1, and you have 64 GB of GPU memory per node.
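Putting numbers on those two rules of thumb for your nodes (20 cores, 4x 16 GB P100, i.e. 64 GB of GPU memory):

# rough system memory sizing per node by the two rules of thumb above
awk 'BEGIN {
    cores = 20; gpu_mem_gb = 64
    printf "4-8 GB per core:          %d-%d GB\n", 4*cores, 8*cores
    printf "2:1-4:1 vs. GPU memory:   %d-%d GB\n", 2*gpu_mem_gb, 4*gpu_mem_gb
}'

Either rule puts you well above the 64 GB that is installed.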
For the 1x1 run I did change the values of CPU_CORES_PER_GPU, CUDA_DGEMM_SPLIT and CUDA_DTRSM_SPLIT in run_linpack slightly, though.
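For anyone wondering where those settings live: run_linpack sets them as environment variables before launching xhpl. The excerpt below is only a sketch with example values, not the exact numbers I used:

# excerpt from run_linpack (example values only)
export CPU_CORES_PER_GPU=5        # 20 cores / 4 GPUs per node
export CUDA_DGEMM_SPLIT=0.80      # fraction of DGEMM work handed to the GPU
export CUDA_DTRSM_SPLIT=0.70      # fraction of DTRSM work handed to the GPU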
Using the method linked below to measure utilization etc., I always got the same symptoms when I started the benchmark: power draw increases just a little (idle 33 W to 45 W), 0% GPU utilization with just a small peak of < 10% in the first 1 or 2 seconds, and a small increase in memory usage (0 MiB to 2500 MiB).
@njuffa:
My university purchased that cluster last year; I’m just using it now. Except for 8 so-called fat nodes with 256 GB memory each (unfortunately without GPUs), all ~300+ nodes have only 64 GB memory.
To get higher scores, you need to push N higher (as high as it will go). If you search around, you can find rules of thumb for how to compute the max N for a given machine (a given system memory size), but for the 1x1 and 2x2 cases you can just use trial and error.
If N=85000 is the highest you can go, then that would indicate that system memory is the limiting factor here.
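A quick way to get a ballpark figure for that rule of thumb: the HPL matrix is N x N doubles (8 bytes each), and you want to leave some headroom for the OS and HPL’s own buffers, so assuming roughly 80% of RAM is usable (the exact fraction is a judgment call):

# largest N such that an N x N matrix of doubles fits in ~80% of 64 GB
awk 'BEGIN { print int(sqrt(0.80 * 64 * 2^30 / 8)) }'     # prints ~82897

which is in the same ballpark as that N=85000.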
I am sorry to read that. Maybe the sudden increase in DRAM prices caught them by surprise and they had to cut system memory size because of it. IMHO, 256 GB per node would be optimal given other system specs.
Yes, with a small problem size (N), you are doing very little work overall, and the GPU isn’t doing much. This should be evident from the low score. For a decent run, you want an N that is well over 100000.
64GB system memory is just too small to be interesting for GPU-accelerated HPL. The sizable memory allocation (gigabytes) means the GPUs are being used during this test, just not to their capacity/capability.
And it makes sense that 88000 would top out a 64GB config.
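Quick check on that: an 88000 x 88000 matrix of doubles is

awk 'BEGIN { printf "%.1f GiB\n", 88000^2 * 8 / 2^30 }'   # ~57.7 GiB

i.e. roughly 90% of 64 GiB, leaving only a few GB for the OS and everything HPL allocates besides the matrix.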
You can get plenty of CUDA work done with 64 GB of system memory. However, if your plan was to run at maximum performance using all four Tesla P100s at the same time, you may find the small system memory to be a limiting factor more often than you care for. I learned the hard way that skimping on system memory is not the way to go.
What kind of GPU-accelerated workloads do you anticipate running? For most well-known HPC applications there are hardware recommendations, including system memory size, so check the documentation of whatever apps you plan to run.
For guidance on well-balanced GPU-accelerated HPC nodes, one could look to NVIDIA’s DGX-1 (2x E5-2698 with 40 CPU cores, 8x P100 with 128 GB of GPU memory, 512 GB system memory) or the nodes of the upcoming Summit supercomputer (2x POWER9 with 44 CPU cores, 4x V100 with 64 GB of GPU memory, 512 GB system memory).
I did not bother to repeat that statement since you had already linked that thread. However, you cannot and will not get “full” performance out of any newer GPU using this particular HPL distribution. That said, I assume that you are simply questioning the results you are getting and what the limiting factors may be. I believe one possible limiting factor is (system) memory size.
I did some more testing, including using the nvprof command to see what the GPUs are doing. I have to admit I don’t really understand its output (or whether it’s feasible to use in my case at all), but maybe someone here can help me understand it.
I’m especially curious about lines 9, 19, 29, 39 in the second code block, i.e. the [CUDA memcpy HtoD] rows: shouldn’t there be a value for “Grid Size” and “Block Size”?
[(...)@gpu08 CUDA]$ nvprof --profile-child-processes --print-gpu-trace mpirun -np 4 ./run_linpack
==7381== NVPROF is profiling process 7381, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7383== NVPROF is profiling process 7383, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7379== NVPROF is profiling process 7379, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7382== NVPROF is profiling process 7382, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7383== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7383== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
2.91844s 1.6000us - - - - - 112B 66.757MB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
==7379== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7379== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
2.95164s 1.5040us - - - - - 112B 71.018MB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
==7382== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7382== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
2.94910s 1.5680us - - - - - 112B 68.120MB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
==7381== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7381== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
2.98803s 1.6960us - - - - - 112B 62.978MB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
I just don’t want to overlook anything that could help figure out what the limiting factor is.