I’m trying to reproduce the CUDA-Q multi-GPU setting where the state vector is too large to fit into 1 GPU and multiple GPUs are needed for a single circuit simulation, as described in the 'Batching Hamiltonian Terms' section of the CUDA-Q documentation.
About a year ago you gave a presentation at NERSC,
nersc-quantum-day/demo/multinode.script at master · poojarao8/nersc-quantum-day · GitHub
describing how to do it on the Perlmutter supercomputer managed by Slurm.
I can’t get it to work.
This is the GHZ circuit I want to run; I only change the number of qubits, to make the state vector too big to fit in one A100 GPU with 80 GB of memory.
$ cat ghz_big.py
import cudaq
import os

def ghz_state(N):
    kernel = cudaq.make_kernel()
    q = kernel.qalloc(N)
    kernel.h(q[0])
    for i in range(N - 1):
        kernel.cx(q[i], q[i + 1])
    kernel.mz(q)
    return kernel

n = 33
myRank = os.environ['myRank']
print("My Rank %s GHZ state for %d qubits" % (myRank, n))
kernel = ghz_state(n)
cudaq.set_target("nvidia-mgpu")
counts = cudaq.sample(kernel, shots_count=2)
counts.dump()
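For context, this is my back-of-the-envelope memory math behind that qubit count (my own sanity check, assuming the default complex128 amplitudes at 16 bytes each):

# state_mem.py -- sanity check (my addition): a state vector holds 2^N amplitudes
for n in (32, 33):
    gib = (2 ** n * 16) / 2 ** 30  # complex128 = 16 bytes per amplitude
    print("%d qubits -> %d GiB" % (n, gib))
# 32 qubits -> 64 GiB (fits in one 80 GB A100); 33 qubits -> 128 GiB (does not)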
I also use a podman-hpc image containing your software stack.
This is how I allocate a single Perlmutter node with 4 GPUs:
salloc -q interactive -C gpu -t 4:00:00 -A nstaff --gpus-per-task=1 --ntasks 4 --gpu-bind=none --module=cuda-mpich
And I see 4 GPUs:
balewski@nid200389:$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:03:00.0 Off | 0 |
| N/A 28C P0 62W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:41:00.0 Off | 0 |
| N/A 26C P0 60W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:82:00.0 Off | 0 |
| N/A 27C P0 62W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:C1:00.0 Off | 0 |
| N/A 27C P0 61W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
To run it inside the podman container I need a wrapper script that passes the task's rank index into the container:
$ cat run_big.sh
#!/bin/bash -l
WORK_DIR=xxxxxx
IMG=balewski/cudaquanmpi-qiskit:j1
podman-hpc run --gpu -i --volume $WORK_DIR:/issues --workdir /issues -e myRank=$SLURM_PROCID $IMG <<EOF
nvidia-smi | grep A100-SXM
python3 ghz_big.py
EOF
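For debugging, this is a minimal per-rank probe I can drop into the wrapper (my own addition; it assumes cudaq.num_available_gpus() is available in this CUDA-Q build):

# probe.py -- report what each rank sees inside the container
import os
import cudaq

rank = os.environ.get('myRank', '?')
print("rank %s: cudaq sees %d GPU(s)" % (rank, cudaq.num_available_gpus()))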
=== Test 1 ===
Set the number of qubits to 32 and run on a single GPU:
balewski@nid200389:$ srun -l -n 1 run_big.sh
0: | 0 NVIDIA A100-SXM... Off | 00000000:03:00.0 Off | 0 |
0: WARNING: Running on only 1 GPU.
0: Run the program with 'mpirun -np N' to run on N GPUs.
0: { 00000000000000000000000000000000:1 11111111111111111111111111111111:1 }
0: My Rank 0 GHZ state for 32 qubits
Everything works.
=== Test 2 ===
Set the number of qubits to 33. This is too much for 1 GPU, so I added a second task/GPU. Each task now sees 2 GPUs, but the job crashes:
balewski@nid200389:$ srun -l -n 2 run_big.sh
0: [0/1] error: failed to create sub statevector
0: [0/1] error: failed to create sub statevector
0: | 0 NVIDIA A100-SXM... Off | 00000000:03:00.0 Off | 0 |
0: | 1 NVIDIA A100-SXM... Off | 00000000:41:00.0 Off | 0 |
0: My Rank 0 GHZ state for 33 qubits
0: Run the program with 'mpirun -np N' to run on N GPUs.
0: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
0: WARNING: Running on only 1 GPU.
1: [0/1] error: failed to create sub statevector
1: [0/1] error: failed to create sub statevector
1: | 0 NVIDIA A100-SXM... Off | 00000000:03:00.0 Off | 0 |
1: | 1 NVIDIA A100-SXM... Off | 00000000:41:00.0 Off | 0 |
1: My Rank 1 GHZ state for 33 qubits
1: Run the program with 'mpirun -np N' to run on N GPUs.
1: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
1: WARNING: Running on only 1 GPU.
srun: error: nid200389: task 0: Exited with exit code 1
srun: error: nid200389: task 1: Exited with exit code 1
srun: Terminating StepId=27283833.2
Can you help me, please? Why do I see the warning about running on only 1 GPU if each rank sees 2 GPUs?
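For reference, this is the variant I plan to try next, initializing MPI explicitly inside the script (a sketch based on my reading of the CUDA-Q Python API; the cudaq.mpi module is assumed to be available in this build):

# ghz_big_mpi.py -- same GHZ circuit, with explicit MPI initialization per rank
import cudaq

cudaq.set_target("nvidia-mgpu")
cudaq.mpi.initialize()  # assumed API: ties each Slurm task into the MPI job

def ghz_state(N):
    kernel = cudaq.make_kernel()
    q = kernel.qalloc(N)
    kernel.h(q[0])
    for i in range(N - 1):
        kernel.cx(q[i], q[i + 1])
    kernel.mz(q)
    return kernel

n = 33
print("rank %d of %d" % (cudaq.mpi.rank(), cudaq.mpi.num_ranks()))
counts = cudaq.sample(ghz_state(n), shots_count=2)
if cudaq.mpi.rank() == 0:  # only rank 0 prints the merged histogram
    counts.dump()
cudaq.mpi.finalize()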