nvidia-mgpu problem with a 33-qubit state vector

I’m trying to reproduce the CUDA-Q multi-GPU setting in which the state vector is too large to fit into one GPU, so multiple GPUs are needed for a single circuit simulation, as described in the ‘Batching Hamiltonian Terms’ section of

Multi-GPU Workflows — NVIDIA CUDA-Q documentation

About a year ago you gave a presentation at NERSC

nersc-quantum-day/demo/multinode.script at master · poojarao8/nersc-quantum-day · GitHub

describing how to do it on the Perlmutter supercomputer managed by Slurm.
I can’t get it to work.

This is the GHZ circuit I want to run. I only change the number of qubits, just to make it too big to fit in one A100 GPU with 80 GB of memory.

$ cat ghz_big.py
import cudaq
import os

def ghz_state(N):
    kernel = cudaq.make_kernel()
    q = kernel.qalloc(N)
    kernel.h(q[0])
    for i in range(N - 1):
        kernel.cx(q[i], q[i + 1])

    kernel.mz(q)
    return kernel

n = 33
myRank = os.environ['myRank']
print("My Rank %s  GHZ state for %d qubits" % (myRank, n))
kernel = ghz_state(n)
cudaq.set_target("nvidia-mgpu")
counts = cudaq.sample(kernel, shots_count=2)
counts.dump()
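For scale, here is my own back-of-the-envelope arithmetic (not from the docs): a dense state vector for n qubits holds 2^n complex amplitudes, so at double precision 33 qubits need 128 GiB, well beyond one 80 GB A100; even single-precision complex needs 64 GiB.

```python
def statevector_bytes(n_qubits, bytes_per_amp=16):
    """Memory for a dense n-qubit state vector.

    bytes_per_amp=16 assumes complex128; use 8 for complex64.
    """
    return (2 ** n_qubits) * bytes_per_amp

for n in (32, 33):
    gib = statevector_bytes(n) / 2**30
    print(f"n={n}: {gib:.0f} GiB (complex128)")
```

So even if the simulator uses single precision internally (an assumption on my part), 33 qubits should still fit comfortably on two 80 GB GPUs.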

I also use a podman-hpc image containing your software stack.

This is how I allocate a single Perlmutter node with 4 GPUs:
salloc -q interactive -C gpu -t 4:00:00 -A nstaff --gpus-per-task=1 --ntasks 4 --gpu-bind=none --module=cuda-mpich
And I see 4 GPUs:

balewski@nid200389:$ nvidia-smi 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   28C    P0    62W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   26C    P0    60W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:82:00.0 Off |                    0 |
| N/A   27C    P0    62W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    61W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

To run it inside the podman container I need a wrapper script which passes the task’s rank index into the container:

cat run_big.sh
#!/bin/bash -l
WORK_DIR=xxxxxx
IMG=balewski/cudaquanmpi-qiskit:j1
podman-hpc run  --gpu  -i  --volume $WORK_DIR:/issues  --workdir /issues -e myRank=$SLURM_PROCID  $IMG    <<EOF 
	nvidia-smi | grep A100-SXM
	python3 ghz_big.py 	
EOF
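As an aside: instead of inventing a custom `myRank` variable, I could pass Slurm’s own variables through with `-e SLURM_PROCID -e SLURM_NTASKS` and read the standard names inside the container. A small sketch of the reading side (the fallback defaults are my own, so the script also runs outside Slurm):

```python
import os

# Read rank/size from Slurm's standard exports; default to a
# single-task layout when the variables are not set (my assumption).
rank = int(os.environ.get("SLURM_PROCID", 0))
size = int(os.environ.get("SLURM_NTASKS", 1))
print(f"rank {rank} of {size}")
```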

=== Test 1===
Set the number of qubits to 32 and run on a single GPU.

balewski@nid200389:$ srun -l  -n 1 run_big.sh
 0: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
 0: WARNING: Running on only 1 GPU.
 0: Run the program with 'mpirun -np N' to run on N GPUs.
 0: { 00000000000000000000000000000000:1 11111111111111111111111111111111:1 }
 0: My Rank 0  GHZ state for 32 qubits

Everything works.

=== Test 2===
Set the number of qubits to 33. This is too much for one GPU, so I added a second GPU. Now each task sees 2 GPUs, but the job crashes:

balewski@nid200389:$ srun -l  -n 2 run_big.sh
0: [0/1] error: failed to create sub statevector
0: [0/1] error: failed to create sub statevector
0: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
0: |   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
0: My Rank 0  GHZ state for 33 qubits
0: Run the program with 'mpirun -np N' to run on N GPUs.
0: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
0: WARNING: Running on only 1 GPU.
1: [0/1] error: failed to create sub statevector
1: [0/1] error: failed to create sub statevector
1: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
1: |   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
1: My Rank 1  GHZ state for 33 qubits
1: Run the program with 'mpirun -np N' to run on N GPUs.
1: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
1: WARNING: Running on only 1 GPU.
srun: error: nid200389: task 0: Exited with exit code 1
srun: error: nid200389: task 1: Exited with exit code 1
srun: Terminating StepId=27283833.2

Can you help me, please? Why do I see a warning about 1 GPU if each rank sees 2 GPUs?


Hi @janb, does this work if you specify 4 GPUs or do you get the same error?

The initial salloc is told:
--gpus-per-task=1 --ntasks 4
which already defines 4 GPUs to be used.
In which layer of this execution pipeline do you want me to add the redundant information about 4 GPUs?

Hi @janb , In test 2 when you add a second GPU and it crashes, can you try srun -l -n 4 run_big.sh to add all 4 GPUs?

=== Test 3===
Add all 4 GPUs: srun -l -n 4 run_big.sh with n = 33.
It also failed. Logically: nq=32 fits into 1 GPU, therefore nq=33 must fit into 2 GPUs, because the state vector grows like 2^nq, so it only doubled for nq=33.
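One hypothesis of mine: every rank prints "[0/1]", which suggests each CUDA-Q process sits in its own MPI world of size 1, so no rank ever sees more than one GPU’s worth of memory. A minimal sketch of what I would try next, initializing CUDA-Q’s MPI plugin before selecting the target (`cudaq.mpi.initialize()` is in the CUDA-Q API; whether it bootstraps correctly from srun inside this container is my assumption):

```python
import cudaq

# Sketch only: join all ranks into one MPI world *before* picking the
# multi-GPU target, so the state vector can be split across ranks.
cudaq.mpi.initialize()
cudaq.set_target("nvidia-mgpu")
print(f"rank {cudaq.mpi.rank()} of {cudaq.mpi.num_ranks()}")

# ... build the GHZ kernel and call cudaq.sample() here ...

cudaq.mpi.finalize()
```

If this is right, each rank should report the full world size (e.g. "rank 2 of 4") instead of "[0/1]".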

salloc -q interactive -C gpu -t 4:00:00 -A nstaff --gpus-per-task=1 --ntasks 4 --gpu-bind=none --module=cuda-mpich

balewski@nid200269:~/prjs/2024_martin_gradient/issues> srun -l  -n 4 run_big.sh
srun: Job 27648353 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=27648353.6
1: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
1: |   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
1: |   2  NVIDIA A100-SXM...  Off  | 00000000:82:00.0 Off |                    0 |
1: |   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                    0 |
3: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
3: |   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
3: |   2  NVIDIA A100-SXM...  Off  | 00000000:82:00.0 Off |                    0 |
3: |   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                    0 |
0: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
0: |   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
0: |   2  NVIDIA A100-SXM...  Off  | 00000000:82:00.0 Off |                    0 |
0: |   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                    0 |
2: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
2: |   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
2: |   2  NVIDIA A100-SXM...  Off  | 00000000:82:00.0 Off |                    0 |
2: |   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                    0 |
2: WARNING: Running on only 1 GPU.
2: Run the program with 'mpirun -np N' to run on N GPUs.
1: WARNING: Running on only 1 GPU.
1: Run the program with 'mpirun -np N' to run on N GPUs.
0: WARNING: Running on only 1 GPU.
0: Run the program with 'mpirun -np N' to run on N GPUs.
3: WARNING: Running on only 1 GPU.
3: Run the program with 'mpirun -np N' to run on N GPUs.
0: My Rank 0  GHZ state for 33 qubits
0: [0/1] error: failed to create sub statevector
0: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
1: [0/1] error: failed to create sub statevector
1: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
1: My Rank 1  GHZ state for 33 qubits
3: [0/1] error: failed to create sub statevector
3: My Rank 3  GHZ state for 33 qubits
3: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
2: My Rank 2  GHZ state for 33 qubits
2: [0/1] error: failed to create sub statevector
2: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
0: [0/1] error: failed to create sub statevector
1: [0/1] error: failed to create sub statevector
3: [0/1] error: failed to create sub statevector
2: [0/1] error: failed to create sub statevector
srun: error: nid200269: tasks 0,3: Exited with exit code 1
srun: Terminating StepId=27648353.6
0: slurmstepd: error: *** STEP 27648353.6 ON nid200269 CANCELLED AT 2024-07-05T13:44:30 ***
srun: error: nid200269: tasks 1-2: Exited with exit code 1

=== Test 5===
Change the GPU/task assignment to 4 GPUs per task and run 1 task - it failed too.

salloc -q interactive -C gpu -t 4:00:00 -A nstaff --gpus-per-task=4 --ntasks 1 --gpu-bind=none --module=cuda-mpich

balewski@nid200308:~/prjs/2024_martin_gradient/issues> srun -l  -n 1 run_big.sh
0: |   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
0: |   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
0: |   2  NVIDIA A100-SXM...  Off  | 00000000:82:00.0 Off |                    0 |
0: |   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                    0 |
0: WARNING: Running on only 1 GPU.
0: Run the program with 'mpirun -np N' to run on N GPUs.
0: [0/1] error: failed to create sub statevector
0: My Rank 0  GHZ state for 33 qubits
0: RuntimeError: Could not allocate state vector. Too few GPUs for too many qubits.
0: [0/1] error: failed to create sub statevector
srun: error: nid200308: task 0: Exited with exit code 1
srun: Terminating StepId=27648785.0