cudaq.sample(kernel, shots=1e7): Segmentation fault

I want to use CUDA-Q to run large-scale simulations, aiming at 1e9 shots per circuit.
However, cudaq.sample(…) crashes whenever I ask for more than roughly 1M shots, even for a Bell-state circuit.
The code below works for shots=1024*1000:

got GPU, run 1024000 shots
{ 00:512936 11:511064 }
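(As a sanity check, the counts from the working run are statistically consistent with an ideal Bell state; a small sketch verifying this from the figures above:)

```python
import math

shots = 1_024_000
observed_00 = 512_936  # from the run above

# For an ideal Bell state each outcome (00 or 11) has probability 1/2,
# so the 00 count follows Binomial(shots, 0.5):
# mean = shots/2, std = sqrt(shots)/2.
mean = shots / 2
std = math.sqrt(shots) / 2
z = (observed_00 - mean) / std  # ~1.85 standard deviations: plausible
print(f"mean={mean:.0f}, std={std:.1f}, z={z:.2f}")
```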

but it crashes for shots=1024*1024:

got GPU, run 1048576 shots
Segmentation fault

This is the software stack I'm using:

# pip3 list|grep cuda
cuda-quantum              0.7.1
cupy-cuda12x              13.2.0
CUDA-Q Version 0.7.1 (https://github.com/NVIDIA/cuda-quantum 1f8dd79d46cad9b9bd0eb220eb04408a2e6beda4)

The GPU is an A100:

# nvidia-smi 
Thu Jun 20 18:17:40 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:C3:00.0 Off |                    0 |
| N/A   35C    P0    40W / 250W |    418MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Demonstrator code:

import cudaq
print(cudaq.__version__)
qubit_count = 2

@cudaq.kernel
def kernel(qubit_count: int):
    qvector = cudaq.qvector(qubit_count)
    h(qvector[0])
    for i in range(1, qubit_count):
        x.ctrl(qvector[0], qvector[i])
    mz(qvector)

print(cudaq.draw(kernel, qubit_count))

cudaq.set_target("nvidia")
shots=1024*1024 
print('got GPU, run %d shots'%shots)
result = cudaq.sample(kernel, qubit_count, shots_count=shots)
print(result)
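Until a fix is available, one workaround is to split a large shot request into chunks below the crashing threshold and merge the per-chunk histograms. A minimal pure-Python sketch of the merging logic; `sample_in_chunks` and `fake_sampler` are hypothetical helpers, with `fake_sampler` standing in for a wrapper around cudaq.sample that returns a {bitstring: count} dict:

```python
from collections import Counter

def sample_in_chunks(sampler, total_shots, max_chunk=1_000_000):
    """Split a large shot request into chunks of at most max_chunk
    shots and merge the per-chunk count dictionaries."""
    counts = Counter()
    remaining = total_shots
    while remaining > 0:
        shots = min(remaining, max_chunk)
        counts.update(sampler(shots))  # sampler returns {bitstring: count}
        remaining -= shots
    return dict(counts)

# Stand-in sampler for illustration only; with CUDA-Q this would wrap
# cudaq.sample(kernel, qubit_count, shots_count=shots).
def fake_sampler(shots):
    half = shots // 2
    return {"00": half, "11": shots - half}

print(sample_in_chunks(fake_sampler, 2_500_000))
```

The merged histogram has the same shape as a single large run, so downstream analysis does not need to change.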

Hi @janb. Thank you so much for raising this issue and using CUDA-Q. Our engineering team has fixed the issue, and the PR can be found here. Please let us know if you have any other questions.

Hi, may I ask when the PR will be merged into the master branch?
Also, we are using a Dockerfile to customize our cuda-quantum environment. Is there anything else we need to change?

This is the container file we are using with podman:
FROM ubuntu:22.04
ARG arch=x86_64

RUN apt-get update && apt-get install -y --no-install-recommends wget ca-certificates libstdc++-12-dev
RUN DISTRIBUTION=${DISTRIBUTION:-ubuntu2204} && \
    CUDA_ARCH_FOLDER=$([ "$(uname -m)" == "aarch64" ] && echo sbsa || echo x86_64) && \
    CUDA_DOWNLOAD_URL=https://developer.download.nvidia.com/compute/cuda/repos && \
    wget "${CUDA_DOWNLOAD_URL}/${DISTRIBUTION}/${CUDA_ARCH_FOLDER}/cuda-keyring_1.1-1_all.deb" && \
    dpkg -i cuda-keyring_1.1-1_all.deb && version_suffix=11-8 && \
    apt-get update && apt-get install -y --no-install-recommends \
        cuda-nvtx-${version_suffix} cuda-cudart-${version_suffix} \
        libcusolver-${version_suffix} libcublas-${version_suffix}

RUN apt-get install -y --no-install-recommends libmpich-dev
ENV MPI_PATH=/usr/lib/${arch}-linux-gnu/mpich

RUN apt-get install -y --no-install-recommends python3 python3-pip && \
    python3 -m pip install cuda-quantum
RUN cudaq_version=$(python3 -c "import cudaq; print(cudaq.__version__)" | grep -o '[0-9]\+\(\.[0-9]\+\)\+') && \
    wget https://github.com/NVIDIA/cuda-quantum/releases/download/${cudaq_version}/install_cuda_quantum.$(uname -m) && \
    chmod +x install_cuda_quantum.$(uname -m) && bash install_cuda_quantum.$(uname -m) --accept && \
    mkdir -p ~/cuda_quantum/tutorials && cd ~/cuda_quantum && tmpdir="$(mktemp -d)" && \
    wget https://github.com/nvidia/cuda-quantum/archive/refs/tags/${cudaq_version}.tar.gz && \
    tar xf ${cudaq_version}.tar.gz --strip-components 1 -C "${tmpdir}" && cp -Lr "${tmpdir}/examples" ~/cuda_quantum && \
    mv ~/cuda_quantum/examples/python/tutorials ~/cuda_quantum/tutorials && \
    rm -rf ${cudaq_version}.tar.gz "${tmpdir}" /install_cuda_quantum.$(uname -m)
RUN py_cudaq_dir="$(python3 -m pip show cuda-quantum | grep -e 'Location: .*$' | cut -d ' ' -f2)" && \
    CXX=/opt/nvidia/cudaq/bin/nvq++ bash "$py_cudaq_dir/distributed_interfaces/activate_custom_mpi.sh"

# Additional tools for development

RUN apt-get install -y --no-install-recommends vim jq libpython3-dev
RUN python3 -m pip install numpy matplotlib
RUN OMPI_CC="/opt/nvidia/cudaq/bin/clang" OMPI_CXX="/opt/nvidia/cudaq/bin/clang++" \
    python3 -m pip install mpi4py

In the future, we would like to scale up to multiple GPUs.

Hi @ziqinguse. There are two primary options. 1) The change was approved and should be included in our nightly Docker release tomorrow. You can use the Docker image found here. 2) Alternatively, you can install from the source code linked in the PR above.

Hi, understood. For the nightly version (CUDA Quantum (nightly) | NVIDIA NGC),
I cannot see the details of the layers. Where can I check the Dockerfile?

Hi @ziqinguse, there should be a “layers” tab next to the “overview” and “tags” tabs on the top of the page. This tab should list the Docker layers. Please let me know if this is not what you are seeing.

Thanks. I figured it out by hovering my cursor over the layer.