Out-of-memory errors running pbrun fq2bam through Singularity on A100s via Slurm

Hello,

I am trying to run fq2bam through pbrun on my university’s HPC, which uses Slurm scheduling, but it is failing with CUDA out-of-memory errors.

I am using 4 A100s and 256 GB of system RAM, which I request in the Slurm script like so:

#SBATCH --time=12:00:00
#SBATCH --ntasks=16
#SBATCH --mem=256g
#SBATCH --tmp=128g
#SBATCH -p a100-4
#SBATCH --gres=gpu:a100:4
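
As a sanity check, the script also runs nvidia-smi before the Parabricks call to confirm the allocated GPUs are visible. The CUDA_VISIBLE_DEVICES echo below is a hypothetical extra, not something from my actual script:

# after the #SBATCH header, before launching Parabricks
nvidia-smi
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"   # hypothetical extra; only informative if Slurm's GPU plugin sets it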

That nvidia-smi check produces this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   22C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:47:00.0 Off |                    0 |
| N/A   23C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   23C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:C7:00.0 Off |                    0 |
| N/A   24C    P0    61W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

The rest of my script loads Singularity and runs pbrun fq2bam (the variables in the singularity call all point to valid file paths):

module load singularity
singularity run --bind ${FOLDERS_TO_BIND} --nv /software/singularity-images/clara-parabricks/4.0.0-1.sif \
    pbrun fq2bam \
        --ref $REFERENCE_PATH \
        --in-fq $FASTA_R1_PATH $FASTA_R2_PATH \
        --out-recal-file $OUTPUT_RECAL \
        --out-bam $OUTPUT_PATH \
        --knownSites $KNOWN_INDELS \
        --knownSites $KNOWN_INDELS2 \
        --knownSites $KNOWN_SNPS \
        --tmp-dir $TMP_DIR

Unfortunately, the job fails with the following:

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /scratch.global/chenzler/billi020/LR11_006_2_S3_R1_001.fastq.gz and /scratch.global/chenzler/billi020/LR11_006_2_S3_R2_001.fastq.gz
[Parabricks Options Mesg]: @RG\tID:HVJMNDSX2.2\tLB:lib1\tPL:bar\tSM:sample\tPU:HVJMNDSX2.2
[PB Info 2023-Jan-18 11:10:01] ------------------------------------------------------------------------------
[PB Info 2023-Jan-18 11:10:01] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2023-Jan-18 11:10:01] ||                              Version 4.0.0-1                             ||
[PB Info 2023-Jan-18 11:10:01] ||                       GPU-BWA mem, Sorting Phase-I                       ||
[PB Info 2023-Jan-18 11:10:01] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Info 2023-Jan-18 11:12:16] GPU-BWA mem
[PB Info 2023-Jan-18 11:12:16] ProgressMeter    Reads           Base Pairs Aligned
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Error 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:44] No GPUs active, shutting down due to previous error., exiting.
For technical support visit https://docs.nvidia.com/clara/parabricks/4.0.0/Help.html
Exiting...

I’m not sure what the issue is, since I’ve requested a large amount of memory. I’ve tried reducing the number of tasks (CPUs), requesting 2 GPUs instead of 4, and so on, but haven’t figured it out.
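
For reference, this is roughly how I reduced the GPU request in the Slurm header; the exact ntasks value I used is from memory, so treat the details as approximate:

#SBATCH --ntasks=8
#SBATCH --gres=gpu:a100:2

Thanks in advance for your help!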

Hey @chaco001,

I can see that the GPUs are active on your node, but it seems like they may not be accessible inside the Singularity image. I would try running nvidia-smi inside the image to see whether it works. If it does, then I would try running your Parabricks job again. If not, there may be some flags missing from the singularity run command.
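
For example, something along these lines (reusing the image path from your post) should show whether the container can see the GPUs:

singularity run --nv /software/singularity-images/clara-parabricks/4.0.0-1.sif nvidia-smi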

Thank you for your help! Two things:

1. I believe I ran nvidia-smi correctly within the image. I did this interactively on the node after requesting only two GPUs. Does this output suggest the GPUs are accessible within Singularity?

singularity run --nv /software/singularity-images/clara-parabricks/4.0.0-1.sif nvidia-smi
Thu Jan 19 15:34:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:47:00.0 Off |                    0 |
| N/A   25C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   24C    P0    69W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2. Is there a way to limit GPU memory usage? If not, do you have a rough idea of how much memory I’d need for a fq2bam job with, e.g., 1M or 10M reads using the current human reference/index?