Hello,
I am trying to run fq2bam through pbrun on my university's HPC cluster, which uses Slurm scheduling, but the job fails with CUDA out-of-memory errors.
I am using 4 A100s and 256 GB of system RAM, which I request in the Slurm script like so:
#SBATCH --time=12:00:00
#SBATCH --ntasks=16
#SBATCH --mem=256g
#SBATCH --tmp=128g
#SBATCH -p a100-4
#SBATCH --gres=gpu:a100:4
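If it helps, a quick sanity check I can add right after the #SBATCH lines is to echo what Slurm actually granted (SLURM_MEM_PER_NODE and CUDA_VISIBLE_DEVICES are standard variables Slurm sets, though availability can depend on site configuration):
# Sanity check: show the memory and GPU devices Slurm allocated to this job
echo "Job ${SLURM_JOB_ID}: mem=${SLURM_MEM_PER_NODE} MB, GPUs=${CUDA_VISIBLE_DEVICES}"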
In this script, I double-check that the GPUs are accessible with nvidia-smi, which produces the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:06:00.0 Off | 0 |
| N/A 22C P0 50W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:47:00.0 Off | 0 |
| N/A 23C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:85:00.0 Off | 0 |
| N/A 23C P0 51W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000000:C7:00.0 Off | 0 |
| N/A 24C P0 61W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
The rest of my script loads Singularity and tries to run pbrun fq2bam (the variables in the singularity call all point to valid file paths):
module load singularity
singularity run --bind ${FOLDERS_TO_BIND} --nv /software/singularity-images/clara-parabricks/4.0.0-1.sif \
pbrun fq2bam \
--ref $REFERENCE_PATH \
--in-fq $FASTA_R1_PATH $FASTA_R2_PATH \
--out-recal-file $OUTPUT_RECAL \
--out-bam $OUTPUT_PATH \
--knownSites $KNOWN_INDELS \
--knownSites $KNOWN_INDELS2 \
--knownSites $KNOWN_SNPS \
--tmp-dir $TMP_DIR
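In case the container environment matters, here is a minimal check to confirm the GPUs are visible inside the image itself (same image path, just running nvidia-smi via singularity exec):
singularity exec --nv /software/singularity-images/clara-parabricks/4.0.0-1.sif nvidia-smi
I would expect this to show the same four idle A100s as the host-side output above.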
Unfortunately, the job fails with the following:
[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /scratch.global/chenzler/billi020/LR11_006_2_S3_R1_001.fastq.gz and /scratch.global/chenzler/billi020/LR11_006_2_S3_R2_001.fastq.gz
[Parabricks Options Mesg]: @RG\tID:HVJMNDSX2.2\tLB:lib1\tPL:bar\tSM:sample\tPU:HVJMNDSX2.2
[PB Info 2023-Jan-18 11:10:01] ------------------------------------------------------------------------------
[PB Info 2023-Jan-18 11:10:01] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Jan-18 11:10:01] || Version 4.0.0-1 ||
[PB Info 2023-Jan-18 11:10:01] || GPU-BWA mem, Sorting Phase-I ||
[PB Info 2023-Jan-18 11:10:01] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Info 2023-Jan-18 11:12:16] GPU-BWA mem
[PB Info 2023-Jan-18 11:12:16] ProgressMeter Reads Base Pairs Aligned
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Warning 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/memoryManager.cu/80: out of memory
[PB Error 2023-Jan-18 11:12:23][ParaBricks/src/check_error.cu:44] No GPUs active, shutting down due to previous error., exiting.
For technical support visit https://docs.nvidia.com/clara/parabricks/4.0.0/Help.html
Exiting...
I’m not sure what the issue is, since I’ve requested a large amount of system memory and nvidia-smi shows all four 40 GB GPUs completely idle before the run. I’ve tried reducing the number of tasks (CPUs) and using 2 GPUs instead of 4, but haven’t figured it out.
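For reference, when I tried 2 GPUs I changed the request and capped the run like this (assuming --num-gpus is the right pbrun option for limiting GPU usage):
#SBATCH --gres=gpu:a100:2
pbrun fq2bam --num-gpus 2 ...   # all other arguments unchanged from the command above
Thanks in advance for your help!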