Fq2bam Marking Duplicates, BQSR - high memory use, job killed OOM

Because v4.1.1-1 isn’t handling --align-only and --no-markdups as expected for me (see my other recent posts), fq2bam is running the Marking Duplicates, BQSR stage. Here, however, I run into memory problems.

Consistent with the stated hardware requirements, I queued a job using Slurm as follows:

#SBATCH --cpus-per-task=24
#SBATCH --gpus-per-task=2
#SBATCH --mem-per-cpu=5g 
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
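
The pbrun command inside that script was roughly along these lines (the reference and FASTQ paths are placeholders, not my actual files):

# --align-only / --no-markdups are not taking effect for me in 4.1.1-1
# (see my other posts), so the Marking Duplicates, BQSR stage still runs.
pbrun fq2bam \
    --ref /ref/GRCh38.fa \
    --in-fq /fastq/sample_R1.fastq.gz /fastq/sample_R2.fastq.gz \
    --out-bam /out/sample.bam \
    --align-only \
    --no-markdups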

After properly aligning and sorting, it logged:

[PB Info 2023-Jul-14 12:04:31] ------------------------------------------------------------------------------
[PB Info 2023-Jul-14 12:04:31] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2023-Jul-14 12:04:31] ||                              Version 4.1.1-1                             ||
[PB Info 2023-Jul-14 12:04:31] ||                         Marking Duplicates, BQSR                         ||
[PB Info 2023-Jul-14 12:04:31] ------------------------------------------------------------------------------
[PB Info 2023-Jul-14 12:04:57] progressMeter -  Percentage
[PB Info 2023-Jul-14 12:05:07] 0.9       14.09 GB
[PB Info 2023-Jul-14 12:05:17] 1.6       26.09 GB
[PB Info 2023-Jul-14 12:05:27] 2.4       37.58 GB
[PB Info 2023-Jul-14 12:05:37] 3.2       48.73 GB
[PB Info 2023-Jul-14 12:05:47] 4.5       60.46 GB
[PB Info 2023-Jul-14 12:05:57] 6.0       70.30 GB
[PB Info 2023-Jul-14 12:06:07] 7.0       80.88 GB
[PB Info 2023-Jul-14 12:06:17] 8.7       91.20 GB
[PB Info 2023-Jul-14 12:06:27] 10.1      100.41 GB
[PB Info 2023-Jul-14 12:06:37] 11.7      111.54 GB
slurmstepd: error: Detected 1 oom_kill event in StepId=55846256.batch. Some of the step tasks have been OOM Killed.

Since the job was queued with a total of 5 GB x 24 = 120 GB of CPU RAM, it is apparent that memory usage kept accumulating and the job was killed when it hit the job limit. This raises two issues/questions:

  1. Why does Marking Duplicates, BQSR use so much memory? Is it expected?

  2. More importantly, is there a mechanism to tell fq2bam how much memory is available to it?

The job is running on a shared node, and I do not necessarily have access to all RAM on the machine. Even if I try to ensure that I am the only job running on a node, I still need to provide a memory request, and thus will have a job memory limit.

I read in other posts about an option called --memory-limit, but I do not see it in the current documentation, so I assume it was dropped in recent versions? It seems an important option to me.

(As an aside, this issue would become moot for me if --align-only worked; I don’t actually want to do sorting or duplicate marking.)

While I still think --memory-limit is a valuable option to have, I may have found a workaround for shared clusters.

From Slurm sbatch docs:

--exclusive[={user|mcs}]
The job allocation can not share nodes with other running jobs … the job is allocated all CPUs and GRES on all nodes in the allocation, but is only allocated as much memory as it requested. This is by design to support gang scheduling, because suspended jobs still reside in memory. To request all the memory on a node, use --mem=0.

I had missed that last bit previously: it is possible to request all the memory on a node without having to provide a specific --mem number.
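
So the job header can be changed along these lines (a sketch; the rest of the header is as above):

#SBATCH --cpus-per-task=24
#SBATCH --gpus-per-task=2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# Exclusive node access; with --exclusive, --mem=0 requests all memory on the node.
#SBATCH --exclusive
#SBATCH --mem=0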

A further discovery, made rather by accident. To make #SBATCH --exclusive work meaningfully on our cluster, I had to switch to V100 GPUs with 16 GB of memory (previously A40s), so I added the --low-memory flag. Unexpectedly, one or more of those changes altered the behavior of Marking Duplicates, BQSR. CPU memory no longer accumulates progressively, so there is no OOM:

[PB Info 2023-Jul-15 22:51:39] ------------------------------------------------------------------------------
[PB Info 2023-Jul-15 22:51:39] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2023-Jul-15 22:51:39] ||                              Version 4.1.1-1                             ||
[PB Info 2023-Jul-15 22:51:39] ||                         Marking Duplicates, BQSR                         ||
[PB Info 2023-Jul-15 22:51:39] ------------------------------------------------------------------------------
[PB Info 2023-Jul-15 22:53:04] progressMeter -  Percentage
[PB Info 2023-Jul-15 22:53:14] 0.2       1.87 GB
[PB Info 2023-Jul-15 22:53:24] 0.8       3.01 GB
[PB Info 2023-Jul-15 22:53:34] 1.9       1.43 GB
[PB Info 2023-Jul-15 22:53:44] 2.7       1.47 GB
[PB Info 2023-Jul-15 22:53:54] 3.1       1.86 GB
[PB Info 2023-Jul-15 22:54:04] 3.8       2.98 GB
[PB Info 2023-Jul-15 22:54:14] 4.5       3.37 GB
[PB Info 2023-Jul-15 22:54:24] 5.5       1.84 GB
...
[PB Info 2023-Jul-15 23:13:44] 97.9      5.48 GB
[PB Info 2023-Jul-15 23:13:54] 98.0      4.35 GB
[PB Info 2023-Jul-15 23:14:04] 98.1      4.08 GB
[PB Info 2023-Jul-15 23:14:14] 100.0     0.00 GB
[PB Info 2023-Jul-15 23:14:14] BQSR and writing final BAM:  1270.353 seconds
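
For reference, the only change on the Parabricks side was adding --low-memory to the otherwise identical command (again a sketch, with the same placeholder paths as before):

# Same invocation as above, with low-memory mode for the 16 GB V100s.
pbrun fq2bam \
    --ref /ref/GRCh38.fa \
    --in-fq /fastq/sample_R1.fastq.gz /fastq/sample_R2.fastq.gz \
    --out-bam /out/sample.bam \
    --align-only \
    --no-markdups \
    --low-memory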

It would be great if someone had insight into what is happening here; it doesn’t appear to be documented.

Hello. A few notes:

  1. You can see --memory-limit in the documentation: fq2bam (BWA-MEM + GATK) - NVIDIA Docs. Scroll down to the performance options. The default is actually half of the system's installed memory, so depending on the Slurm configuration we may detect more than what Slurm has allocated to you. If you are having issues with host memory usage, I would suggest providing a specific number of GB manually. For example, if you are going to have 60 GB allocated to you, then to be safe you can provide --memory-limit 40 (see the sketch after this list).
  2. --low-memory is a low-memory mode for GPUs such as the V100 with 16 GB of device memory.
  3. What is the issue with --align-only?
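
For example, something along these lines (a sketch; file names are placeholders, and 40 assumes roughly a 60 GB allocation):

# Cap Parabricks host memory use safely below the Slurm allocation.
pbrun fq2bam \
    --ref Ref.fa \
    --in-fq Sample_1.fastq.gz Sample_2.fastq.gz \
    --out-bam Sample.bam \
    --memory-limit 40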