I’m attempting to align a pair of rather large FASTQ files (960GB in total), and my pipeline is crashing when fq2bam hits the MarkDuplicates stage: excessive memory allocation triggers a SIGKILL that terminates my job.
I’ve attempted a few things to fix this, including:
Running fq2bam with the --low-memory flag
Running fq2bam with the --memory-limit flag set to 90GB when my job requests 100GB
Running fq2bam with the --no-markdups flag
In all cases, fq2bam isn’t respecting the memory limits I’ve set for it, and the --no-markdups flag doesn’t seem to be working at all, as that run still crashed while attempting to mark duplicates.
I’m using Parabricks 4.5.0. I don’t see anything in the release notes for more recent versions addressing any of the above, but I’ll try updating to 4.6.0 and see if that helps.
Update: upgrading to Parabricks 4.6.0 did not fix the problem (although the BWA stage did get faster!). The --no-markdups flag is still being ignored and the job is exceeding its allowed memory:
[PB Info 2026-Jan-07 06:43:36] Sorting and Marking: 1670.335 seconds
[PB Info 2026-Jan-07 06:43:36] ------------------------------------------------------------------------------
[PB Info 2026-Jan-07 06:43:36] || Program: Sorting Phase-II ||
[PB Info 2026-Jan-07 06:43:36] || Version: 4.5.0-1 ||
[PB Info 2026-Jan-07 06:43:36] || Start Time: Wed Jan 7 06:15:45 2026 ||
[PB Info 2026-Jan-07 06:43:36] || End Time: Wed Jan 7 06:43:36 2026 ||
[PB Info 2026-Jan-07 06:43:36] || Total Time: 27 minutes 51 seconds ||
[PB Info 2026-Jan-07 06:43:36] ------------------------------------------------------------------------------
[PB Info 2026-Jan-07 06:43:36] ------------------------------------------------------------------------------
[PB Info 2026-Jan-07 06:43:36] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2026-Jan-07 06:43:36] || Version 4.5.0-1 ||
[PB Info 2026-Jan-07 06:43:36] || Marking Duplicates, BQSR ||
[PB Info 2026-Jan-07 06:43:36] ------------------------------------------------------------------------------
[PB Info 2026-Jan-07 06:43:36] BQSR using CUDA device(s): { 0 }
[PB Info 2026-Jan-07 06:43:37] Using PBBinBamFile for BAM writing
[PB Info 2026-Jan-07 06:43:37] progressMeter - Percentage
[PB Info 2026-Jan-07 06:43:47] 0.0
Process terminated with signal [SIGKILL: 9]. SIGKILL cannot be caught. A common reason for SIGKILL is running out of
host memory. If the user has root access, they may be able to check by running `sudo journalctl -k --since "<#> minutes
ago" | grep "Killed process"` to see the reason why processes were recently killed.
For technical support visit https://docs.nvidia.com/clara/index.html#parabricks
Exiting...
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation
Could not run fq2bam
Exiting pbrun ...
INFO: Cleaning up image...
My experience has been that if you don’t have enough RAM to fit the entire library into memory at once it will crash. Did you have any luck with the --align-only flag?
--align-only worked, but I haven’t had time to test whether I can reproduce the issue with bamsort. The strange thing is that I’ve aligned even larger paired-end libraries without any issues.
I was able to get --align-only working as well, at least to produce an initial aligned BAM file. The pipeline still crashes at the next step when I attempt to run a standalone bamsort, which for some reason still reports a marking duplicates/BQSR phase after the initial sort is done:
[PB Info 2026-Feb-19 19:28:53] Sorting and Marking: 240.106 seconds
[PB Info 2026-Feb-19 19:28:53] ------------------------------------------------------------------------------
[PB Info 2026-Feb-19 19:28:53] || Program: Sorting Phase-II ||
[PB Info 2026-Feb-19 19:28:53] || Version: 4.5.0-1 ||
[PB Info 2026-Feb-19 19:28:53] || Start Time: Thu Feb 19 19:24:53 2026 ||
[PB Info 2026-Feb-19 19:28:53] || End Time: Thu Feb 19 19:28:53 2026 ||
[PB Info 2026-Feb-19 19:28:53] || Total Time: 4 minutes 0 seconds ||
[PB Info 2026-Feb-19 19:28:53] ------------------------------------------------------------------------------
[PB Info 2026-Feb-19 19:29:03] ------------------------------------------------------------------------------
[PB Info 2026-Feb-19 19:29:03] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2026-Feb-19 19:29:03] || Version 4.5.0-1 ||
[PB Info 2026-Feb-19 19:29:03] || Marking Duplicates, BQSR ||
[PB Info 2026-Feb-19 19:29:03] ------------------------------------------------------------------------------
[PB Info 2026-Feb-19 19:29:03] Using PBBinBamFile for BAM writing
[PB Info 2026-Feb-19 19:29:03] progressMeter - Percentage
[PB Info 2026-Feb-19 19:29:13] 0.1
[PB Info 2026-Feb-19 19:29:23] 0.4
[PB Info 2026-Feb-19 19:29:33] 0.6
[PB Info 2026-Feb-19 19:29:43] 0.8
[PB Warning 2026-Feb-19 19:29:52][src/PBTempFile.cpp:155] Attempting to allocate host memory above desired limit (144.478831 GB)
[PB Info 2026-Feb-19 19:29:53] 1.1
[PB Info 2026-Feb-19 19:30:03] 1.2
[PB Info 2026-Feb-19 19:30:13] 1.4
[PB Info 2026-Feb-19 19:30:23] 1.7
[PB Info 2026-Feb-19 19:30:33] 2.0
[PB Info 2026-Feb-19 19:30:43] 2.3
[PB Info 2026-Feb-19 19:30:53] 2.3
[PB Info 2026-Feb-19 19:31:03] 2.3
[PB Info 2026-Feb-19 19:31:13] 2.3
[PB Info 2026-Feb-19 19:31:23] 2.3
[PB Info 2026-Feb-19 19:31:33] 2.3
[PB Info 2026-Feb-19 19:31:43] 2.3
[PB Info 2026-Feb-19 19:31:53] 2.3
[PB Info 2026-Feb-19 19:32:03] 2.3
[PB Info 2026-Feb-19 19:32:13] 2.3
[PB Info 2026-Feb-19 19:32:23] 2.3
[PB Info 2026-Feb-19 19:32:33] 2.3
[PB Info 2026-Feb-19 19:32:43] 2.3
[PB Info 2026-Feb-19 19:32:53] 2.3
[PB Info 2026-Feb-19 19:33:03] 2.3
[PB Info 2026-Feb-19 19:33:13] 2.3
[PB Info 2026-Feb-19 19:33:23] 2.3
[PB Info 2026-Feb-19 19:33:33] 2.3
[PB Info 2026-Feb-19 19:33:43] 2.3
[PB Info 2026-Feb-19 19:33:53] 2.3
[PB Info 2026-Feb-19 19:34:03] 2.3
[PB Info 2026-Feb-19 19:34:13] 2.3
[PB Info 2026-Feb-19 19:34:23] 2.3
[PB Info 2026-Feb-19 19:34:33] 2.3
[PB Info 2026-Feb-19 19:34:43] 2.3
[PB Info 2026-Feb-19 19:34:53] 2.4
[PB Info 2026-Feb-19 19:35:03] 2.5
[PB Info 2026-Feb-19 19:35:13] 2.9
Process terminated with signal [SIGKILL: 9]. SIGKILL cannot be caught. A common reason for SIGKILL is running out of
host memory. If the user has root access, they may be able to check by running `sudo journalctl -k --since "<#> minutes
ago" | grep "Killed process"` to see the reason why processes were recently killed.
For technical support visit https://docs.nvidia.com/clara/index.html#parabricks
Exiting...
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation
Could not run bamsort
Exiting pbrun ...
I’ve confirmed via subsequent runs with the --verbose flag that CPU memory usage immediately spikes to 97-98% at this stage before eventually being over-allocated and crashing
How much available host memory is there on your system? Both installed and actually available before running anything.
What value are you setting for --memory-limit?
I would be conservative when setting this memory limit. We try to respect it, but as the warning in the log says, it is more of a soft limit that we can exceed slightly in a few scenarios. If you do not provide a value for --memory-limit, we default to half of the installed memory; by installed memory I mean the value of MemTotal shown in /proc/meminfo. I see that you are using Singularity, so you are probably on a shared cluster. If that is the case, --memory-limit should be set judiciously, because there is no easy way for us to know what parameters you provided to Slurm or your job scheduler of choice.
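For reference, the default described above can be reproduced with a quick one-off check on the node. This is just an illustrative sketch of the "half of MemTotal" rule, not Parabricks' actual code; /proc/meminfo reports MemTotal in kB.

```shell
# Sketch: compute the default --memory-limit (half of MemTotal) for this node.
# MemTotal in /proc/meminfo is expressed in kB; convert to GB, then halve.
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
default_limit_gb=$(( mem_total_kb / 1024 / 1024 / 2 ))
echo "default --memory-limit: ${default_limit_gb} GB"
```

On a shared node this number reflects the whole machine, not your job's allocation, which is exactly why it can exceed what the scheduler actually granted you.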
What GPU are you using? I see that you are using --low-memory; that parameter refers to low VRAM, i.e. device memory.
Also, the banner for the third stage where it says “Marking Duplicates, BQSR” is a bit of a misnomer. It will always print that message in the banner for the third stage, whether or not you are marking duplicates or doing BQSR. If you are not doing those two operations, the main thing that stage does is finish coordinate sorting and write the final BAM or CRAM file.
You may also want to try using --gpuwrite as that can make the final stage faster by using the GPU to prepare the BAM file.
The shared cluster I’m using has two node types available that I send alignment jobs to: one with 4 A100 80GB GPUs and 500GB of CPU RAM, and one with 4 H100s and 248GB of CPU RAM. Because this is a shared cluster, I’m not generally able to monopolize a whole node, but I’m fairly consistently able to submit jobs with 200GB of RAM and 2 GPUs to either node type (and our H200 nodes are being upgraded to also have 500GB of RAM). However, my current project has a library that contains 2TB of raw FASTQ data and has aligned into a 480GB BAM file. Running the pipeline with 450GB of CPU RAM and two A100s still resulted in the crash.
In my tests with the --memory-limit flag I was setting the limit to 90GB in a job with 100GB of total memory allocated; if it’s a soft limit as you say, I can see how this isn’t enough of a buffer. However, if the default when the flag isn’t specified is half of the installed memory, that should be a default limit of 124GB on a node that has 248GB of RAM installed (confirmed that this is what shows as MemTotal in /proc/meminfo). For a job that has 200GB of RAM allocated, it seems surprising that it would blow so far past the default limit of 124GB and still crash; that’s more than a 60% spike over the limit. I’ll try again with a much more conservative memory limit and see how it does.
The jobs are run with either 2 A100 80GBs or 2 H200s. I’ve since dropped the --low-memory flag, as I realized that property is for GPU RAM.
Great to know re: the misnaming; it would be awesome if the logging could be made clearer there. I’d been doing a separate unmarkduplicates step for non-PCR libraries because it appeared that duplicates were being marked regardless of whether I wanted them to be.
I’ll share my most current run configurations in case something jumps out as obviously wrong. This node has 248GB of RAM available, per /proc/meminfo’s MemTotal, and this job is taking in 2TB of FASTQ data:
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --signal=SIGTERM@900  # give time for the cleanup script to avoid draining the node on job termination
#SBATCH -M HPC4
#SBATCH -p gpu-h200
#SBATCH -J parabricks
#SBATCH -o parabricks_%A.log
#SBATCH --gres=gpu:h200:2
#SBATCH --mem=200G
#SBATCH -c 32
Yes, setting a limit of 90GB when only 100GB is available may be too tight to account for overages. A good rule of thumb is to set --memory-limit to half of the memory you wish to allow for the job.
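That rule of thumb can be wired into the job script so the limit tracks the Slurm allocation instead of the node's MemTotal. A hedged sketch, assuming Slurm's `SLURM_MEM_PER_NODE` environment variable (set inside jobs that specify --mem, in MB) is available; the 204800 fallback just mirrors the 200G request above for illustration:

```shell
# Derive a conservative --memory-limit from the Slurm allocation rather than
# from node MemTotal. SLURM_MEM_PER_NODE is in MB; the fallback value is only
# for illustration when run outside a job.
job_mem_mb="${SLURM_MEM_PER_NODE:-204800}"
memory_limit_gb=$(( job_mem_mb / 1024 / 2 ))
echo "would pass: --memory-limit ${memory_limit_gb}"
</antml>```

With --mem=200G this works out to --memory-limit 100, matching the advice above.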
Great to know re: the misnaming, it would be awesome if the logging could be made more clear there. I’d been doing a separate unmarkduplicates step for non-PCR libraries because it appeared that duplicates were being marked regardless of whether I wanted them to be or not
Thanks for the feedback. We will note this for the future.
Two more questions.
Would it be possible to try exclusive jobs on your cluster? In my experience, many clusters do not require users to set a memory limit, so even if you set one, another user could get scheduled on the same node without a memory limit applied to their job.
Do your FASTQs have very deep coverage in certain parts of the genome? To parallelize the coordinate sorting, we break up the alignments by parts of chromosomes, so very deep coverage in certain regions could make that group of alignments very large. We do not have any parameters for you to adjust this behavior, but it would be good to know for the future if this is a dataset that does not work well with our algorithm.
One more comment. Earlier you wrote
I’ve confirmed via subsequent runs with the --verbose flag that CPU memory usage immediately spikes to 97-98% at this stage before eventually being over-allocated and crashing
Did you mean that you were using the --monitor-usage option? If so, it would be interesting if you saw 97% memory usage while not having exclusive access, because that would mean another job scheduled on the same node was using a substantial amount of memory. For that metric we use the available-memory values from /proc/meminfo, which are per-node, not per-job; we do not have visibility into your cluster’s scheduling and how it may partition a node.
Finally, your run command looks good, except I would set --memory-limit 100 so that it is half of what you are asking from Slurm.
Unfortunately, I am not able to run exclusive jobs on my cluster; our QOS limits are such that an exclusive job will never be picked up by the scheduler, as it attempts to allocate more resources than are allowed per job.
The FASTQs for this project are whole-genome sequencing. In theory, they should be fairly uniform in coverage but in practice there are always spikes in certain parts of the genome, particularly for repetitive regions or regions with high similarity to other parts of the genome where you get a lot of lower-fidelity reads mapping onto those areas.
If I’m understanding you correctly, you’re saying that there might be so many reads mapping to the same region that the chunk of reads for that area during coordinate sorting overwhelms the available memory when trying to load it all at once. If that’s the case, then it might require an adjustment at the algorithm level to do chunking dynamically based on the number of reads in a region, rather than just dividing the dataset into X chunks where each chunk covers a certain region; with the latter, inconsistency in chunk sizes can lead to memory spikes.
And yes I did mean --monitor-usage, interesting to note on the spikes coming from multiple jobs. I’m curious if that other job crashed when mine did, though I have no insight there.
The alignment did still crash with --memory-limit 100. I’m trying it again on one of the A100 nodes with 450G of RAM allocated and --memory-limit 100, and we’ll see how that performs. If those configurations aren’t sufficient to get through coordinate sorting, my next options are either:
Coordinate-sort the aligned BAM generated by --align-only using a CPU run of GATK SortSam and then feed that sorted BAM to Parabricks markdups and continue the GPU pipeline from there
Split the unaligned FASTQ files into multiple chunks, run fq2bam --no-markdups on the chunked inputs, merge the coordinate sorted BAM files, and then continue with GPU markdups from there on the merged sorted BAM
Neither option is ideal in terms of automation and infrastructure scalability but they’ll get me through this project at least
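The second option above could be scripted roughly as follows. This is a hedged sketch only: the seqkit split2 and samtools merge invocations, the pbrun flags, and the part-file naming are assumptions to verify against the installed versions, and all file paths are placeholders. The run() wrapper just prints each command so the sketch is safe to dry-run; replace the echo with real execution to use it.

```shell
#!/bin/bash
# Sketch of the chunk-align-merge fallback (option 2 above). Dry-run only:
# run() prints each command instead of executing it.
set -u
run() { echo "+ $*"; }

REF=GRCm38.fa            # placeholder paths
R1=sample_R1.fq.gz
R2=sample_R2.fq.gz
CHUNKS=8

# 1) Split the paired FASTQs into synchronized chunks
run seqkit split2 -1 "$R1" -2 "$R2" -p "$CHUNKS" -O chunks/

# 2) Align each chunk without duplicate marking; each chunk BAM comes out
#    coordinate-sorted. Part naming assumed to follow seqkit's .part_NNN style.
for i in $(seq 1 "$CHUNKS"); do
    part=$(printf 'part_%03d' "$i")
    run pbrun fq2bam --ref "$REF" \
        --in-fq "chunks/sample_R1.${part}.fq.gz" "chunks/sample_R2.${part}.fq.gz" \
        --no-markdups --out-bam "chunk_${i}.bam"
done

# 3) Merge the sorted chunk BAMs; duplicate marking can then run on the merge
run samtools merge -@ 16 merged.bam chunk_*.bam
```

The main caveat with this approach is that duplicate marking must happen after the merge, since duplicates can span chunk boundaries.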
Update: the pipeline still crashed with 450G of RAM and a memory limit of 100G
I ran a CPU queryname sort and attempted to resume the GPU pipeline with a standalone markdups stage, which also crashed due to overallocation of memory
Sorry for the late reply. It looks like your run is violating an assumption in our code if 400GB does not seem to be enough. Is the data you are using public, or is there a similar version of your data that is publicly available? An SRA accession number or EBI dataset would be great so I can reproduce your runs and see in more detail what is happening. Also, for completeness’ sake, can you post your final run command so I can do a run with the same parameters?
I ran a CPU queryname sort and attempted to resume the GPU pipeline with a standalone markdups stage, which also crashed due to overallocation of memory
Which tool did you use for this? How much CPU memory did you allocate to that tool?
The dataset is non-public and I’d be hard-pressed to identify a comparable one, but I can share it privately with you to help with internal testing. It’s very large (2TB of raw FASTQ), so we’ll need to coordinate on data transfer via an AWS bucket or something similar.
The reference genome used here is Mus musculus GRCm38.
You can find my initial Parabricks command and relevant slurm configurations here:
I ultimately had to do sorting, duplicate marking, and BQSR generation/application via CPU workflows for this dataset before resuming GPU processing with the HaplotypeCaller stage, which was successful.
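For anyone landing here later, the CPU fallback described above might look roughly like the following. A hedged sketch, not the poster's exact commands: the GATK and pbrun invocations are standard forms that should be checked against the docs for your versions, and all file names plus the --known-sites resource are placeholders. As before, run() only prints the commands.

```shell
#!/bin/bash
# Sketch of the CPU sort/markdup/BQSR fallback, resuming on GPU at
# HaplotypeCaller. Dry-run only: run() prints each command.
set -u
run() { echo "+ $*"; }

REF=GRCm38.fa   # placeholder reference path

# CPU: coordinate sort, then mark duplicates
run gatk SortSam -I aligned.bam -O sorted.bam -SO coordinate
run gatk MarkDuplicates -I sorted.bam -O marked.bam -M dup_metrics.txt

# CPU: generate and apply base quality score recalibration
run gatk BaseRecalibrator -I marked.bam -R "$REF" \
    --known-sites known_variants.vcf.gz -O recal.table
run gatk ApplyBQSR -I marked.bam -R "$REF" \
    --bqsr-recal-file recal.table -O recal.bam

# GPU: resume the Parabricks pipeline at variant calling
run pbrun haplotypecaller --ref "$REF" --in-bam recal.bam \
    --out-variants sample.vcf
```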
Hi @PaulMatterBioDev , thank you for the confirmation on your workflow. That helps to narrow it down a bit to what part of the pipeline could be causing issues on our end.
Thanks for being willing to share your data. I’ll send you a private message.