Sorting clarification for markdup

Does Parabricks markdup assume/require the input bam to be coordinate or queryname sorted?

The documentation is contradictory: it first says both sort orders are supported, with coordinate being the default and queryname an option if you use --markdups-assume-sortorder-queryname, but three later statements in the documentation say “The input BAM/CRAM must be sorted by queryname”.

Can you clarify which is the default and/or required option for sorting of input bams?


Would also love some clarification here. I attempted to run a standalone markdups on a BAM file that I had sorted by coordinate using samtools and am seeing the following error:

[PB Info 2026-Feb-26 02:10:03] ------------------------------------------------------------------------------
[PB Info 2026-Feb-26 02:10:03] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2026-Feb-26 02:10:03] ||                              Version 4.5.0-1                             ||
[PB Info 2026-Feb-26 02:10:03] ||                                premarkdup                                ||
[PB Info 2026-Feb-26 02:10:03] ------------------------------------------------------------------------------
[PB Warning 2026-Feb-26 02:10:03][src/PreMarkdupMain.cpp:226] Starting premarkdup, the input bam should already be sorted by queryname
[PB Info 2026-Feb-26 02:10:03] ProgressMeter -  Elapsed-Minutes Reads-Processed Reads/Minute
[PB Info 2026-Feb-26 02:10:03] 0.100000 100     16
[PB Info 2026-Feb-26 02:10:03] chunk queue size:0       free queue size:0
[PB Error 2026-Feb-26 02:10:03][src/PreMarkdupMain.cpp:132] Input file is not sorted by queryname, exiting.

Rereading the documentation, it seems that perhaps this line:

markdup supports the marking of duplicates in two ways, assuming the sort order to be coordinate (the default) or queryname (--markdups-assume-sortorder-queryname).

is referring to the sorting that happens during the markdup phase, which is by coordinate unless the flag directs it to sort by queryname, while the input file still needs to be sorted by queryname.

Looking at the fq2bam documentation though, it seems like the equivalent GATK commands would be the following:

Run bwa-mem and pipe the output to create a sorted BAM.

$ bwa mem \
    -t 32 \
    -K 10000000 \
    -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \
    <INPUT_DIR>/${REFERENCE_FILE} <INPUT_DIR>/${INPUT_FASTQ_1} <INPUT_DIR>/${INPUT_FASTQ_2} | \
  gatk SortSam \
    --java-options -Xmx30g \
    --MAX_RECORDS_IN_RAM 5000000 \
    -I /dev/stdin \
    -O cpu.bam \
    --SORT_ORDER coordinate

Mark duplicates.

$ gatk MarkDuplicates \
    --java-options -Xmx30g \
    -I cpu.bam \
    -O mark_dups_cpu.bam \
    -M metrics.txt

Generate a BQSR report.

$ gatk BaseRecalibrator \
    --java-options -Xmx30g \
    --input mark_dups_cpu.bam \
    --output <OUTPUT_DIR>/${OUTPUT_RECAL_FILE} \
    --known-sites <INPUT_DIR>/${KNOWN_SITES_FILE} \
    --reference <INPUT_DIR>/${REFERENCE_FILE}

In this equivalent pipeline, SortSam very clearly outputs the BAM as a coordinate-sorted file, but if that file were then passed to Parabricks markdup, it would crash.

Hi,

The input BAM to our markdup tool must be sorted by queryname. GATK MarkDuplicates produces slightly different results depending on whether its input BAM was coordinate or queryname sorted. By default, we match the output of GATK MarkDuplicates as if it had been run on a coordinate-sorted BAM. With --markdups-assume-sortorder-queryname, we match the output of GATK MarkDuplicates as if it had been run on a queryname-sorted BAM.

Hope this clears it up.
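
For anyone who wants to confirm what sort order a BAM declares before handing it to markdup, here is a minimal sketch, not part of Parabricks. It reads the SAM header text and prints the SO: value from the @HD line. The function name sam_sort_order and the file name input.bam are made up for illustration, and the header only reports what the file claims, which may differ from what the internal check in markdup looks at.

```shell
#!/bin/sh
# Print the sort order a SAM header declares (the SO: field of the @HD line).
# Reads header text on stdin; prints "unknown" if no SO: field is present.
# sam_sort_order is a hypothetical helper name, not a samtools or Parabricks command.
sam_sort_order() {
  awk -F'\t' '
    $1 == "@HD" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
    }
    END { if (!found) print "unknown" }
  '
}

# Typical use (assumes samtools is installed; input.bam is a placeholder):
#   samtools view -H input.bam | sam_sort_order
```

If this prints coordinate, as it would for the cpu.bam produced by the SortSam step quoted earlier in the thread, the file would need to be re-sorted by queryname before running standalone markdup.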


Does it truly need to be queryname sorted or can it be collated (pairs next to each other)?

Hello @jacob.hagen, yes, it needs to be queryname sorted. We do a check internally and I believe it would fail if it weren’t fully sorted.

Also, we have updated the documentation with our new v4.7.0 release to make this clearer. markdup - NVIDIA Docs
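
To illustrate the difference between collated and fully queryname-sorted input raised above, here is a rough sketch that checks whether read names arrive in non-decreasing string order. It is only an approximation: the function name names_sorted is made up, and plain string comparison is an assumption, since the exact ordering rule used by the internal check in markdup is not stated in this thread.

```shell
#!/bin/sh
# Exit 0 if the first tab-separated column (read name) on stdin is in
# non-decreasing string order, 1 otherwise.
# Assumption: plain awk string comparison; the ordering Parabricks actually
# enforces may differ. names_sorted is a hypothetical helper name.
names_sorted() {
  awk -F'\t' '
    # Appending "" forces a string comparison even for numeric-looking names.
    NR > 1 && ($1 "") < (prev "") { bad = 1; exit }
    { prev = $1 }
    END { exit bad + 0 }
  '
}

# Typical use (assumes samtools is installed; input.bam is a placeholder):
#   samtools view input.bam | names_sorted && echo "names in order"
```

A collated file keeps mates adjacent but can list r2 before r1, so it would fail this check even though every pair is intact.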