Hi NVIDIA Parabricks team,
I’m running Parabricks fq2bam as part of an nf-core/sarek-derived workflow on AWS Batch GPU instances, and I hit a CRAM integrity/indexing issue with one sample. The Parabricks fq2bam task completed successfully with exit code 0, but the downstream samtools index step failed on the CRAM produced by Parabricks.
Environment
- Parabricks container: nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1
- Tool: pbrun fq2bam
- Downstream indexer: samtools 1.21
- Workflow: nf-core/sarek-derived Nextflow workflow
- Platform: AWS Batch / Seqera Platform
- Instance shape: GPU AWS Batch environment, using 4 GPUs
- Reference: GRCh38 / GATK bundle-style reference files
Parabricks command shape
The task used pbrun fq2bam with paired FASTQs, GRCh38 BWA index/reference, known sites for BQSR, and interval restriction. The relevant options were approximately:
pbrun fq2bam \
--ref <GRCh38_BWA_index_prefix> \
--in-fq <sample>_R1.fastq.gz <sample>_R2.fastq.gz \
--out-bam <sample>.cram \
--knownSites dbsnp_146.hg38.vcf.gz \
--knownSites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--knownSites Homo_sapiens_assembly38.known_indels.vcf.gz \
--out-recal-file <sample>.table \
--interval-file wgs_calling_regions_noseconds.hg38.bed \
--num-gpus 4 \
--bwa-cpu-thread-pool 48 \
--monitor-usage \
--read-group-id-prefix <sample>.L1 \
--read-group-sm <patient_sample> \
--read-group-lb <sample> \
--read-group-pl ILLUMINA \
--bwa-options='-K 100000000 -Y' \
--gpuwrite \
--gpusort \
--bwa-nstreams auto
The output extension was .cram, so Parabricks produced CRAM output.
Observed failure
Parabricks itself finished successfully, but downstream indexing failed:
samtools index -@ 0 22A0018864.cram
with:
[E::cram_index_container] CRAM slice offset 74642 does not match landmark 1 in container header (202490)
samtools index: failed to create index for "22A0018864.cram"
The failing CRAM is large: 74,939,777,141 bytes (about 75 GB). I was not able to download and inspect the full CRAM locally, but I did inspect the associated .crai and the BQSR recalibration table. The BQSR table looked normal and showed that a large number of reads were processed, so this does not look like an early task failure. The problem seems specific to the CRAM structure and its indexability.
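In case it helps with triage, a .crai lookup for the offset named in the error can be sketched as below. This is only a sketch under assumptions: it relies on the standard .crai layout (gzip-compressed, tab-separated text with six columns), it assumes the slice offset in the error message is comparable to the fifth column, and the function name and file name are placeholders rather than anything from the workflow.

```shell
# Hypothetical helper: print .crai entries whose slice offset (column 5)
# equals the given value.
# Assumed .crai columns: seq_id, aln_start, aln_span,
# container_offset, slice_offset, slice_size.
crai_find_slice_offset() {
    crai="$1"
    offset="$2"
    # Decompress the index and keep only rows with a matching slice offset.
    gzip -dc "$crai" | awk -F'\t' -v off="$offset" '$5 == off'
}

# Example call (placeholder file name):
# crai_find_slice_offset 22A0018864.cram.crai 74642
```

Any matching row gives the container start offset in column 4, which would identify the region of the CRAM worth extracting.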
Why I suspect the CRAM output
The pipeline stage immediately upstream was Parabricks fq2bam, which completed successfully. The next stage was a standard samtools index of the Parabricks-produced CRAM. The failure message appears to be about inconsistent internal CRAM container/slice offsets rather than a missing file, truncated file, or reference mismatch.
Questions
- Is pbrun fq2bam CRAM output expected to be fully compatible with samtools index from htslib/samtools 1.21?
- Are there known issues in Parabricks 4.7.0-1 with CRAM output, especially when using --gpuwrite, --gpusort, and/or --bwa-nstreams auto?
- Are there recommended settings for producing CRAM safely from fq2bam at this scale?
- Would you recommend avoiding CRAM output from fq2bam and writing BAM directly, then converting/indexing with another tool if CRAM is required?
- What additional diagnostics would be most useful if the full CRAM is too large to download? For example, would the .crai, .command.log, BQSR table, or selected byte ranges from the CRAM be useful?
I can provide the Parabricks .command.log, the .crai, the BQSR recalibration table, and exact command/configuration details if useful. I cannot easily share the full CRAM due to size and data restrictions.
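If selected byte ranges would be useful, something like the following could be used to pull them without downloading the whole file. This is a rough sketch, not a tested procedure: the helper name is hypothetical, the bucket and key are placeholders, and it assumes the CRAM is in S3 and reachable with the AWS CLI's ranged get-object request.

```shell
# Hypothetical helper: build an HTTP Range value covering `length` bytes
# starting at byte offset `start` (both inclusive-range arithmetic).
cram_byte_range() {
    start="$1"
    length="$2"
    end=$((start + length - 1))
    printf 'bytes=%s-%s\n' "$start" "$end"
}

# Example (placeholders in angle brackets): fetch the first 1 MiB of the CRAM,
# which covers the file definition and the start of the header container.
# aws s3api get-object --bucket <bucket> --key <prefix>/22A0018864.cram \
#     --range "$(cram_byte_range 0 1048576)" cram_head.bin
```

The same call with a container start offset taken from the .crai would extract just the container that samtools index complains about.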
Thanks for any guidance on whether this is a known issue or if there are recommended Parabricks settings to avoid generating non-indexable CRAM output.