Can pbrun germline take stdin?

Benutzer1756 · November 18, 2021, 12:43pm

Hi,
according to GATK best practices unmapped reads should be stored as .ubam files.
pbrun or bwa do not take .bam as input.
Currently we do sth like this:

picard SamToFastq \
		I= input.ubam \
		FASTQ=/dev/stdout \
		CLIPPING_ATTRIBUTE=XT \
		CLIPPING_ACTION=2 \
		INTERLEAVE=true \
		NON_PF=true \
		TMP_DIR= tmp/ | \
               bwa mem -M -Y -t 16 -p ref/Homo_sapiens_assembly38.fasta /dev/stdin

INTERLEAVE=true generates an interleaved .fastq, which bwa accepts via -Y

I can not do the same in pbrun:

picard SamToFastq \
		I= input.ubam \
		FASTQ=/dev/stdout \
		CLIPPING_ATTRIBUTE=XT \
		CLIPPING_ACTION=2 \
		**INTERLEAVE=true** \
		NON_PF=true \
		TMP_DIR= tmp/ | \
pbrun germline --ref ref/Homo_sapiens_assembly38.fasta \
		                 --bwa-options "-Y"
				 --in-fq /dev/stdin
				 --knownSites ref/Homo_sapiens_assembly38.dbsnp138.vcf \
				 --knownSites ref/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
				 --out-bam out.bam \
				 --out-variants out.vcf \
				 --out-recal-file out.report.txt \
				 --out-duplicate-metrics out.duplicatemetrics.txt \
				 --tmp-dir /tmp \

Output:
[Parabricks Options Error]: Input file /dev/stdin not found. Exiting…

Anyone tried something similar yet?

Best,
Max

gburnett · November 29, 2021, 8:53pm

Hey @Benutzer1756,

I have not tested this, but my theory for why it doesn’t work is because the pbrun command is a wrapper. The work is done by a docker container. The container uses a different shell instance and therefore probably can’t read in from the original shell’s /dev/stdin

Benutzer1756 · November 30, 2021, 10:30am

Hi, thanks for your response. This is too bad, as the issue expands to variant calling and recalibration, which are not able to read or write gzipped vcfs. As g.vcfs can get quite big, this costs a lot of time and space.
Do you happen to know a workaround for this issue?

gburnett · November 30, 2021, 7:30pm

Hey @Benutzer1756, which variant callers are you trying to use that don’t read or write gzipped files?

Benutzer1756 · December 1, 2021, 1:19pm

Hi,
pbrun germline with --gvcf and --out-variants.g.vcf.gz produces no gzipped vcf.gz (which can get pretty large for WGS)
pbrun haplotypecaller on the other hand does gzip the output, if a g.vcf.gz is specified.

pbrun genotypegvcf accepts .g.vcf.gz but does not produce gzipped
pbrun vqsr does not accept gzipped vcf files, neither as input nor as training set.

I think working with ungzipped vcfs from haplotypecaller on is fine, as the size is managable (~1GB per 30xWGS) but the g.vcf produced by pbrun germline should be compressed.

edit: to come back to the original question: it would also be great if pbrun germline and fastq2bam accept unaligned .bam-files as input to concord with GATK best practices

Best,
Max

gburnett · December 7, 2021, 7:18pm

Hey @Benutzer1756,

Thank you for this feedback. This is not available in the current version of Parabricks, but I’ve sent this as a feature request to the engineering team for future versions.