Clara Parabricks 4.0.0-1 and 4.2.0-1: STAR-equivalent parameters? No speedup observed in rna_fq2bam compared to equivalent STAR

Following the documentation for rna_fq2bam (rna_fq2bam - NVIDIA Docs), I couldn't locate a parameter equivalent to --outSAMtype, although the documentation does suggest the following commands are equivalent:

$ docker run --rm --gpus all --volume <INPUT_DIR>:/workdir --volume <OUTPUT_DIR>:/outputdir \
    -w /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 \
    pbrun rna_fq2bam \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --genome-lib-dir /workdir/${PATH_TO_GENOME_LIBRARY}/ \
    --output-dir /outputdir/${PATH_TO_OUTPUT_DIRECTORY} \
    --ref /workdir/${REFERENCE_FILE} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --read-files-command zcat
# STAR Alignment
$ ./STAR \
      --genomeDir <INPUT_DIR>/${PATH_TO_GENOME_LIBRARY} \
      --readFilesIn <INPUT_DIR>/${INPUT_FASTQ_1} <INPUT_DIR>/${INPUT_FASTQ_2} \
      --outFileNamePrefix <OUTPUT_DIR>/${PATH_TO_OUTPUT_DIRECTORY}/ \
      --outSAMtype BAM SortedByCoordinate \
      --readFilesCommand zcat

So I assume these parameters are set to be identical between the STAR and rna_fq2bam calls on paired-end 101 bp FASTQ. I do see an 8-9x speedup running these versions on the cluster on a 4-GPU node with 120 GB RAM (45 min with STAR on CPU vs. 5-7 min with rna_fq2bam on GPU).
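
(For reference, a minimal sketch of how such timings could be collected; the time_run helper and GNU time at /usr/bin/time are assumptions, not part of the pipeline.)

# Hypothetical helper: wrap a command with GNU time to capture wall-clock time and peak memory.
# Note that stderr of the wrapped command is also captured in the log file.
time_run () {
    local label=$1; shift
    /usr/bin/time -v "$@" 2> "${label}.time.log"
    grep -E "Elapsed|Maximum resident" "${label}.time.log"
}
# e.g.  time_run gpu pbrun rna_fq2bam ...    and    time_run cpu STAR ...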

However, using the parameters from our pipeline, both the GPU and CPU runs took about 1.5 hrs to process the same dataset (longer because the FASTQ reads are 150 bp). Could anyone please help with what might cause this and how to address it? I need to keep the same parameters (with --sjdb-overhang 149 and --two-pass-mode Basic) to match the pipeline. Please see the commands run below.

singularity exec --nv ${workdir}${SINGULARITY} /bin/bash -c "nvidia-smi;
    pbrun rna_fq2bam --num-threads ${NUM_THREADS} \
    --max-bam-sort-memory ${MAX_BAM_SORT_MEMORY} \
    --two-pass-mode ${TWO_PASS_MODE} \
    --genome-lib-dir ${PATH_TO_GENOME_LIBRARY} \
    --ref ${REFERENCE_FILE} \
    --output-dir ${PATH_TO_OUTPUT_DIRECTORY} \
    --out-bam ${PATH_TO_OUTPUT_DIRECTORY}${OUTPUT_BAM} \
    --in-fq ${INPUT_FASTQ_1} ${INPUT_FASTQ_2} \
    --out-sam-unmapped Within \
    --out-sam-attributes Standard \
    --read-files-command zcat \
    --sjdb-overhang 149"

And with STAR:

alias STAR='~/Project/RNAseq/bulk_RNAseq_gpu/STAR_2_7_2a/STAR-2.7.2a/bin/Linux_x86_64/STAR'
STAR \
    --runThreadN ${NUM_THREADS} \
    --limitBAMsortRAM ${MAX_BAM_SORT_MEMORY} \
    --runMode alignReads \
    --twopassMode ${TWO_PASS_MODE} \
    --genomeDir ${PATH_TO_GENOME_LIBRARY} \
    --outFileNamePrefix ${aligned_dir}${OUTPUT_BAM_CPU_prefix} \
    --outSAMtype BAM SortedByCoordinate \
    --readFilesIn ${INPUT_FASTQ_1} ${INPUT_FASTQ_2} \
    --outSAMunmapped Within \
    --outSAMattributes Standard \
    --readFilesCommand zcat

Note that the STAR version was chosen to match the documentation. I am happy to post complete logs with details.
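
(For completeness, a quick way to double-check that the binary reports the expected version; the path below is the same one aliased above.)

# Print the STAR version to confirm it matches the 2.7.2a used in the Parabricks documentation.
~/Project/RNAseq/bulk_RNAseq_gpu/STAR_2_7_2a/STAR-2.7.2a/bin/Linux_x86_64/STAR --version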

Hi @vaibhav.a.janve,

We don’t have --outSAMtype in Parabricks because we always sort the BAM before returning it; we never output an unsorted file. And the output will always be either a BAM or a CRAM.

You are also correct that the speedup is the same for this tool between Parabricks 4.0 and 4.2.
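
(If it helps, a quick way to confirm the sort order of the BAM that rna_fq2bam returns, assuming samtools is installed; the file name is a placeholder.)

# Inspect the BAM header; a coordinate-sorted file reports SO:coordinate in the @HD line.
samtools view -H GPU_output.bam | grep '^@HD'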

Thank you @gburnett. Let me re-run and post the run times and parameters for STAR and rna_fq2bam. The speedups we see are smaller than the expected 8-9x.

The rna_fq2bam run now errors out after 3.5 hrs. Not sure what's wrong.

singularity exec --nv /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/clara-parabricks_4.2.0-1.sif /bin/bash -c "nvidia-smi;
    pbrun rna_fq2bam --num-threads 4 \
    --max-bam-sort-memory 0 \
    --two-pass-mode Basic \
    --genome-lib-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/genome_index/GRCh38_r110_bp150/ \
    --ref /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/101bp_r110/Homo_sapiens.GRCh38.110.dna_sm.primary_assembly.fa \
    --output-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/VMAP_test/4677-LD-100_S1_L001/Aligned/ \
    --out-bam /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/VMAP_test/4677-LD-100_S1_L001/Aligned/GPU_4677-LD-100_S1_L001_2024-01-12T00:21:44-0600.bam \
    --in-fq /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/VMAP_test/fastq/4677-LD-100_S1_L001/4677-LD-100_S1_L001_R1_001.fastq.gz /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/VMAP_test/fastq/4677-LD-100_S1_L001/4677-LD-100_S1_L001_R2_001.fastq.gz \
    --out-sam-unmapped Within \
    --out-sam-attributes Standard \
    --read-files-command zcat \
    --sjdb-overhang 149 \
    --verbose --x3 \
    --logfile /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/VMAP_test/4677-LD-100_S1_L001/Aligned/clara-parabricks_4p2p01-1_4677-LD-100_S1_L001_150bp_r110_2024-01-12T00:21:45-0600.log"

The Log:
[PB Info 2024-Jan-12 09:23:22] ------------------------------------------------------------------------------
[PB Info 2024-Jan-12 09:23:22] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2024-Jan-12 09:23:22] || Version 4.2.0-1 ||
[PB Info 2024-Jan-12 09:23:22] || star ||
[PB Info 2024-Jan-12 09:23:22] ------------------------------------------------------------------------------
[PB Info 2024-Jan-12 09:23:22] … started STAR run
[PB Info 2024-Jan-12 09:23:22] … loading genome
[PB Info 2024-Jan-12 09:24:57] read from genomeDir done 94.447
[PB Info 2024-Jan-12 09:24:57] Gpu num:4 Cpu thread num: 4
[PB Info 2024-Jan-12 09:24:57] … started 1st pass mapping
[PB Info 2024-Jan-12 09:25:07] gpu free memory: 45786988544, total 51041271808
cudaSuccess
[PB Info 2024-Jan-12 09:25:07] gpu free memory: 36823760896, total 51041271808
cudaSuccess
[PB Info 2024-Jan-12 09:25:08] setGpuConstMem: gpu const memory usage 193 bytes
[PB Info 2024-Jan-12 09:25:08] cudaSuccess
[PB Info 2024-Jan-12 09:25:12] gpu free memory: 45772308480, total 51041271808

[PB Info 2024-Jan-12 09:26:59] GPU_0
[PB Info 2024-Jan-12 09:26:59] ReadQueue:10, Stage0:10, Stage1:3, Stage1InnerVec:195, Stage1InnerChunk:1, copyQ:5, poolQ:2, Stage2:14,
[PB Info 2024-Jan-12 09:26:59] Thread 0 windowPool:105, transPool:400, Thread 1 windowPool:414, transPool:400, Thread 2 windowPool:119, transPool:400, Thread 3 windowPool:205, transPool:400,
[PB Info 2024-Jan-12 09:26:59] GPU_1
[PB Info 2024-Jan-12 09:26:59] ReadQueue:10, Stage0:8, Stage1:2, Stage1InnerVec:2000, Stage1InnerChunk:10, copyQ:6, poolQ:0, Stage2:6,
[PB Info 2024-Jan-12 09:26:59] Thread 0 windowPool:138, transPool:400, Thread 1 windowPool:249, transPool:400, Thread 2 windowPool:16, transPool:400, Thread 3 windowPool:132, transPool:400,
[PB Info 2024-Jan-12 09:26:59] GPU_2
[PB Info 2024-Jan-12 09:26:59] ReadQueue:10, Stage0:10, Stage1:7, Stage1InnerVec:190, Stage1InnerChunk:1, copyQ:5, poolQ:2, Stage2:11,
[PB Info 2024-Jan-12 09:26:59] Thread 0 windowPool:200, transPool:400, Thread 1 windowPool:16, transPool:400, Thread 2 windowPool:69, transPool:400, Thread 3 windowPool:77, transPool:400,
[PB Info 2024-Jan-12 09:26:59] GPU_3
[PB Info 2024-Jan-12 09:26:59] ReadQueue:10, Stage0:10, Stage1:0, Stage1InnerVec:1997, Stage1InnerChunk:10, copyQ:5, poolQ:2, Stage2:10,
[PB Info 2024-Jan-12 09:26:59] Thread 0 windowPool:95, transPool:400, Thread 1 windowPool:94, transPool:400, Thread 2 windowPool:2, transPool:400, Thread 3 windowPool:29, transPool:400,
[PB Info 2024-Jan-12 09:26:59] PoolQ:2
[PB Info 2024-Jan-12 09:27:45] ReadingReads:376600
For technical support visit Clara Parabricks v4.2.0 - NVIDIA Docs
Exiting…

Could not run rna_fq2bam
Exiting pbrun …
Time elapsed: 3:24:54
CPU usage: 580%
Maximum Resident set(kB): 92456496

Any insights?