Hi
I am trying to run fq2bam using this tutorial here: Tutorials - NVIDIA Docs
I wanted to check whether the moderators @gburnett and/or @andjoseph have tried the steps from those tutorials on an AWS g4dn.12xlarge or any other AWS g4 instance.
Like @huyen.nguyen, I am getting a similar error. I am using a g4dn.12xlarge with nvidia/clara/clara-parabricks:4.1.1-1.
Here is the nvidia-smi info
nvidia-smi
Fri Jun 23 06:06:22 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1B.0 Off | 0 |
| N/A 46C P0 28W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:1C.0 Off | 0 |
| N/A 43C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:00:1D.0 Off | 0 |
| N/A 45C P0 28W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 42C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This is the stderr of the fq2bam job:
pbrun fq2bam \
--in-fq "sample_1.fq.gz" "sample_2.fq.gz" \
--ref $INDEX \
--out-bam "Tutorial_Sample_1.pb.bam" \
--logfile "Tutorial_Sample_1.FQ2BAM.log.txt" \
--out-duplicate-metrics "Tutorial_Sample_1.duplicates_metrics.txt" \
--num-gpus 4
[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for parabricks_example_data/sample_1.fq.gz
and parabricks_example_data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2023-Jun-23 06:06:29] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:06:29] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Jun-23 06:06:29] || Version 4.1.1-1 ||
[PB Info 2023-Jun-23 06:06:29] || GPU-BWA mem, Sorting Phase-I ||
[PB Info 2023-Jun-23 06:06:29] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Warning 2023-Jun-23 06:06:42][ParaBricks/src/pbOpts.cu:245]
WARNING
The system has 186 GB, however recommended RAM with 4 GPU is 196 GB.
The run might not finish or might have less than expected performance.
[PB Info 2023-Jun-23 06:06:43] GPU-BWA mem
[PB Info 2023-Jun-23 06:06:43] ProgressMeter Reads Base Pairs Aligned
[PB Warning 2023-Jun-23 06:06:55][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/samGenerator.cu/771: out of memory
[PB Warning 2023-Jun-23 06:06:55][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/samGenerator.cu/771: out of memory
[PB Warning 2023-Jun-23 06:06:55][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/samGenerator.cu/771: out of memory
[PB Warning 2023-Jun-23 06:06:55][ParaBricks/src/check_error.cu:41] cudaSafeCall() failed at ParaBricks/src/samGenerator.cu/771: out of memory
[PB Error 2023-Jun-23 06:06:55][ParaBricks/src/check_error.cu:44] No GPUs active, shutting down due to previous error., exiting.
For technical support visit https://docs.nvidia.com/clara/parabricks/4.1.0/Help.html
Exiting...
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation
Could not run fq2bam
Exiting pbrun ...
However, with nvidia/clara/clara-parabricks:4.0.1-1 the same command on the same machine runs fine, so the problem appears to be specific to 4.1.1-1.
For the sake of completeness, here is the output when using nvidia/clara/clara-parabricks:4.0.1-1:
nvidia-smi
Fri Jun 23 06:46:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1B.0 Off | 0 |
| N/A 47C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:1C.0 Off | 0 |
| N/A 46C P0 26W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:00:1D.0 Off | 0 |
| N/A 48C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 44C P0 26W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
pbrun fq2bam \
--in-fq "sample_1.fq.gz" "sample_2.fq.gz" \
--ref $INDEX \
--out-bam "Tutorial_Sample_1.pb.bam" \
--logfile "Tutorial_Sample_1.FQ2BAM.log.txt" \
--out-duplicate-metrics "Tutorial_Sample_1.duplicates_metrics.txt" \
--num-gpus 4
[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for parabricks_example_data/sample_1.fq.gz
and parabricks_example_data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2023-Jun-23 06:46:21] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:46:21] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Jun-23 06:46:21] || Version 4.0.1-1 ||
[PB Info 2023-Jun-23 06:46:21] || GPU-BWA mem, Sorting Phase-I ||
[PB Info 2023-Jun-23 06:46:21] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Warning 2023-Jun-23 06:46:35][ParaBricks/src/pbOpts.cu:316]
WARNING
The system has 186 GB, however recommended RAM with 4 GPU is 196 GB.
The run might not finish or might have less than expected performance.
[PB Info 2023-Jun-23 06:46:36] GPU-BWA mem
[PB Info 2023-Jun-23 06:46:36] ProgressMeter Reads Base Pairs Aligned
[PB Info 2023-Jun-23 06:46:50] 5043564 560000000
[PB Info 2023-Jun-23 06:46:56] 10087128 1180000000
[PB Info 2023-Jun-23 06:47:03] 15130692 1720000000
[PB Info 2023-Jun-23 06:47:09] 20174256 2340000000
[PB Info 2023-Jun-23 06:47:16] 25217820 2890000000
[PB Info 2023-Jun-23 06:47:22] 30261384 3460000000
[PB Info 2023-Jun-23 06:47:28] 35304948 4060000000
[PB Info 2023-Jun-23 06:47:35] 40348512 4650000000
[PB Info 2023-Jun-23 06:47:41] 45392076 5200000000
[PB Info 2023-Jun-23 06:47:48] 50435640 5820000000
[PB Info 2023-Jun-23 06:47:58]
GPU-BWA Mem time: 82.209722 seconds
[PB Info 2023-Jun-23 06:47:58] GPU-BWA Mem is finished.
[main] CMD: /usr/local/parabricks/binaries//bin/bwa mem -Z ./pbOpts.txt parabricks_reference_data/fasta/Homo_sapiens_assembly38.fasta parabricks_example_data/sample_1.fq.gz parabricks_example_data/sample_2.fq.gz @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[main] Real time: 96.616 sec; CPU: 3141.947 sec
[PB Info 2023-Jun-23 06:47:58] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:47:58] || Program: GPU-BWA mem, Sorting Phase-I ||
[PB Info 2023-Jun-23 06:47:58] || Version: 4.0.1-1 ||
[PB Info 2023-Jun-23 06:47:58] || Start Time: Fri Jun 23 06:46:21 2023 ||
[PB Info 2023-Jun-23 06:47:58] || End Time: Fri Jun 23 06:47:58 2023 ||
[PB Info 2023-Jun-23 06:47:58] || Total Time: 1 minute 37 seconds ||
[PB Info 2023-Jun-23 06:47:58] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:00] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:00] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Jun-23 06:48:00] || Version 4.0.1-1 ||
[PB Info 2023-Jun-23 06:48:00] || Sorting Phase-II ||
[PB Info 2023-Jun-23 06:48:00] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:00] progressMeter - Percentage
[PB Info 2023-Jun-23 06:48:00] 0.0 0.00 GB
[PB Info 2023-Jun-23 06:48:10] Sorting and Marking: 10.000 seconds
[PB Info 2023-Jun-23 06:48:10] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:10] || Program: Sorting Phase-II ||
[PB Info 2023-Jun-23 06:48:10] || Version: 4.0.1-1 ||
[PB Info 2023-Jun-23 06:48:10] || Start Time: Fri Jun 23 06:48:00 2023 ||
[PB Info 2023-Jun-23 06:48:10] || End Time: Fri Jun 23 06:48:10 2023 ||
[PB Info 2023-Jun-23 06:48:10] || Total Time: 10 seconds ||
[PB Info 2023-Jun-23 06:48:10] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:10] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:10] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Jun-23 06:48:10] || Version 4.0.1-1 ||
[PB Info 2023-Jun-23 06:48:10] || Marking Duplicates, BQSR ||
[PB Info 2023-Jun-23 06:48:10] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:10] progressMeter - Percentage
[PB Info 2023-Jun-23 06:48:20] 55.3 8.58 GB
[PB Info 2023-Jun-23 06:48:30] 100.0 0.00 GB
[PB Info 2023-Jun-23 06:48:30] BQSR and writing final BAM: 20.035 seconds
[PB Info 2023-Jun-23 06:48:30] ------------------------------------------------------------------------------
[PB Info 2023-Jun-23 06:48:30] || Program: Marking Duplicates, BQSR ||
[PB Info 2023-Jun-23 06:48:30] || Version: 4.0.1-1 ||
[PB Info 2023-Jun-23 06:48:30] || Start Time: Fri Jun 23 06:48:10 2023 ||
[PB Info 2023-Jun-23 06:48:30] || End Time: Fri Jun 23 06:48:30 2023 ||
[PB Info 2023-Jun-23 06:48:30] || Total Time: 20 seconds ||
[PB Info 2023-Jun-23 06:48:30] ------------------------------------------------------------------------------
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation
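As evident from the nvidia-smi output of both runs above, the driver is the same (470.182.03) in each case, but the 4.1.1-1 container reports CUDA Version 12.0 while the 4.0.1-1 container reports CUDA Version 11.7.
So it could very well be that CUDA 12.0 is what is causing the out-of-memory failure with nvidia/clara/clara-parabricks:4.1.1-1.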
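
To make the comparison explicit, here is a minimal sketch that pulls the driver and CUDA versions out of the two nvidia-smi header lines pasted above. The variable names and the extract_field helper are only illustrative and not part of Parabricks. This confirms what differs between the two containers; it does not prove that CUDA 12.0 is the cause.

```shell
#!/bin/sh
# Header lines copied from the two nvidia-smi outputs in this post.
header_411='| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 12.0 |'
header_401='| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.7 |'

# extract_field LINE LABEL: print the version number that follows "LABEL:".
# (Hypothetical helper, written just for this comparison.)
extract_field() {
    printf '%s\n' "$1" | sed -n "s/.*$2: *\([0-9.]*\).*/\1/p"
}

driver_411=$(extract_field "$header_411" "Driver Version")
cuda_411=$(extract_field "$header_411" "CUDA Version")
driver_401=$(extract_field "$header_401" "Driver Version")
cuda_401=$(extract_field "$header_401" "CUDA Version")

echo "4.1.1-1: driver=$driver_411 cuda=$cuda_411"
echo "4.0.1-1: driver=$driver_401 cuda=$cuda_401"

# Same host driver, different CUDA version reported inside each container.
if [ "$driver_411" = "$driver_401" ] && [ "$cuda_411" != "$cuda_401" ]; then
    echo "Driver is identical; only the reported CUDA version differs."
fi
```

On a live machine the same header line can be obtained inside each container with `nvidia-smi | grep "CUDA Version"`.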