Our pbrun germline command seems to be stuck. We run this job on our high-performance compute cluster, requesting a maximum of 64 cores, 4 GPUs, and 175 GB of memory.
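For reference, the resource request looks roughly like this (a sketch assuming a SLURM-style scheduler; the directives and partition name are my shorthand, not our actual submission script):

#!/usr/bin/env bash
#SBATCH --cpus-per-task=64   # up to 64 cores
#SBATCH --gres=gpu:4         # 4 GPUs
#SBATCH --mem=175G           # 175 GB of RAM
#SBATCH --partition=gpu      # placeholder partition name

pbrun germline ...           # full command below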
Here’s the command in our pipeline that’s stuck.
pbrun germline \
--ref Homo_sapiens_assembly38.fasta \
--in-fq \
220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz \
220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz \
'@RG\tID:SAM001-FAM001_220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tLB:SAM001-FAM001\tPL:Illumina\tPU:SAM001-FAM001_220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tSM:SAM001-FAM001' \
--in-fq \
220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz \
220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz \
'@RG\tID:SAM001-FAM001_220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tLB:SAM001-FAM001\tPL:Illumina\tPU:SAM001-FAM001_220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tSM:SAM001-FAM001' \
--knownSites Mills_and_1000G_gold_standard.indels.hg38.vcf_original \
--out-variants align/SAM001-FAM001.vcf \
--out-recal-file align/SAM001-FAM001.report.txt \
--out-bam align/SAM001-FAM001.bam \
--memory-limit 175
------------------------------------------------------------------------------
||                Parabricks accelerated Genomics Pipeline                  ||
||                            Version 3.6.1-1                               ||
||                       GPU-BWA mem, Sorting Phase-I                       ||
||                  Contact: Parabricks-Support@nvidia.com                  ||
------------------------------------------------------------------------------
WARNING
The system has 188 GB, however recommended RAM with 4 GPU is 196 GB.
The run might not finish or might have less than expected performance.
[M::bwa_idx_load_from_disk] read 3171 ALT contigs
GPU-BWA mem
ProgressMeter    Reads          Base Pairs Aligned
[16:24:42]       5061134        750000000
[16:25:10]       10122258       1510000000
[16:25:38]       15183380       2280000000
[16:26:05]       20244526       3010000000
[16:26:33]       25305640       3780000000
[16:27:01]       30366776       4570000000
...
[17:16:09]       531521824      79830000000
[17:17:09]       536582750      80590000000
[17:17:10]       536582750      80600000000
[17:17:41]       541710238      81350000000
[17:18:22]       546771102      82110000000
# No more output from this point
Based on my calculations, there should be (1183551056 + 128623608) / 4 = 328043666 reads (the summed line counts of the two R1 FASTQs, divided by four lines per record), so I'm not sure how the progress meter is counting reads either. If the meter counts R1 and R2 mates separately, the expected total would be 2 × 328043666 = 656087332, which would put the stall at roughly 83%, but that's a guess on my part.
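For reference, the read count above comes from raw line counts, along the lines of the following (four lines per record in an intact FASTQ):

# Count lines in each R1 FASTQ; dividing each count by 4 gives reads per file.
zcat 220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz | wc -l   # 1183551056
zcat 220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz | wc -l   # 128623608
# (1183551056 + 128623608) / 4 = 328043666 read pairs in total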
Running htop to monitor this job, I noticed the following interesting observations (a sketch for capturing the same numbers non-interactively follows this list):
- Memory is pretty much full, sitting at 184/188 GB.
- Swap is 100% full, at 10.0/10.0 GB.
- A PARABRICKS mem command is using 100% CPU on one of the cores. The process occasionally moves between cores, but always pegs whichever core it is on at 100%.
- Eventually the load average on this node reaches 200+. On a 64-CPU node, that is more than 3x utilisation (200 / 64 ≈ 3.1), and at that point I can no longer SSH into the node.
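Since the node eventually becomes unreachable over SSH, the load, memory, and swap numbers can be logged to a file with a plain shell loop (a minimal sketch using standard coreutils; the 30-second interval is arbitrary):

#!/usr/bin/env bash
# Append a timestamped snapshot of load average, memory and swap every 30 s.
while true; do
    date +'%F %T'
    cat /proc/loadavg                 # 1/5/15-minute load averages
    free -h | grep -E '^(Mem|Swap)'
    echo '---'
    sleep 30
done >> node_metrics.log 2>&1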
What is the PARABRICKS mem command? What is it doing, and is this expected behaviour? How long does this command typically take to run?
Here are some more numbers: FASTQ sizes, CPU, memory, and GPUs.
220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz=2.6G
220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz=2.5G
220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz=24G
220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz=22G
$ lscpu
CPU(s):              64
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        9.0G        178G         18M        461M        178G
Swap:            9G         78M        9.9G
$ nvidia-smi
Mon Jun 20 20:02:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:06:00.0 Off |                    0 |
| N/A   40C    P0    26W /  70W |      0MiB / 15109MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:2F:00.0 Off |                    0 |
| N/A   39C    P0    26W /  70W |      0MiB / 15109MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    26W /  70W |      0MiB / 15109MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   38C    P0    26W /  70W |      0MiB / 15109MiB |      4%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
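In case it helps, GPU utilisation can also be sampled continuously rather than once; nvidia-smi supports CSV queries with a loop interval:

# Log timestamp, GPU index, utilisation and memory every 10 s to a CSV file.
nvidia-smi \
    --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total \
    --format=csv -l 10 >> gpu_metrics.csv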