PARABRICKS mem from pbrun germline command hanging and not finishing

Our pbrun germline command seems to be stuck. We run this job on our high-performance compute cluster, requesting a
maximum of 64 cores, 4 GPUs and 175 GB of memory.
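
For context, the resource request looks roughly like the sketch below (illustrative SLURM-style directives only; our actual submission script and scheduler options may differ):

#!/bin/bash
#SBATCH --cpus-per-task=64   # up to 64 cores
#SBATCH --gres=gpu:4         # 4 GPUs
#SBATCH --mem=175G           # 175 GB of memory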

Here’s the command in our pipeline that’s stuck.

pbrun germline \
  --ref Homo_sapiens_assembly38.fasta \
  --in-fq \
    220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz \
    220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz \
    '@RG\tID:SAM001-FAM001_220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tLB:SAM001-FAM001\tPL:Illumina\tPU:SAM001-FAM001_220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tSM:SAM001-FAM001' \
  --in-fq \
    220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz \
    220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz \
    '@RG\tID:SAM001-FAM001_220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tLB:SAM001-FAM001\tPL:Illumina\tPU:SAM001-FAM001_220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000\tSM:SAM001-FAM001' \
  --knownSites Mills_and_1000G_gold_standard.indels.hg38.vcf_original \
  --out-variants align/SAM001-FAM001.vcf \
  --out-recal-file align/SAM001-FAM001.report.txt \
  --out-bam align/SAM001-FAM001.bam \
  --memory-limit 175

------------------------------------------------------------------------------
||                 Parabricks accelerated Genomics Pipeline                 ||
||                              Version 3.6.1-1                             ||
||                       GPU-BWA mem, Sorting Phase-I                       ||
||                  Contact: Parabricks-Support@nvidia.com                  ||
------------------------------------------------------------------------------


WARNING
The system has 188 GB, however recommended RAM with 4 GPU is 196 GB.
The run might not finish or might have less than expected performance.
[M::bwa_idx_load_from_disk] read 3171 ALT contigs

GPU-BWA mem
ProgressMeter   Reads           Base Pairs Aligned
[16:24:42]      5061134         750000000
[16:25:10]      10122258        1510000000
[16:25:38]      15183380        2280000000
[16:26:05]      20244526        3010000000
[16:26:33]      25305640        3780000000
[16:27:01]      30366776        4570000000

...

[17:16:09]      531521824       79830000000
[17:17:09]      536582750       80590000000
[17:17:10]      536582750       80600000000
[17:17:41]      541710238       81350000000
[17:18:22]      546771102       82110000000

# No more output from this point

Based on my calculations there should be (1183551056 + 128623608) / 4 = 328043666 reads (those are the line counts of
the two R1 FASTQs), so I'm not sure how the progress meter is counting reads either.
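
For reference, this is the standard FASTQ accounting behind that number (four lines per record, so read pairs = R1 line count / 4); the file name below is a placeholder:

zcat <sample>_R1.fastq.gz | wc -l          # the two R1 files gave 1183551056 and 128623608 lines
echo $(( (1183551056 + 128623608) / 4 ))   # 328043666 expected read pairs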

Monitoring this job with htop, I noticed the following (see also the snapshot commands sketched after this list):

  • Memory is pretty much full, sitting at 184/188 GB.
  • Swap is 100% full, at 10.0/10.0 GB.
  • The PARABRICKS mem command is using 100% CPU on one of the cores. The process moves to a different core
    occasionally but always pegs whichever core it is on at 100%.
  • Eventually the load average on this node climbs past 200. On a 64-CPU node that is more than 3x the core count,
    and it reaches a point where I cannot SSH into the node anymore.
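
These are the kinds of non-interactive checks that capture the same numbers (a rough sketch; process names may differ on your system):

free -h                                         # memory and swap usage on the node
uptime                                          # 1-, 5- and 15-minute load averages
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head   # top CPU consumers; the PARABRICKS mem process shows up here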

What is the PARABRICKS mem command? What is it doing, and is this expected behaviour? How long does this command
typically take to run?

Here are more numbers for fastq sizes, CPU, memory, GPUs.

220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz=2.6G
220414_A00692_0278_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz=2.5G
220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R2.fastq.gz=24G
220218_A01221_0103_ML220267_SAM001-FAM001_MAN-20220202_ILMNDNAPCRFREE_L000_R1.fastq.gz=22G

$ lscpu
CPU(s):                64
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        9.0G        178G         18M        461M        178G
Swap:            9G         78M        9.9G

$ nvidia-smi
Mon Jun 20 20:02:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:06:00.0 Off |                    0 |
| N/A   40C    P0    26W /  70W |      0MiB / 15109MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:2F:00.0 Off |                    0 |
| N/A   39C    P0    26W /  70W |      0MiB / 15109MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    26W /  70W |      0MiB / 15109MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   38C    P0    26W /  70W |      0MiB / 15109MiB |      4%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Update 2022-06-22:

For some reason, this exact same job started working again without us making any changes to the cluster node or the pipeline. After it finished, here are the observations from the successful run of the same job:

  • Memory usage never went above 62 GB (out of 188 GB on this node).
  • Swap was never used.
  • The reads progress meter finished at 652087332, roughly double the 328043666 read pairs I calculated, so the
    counter evidently includes both reads of each pair. It also means that when my job was stuck (logged above),
    alignment had not finished.

Update 2022-06-27:

Our job failed again. Re-running it with the --x1 --x3 options gave some additional output before the failure (the exit code was 255). In fact, it failed several times, and every time I restarted it, it progressed a bit further. Our pipeline processes samples one at a time; by progressing further I mean it went on to process and complete samples that it had previously failed on (the pipeline does not re-process previously successful samples). Each time it failed, the output just before the failure looked like the log below.

        ********************* Queue Report ***********************
         Size of Read Queue 10,10,10,10,
         Size of Write Queue 0
         Size of Ready Sort Queue 2646
         Size of Failed Queue 0
         Size of Death Queue 0
        *********************************************************
        Border SAM Processing chunk 8524 on GPU 1 QueueSize 0, numReadsInChunk 66596
        Finish Processing chunk 8518 on GPU 0
        SamGenerator 0 number of bam1_tt in bamtt_cashe  11241911
        Processing chunk 8547 on GPU 0
        Finish Processing chunk 8516 on GPU 1
        SamGenerator 1 number of bam1_tt in bamtt_cashe  10983331
        Border SAM Processing chunk 8529 on GPU 2 QueueSize 0, numReadsInChunk 66596
        Border SAM Processing chunk 8532 on GPU 3 QueueSize 0, numReadsInChunk 66594
        sorting Chunk 66600 Done
        Processing chunk 8549 on GPU 1
        Border SAM Processing chunk 8530 on GPU 0 QueueSize 0, numReadsInChunk 66596
        Finish Processing chunk 8521 on GPU 2
        SamGenerator 2 number of bam1_tt in bamtt_cashe  11372037
        Processing chunk 8548 on GPU 2
        Finish Processing chunk 8523 on GPU 3
        SamGenerator 3 number of bam1_tt in bamtt_cashe  11173198
        Processing chunk 8550 on GPU 3
        Border SAM Processing chunk 8528 on GPU 1 QueueSize 0, numReadsInChunk 66596
        sorting Chunk 66596 Done
        Finish Processing chunk 8522 on GPU 0
        SamGenerator 0 number of bam1_tt in bamtt_cashe  11251267
        Processing chunk 8551 on GPU 0
        Finish Processing chunk 8520 on GPU 1
        SamGenerator 1 number of bam1_tt in bamtt_cashe  10989273
        Border SAM Processing chunk 8533 on GPU 2 QueueSize 0, numReadsInChunk 66594
        sorting Chunk 66596 Done
        Border SAM Processing chunk 8535 on GPU 3 QueueSize 0, numReadsInChunk 66594
        Border SAM Processing chunk 8534 on GPU 0 QueueSize 0, numReadsInChunk 66594
        sorting Chunk 66594 Done
        Finish Processing chunk 8527 on GPU 3
        SamGenerator 3 number of bam1_tt in bamtt_cashe  11181521
        Processing chunk 8554 on GPU 3
        Processing chunk 8553 on GPU 1
        Finish Processing chunk 8525 on GPU 2
        SamGenerator 2 number of bam1_tt in bamtt_cashe  11380370
        Processing chunk 8552 on GPU 2
        Border SAM Processing chunk 8531 on GPU 1 QueueSize 0, numReadsInChunk 66596
        For technical support, updated user guides and other Parabricks documentation can be found at https://docs.nvidia.com/clara/#parabricks
        Answers to most FAQ’s can be found on the developer forum https://forums.developer.nvidia.com/c/healthcare/Parabricks/290
        Customers with paid Parabricks licenses have direct access to support and can contact EnterpriseSupport@nvidia.com
        Users of free evaluation licenses can contact parabricks-eval-support@nvidia.com for troubleshooting any questions.
        Exiting...

        Could not run fq2bam as part of germline pipeline
        Exiting pbrun ...

Hey @tommy.li,

It could be a memory issue. Can you try running on 2 GPUs instead of 4? A 4-GPU run can sometimes use more than 175 GB of memory. I will loop in the wider team and see if they have anything to add, too.
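
For example, something along these lines (assuming your Parabricks version exposes the common --num-gpus option; please check pbrun germline --help to confirm the flag):

pbrun germline \
  --ref Homo_sapiens_assembly38.fasta \
  --in-fq <R1.fastq.gz> <R2.fastq.gz> '<read group>' \
  --knownSites Mills_and_1000G_gold_standard.indels.hg38.vcf_original \
  --out-variants align/SAM001-FAM001.vcf \
  --out-recal-file align/SAM001-FAM001.report.txt \
  --out-bam align/SAM001-FAM001.bam \
  --num-gpus 2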

Thanks!

Hi @tommy.li,

Does this issue happen with only one sample? Where the input lives is not that important,
but the temp directory really is, and local disk versus shared storage might make a difference here.
Does it hang every time?

Thank you.

It could be a memory issue. Can you try running on 2 GPUs instead of 4? A 4-GPU run can sometimes use more than 175 GB of memory. I will loop in the wider team and see if they have anything to add, too.

We have ordered more memory and will be getting it soon. That should help us isolate or rule out whether this is a memory issue.

Our batch run has 42 genome samples, but our pipeline processes them through Parabricks one at a time. This has happened to more than one sample, so it's not a data issue. In fact, I managed to complete the batch after a few restarts (first run: 32 samples went through; restart: another 5; restart: another 4; restart: the final one went through).

When you say the “temp directory is really important”, does Parabricks use /tmp, or is this configured somewhere? Also, how much space is needed for this temp directory?
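
If it is configurable, I am guessing it would be something along these lines, pointing the run at node-local scratch instead of shared storage (assuming a --tmp-dir style option exists; please correct me if the flag or the default location is different):

pbrun germline \
  <same options as our command above> \
  --tmp-dir /path/to/local/scratch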