pbbwa hangs and doesn't write the BAM file, but memory stays occupied

Hi,

I am running a series of fq2bam + deepvariant jobs. Some of the jobs hang and never produce a BAM file. Checking the job status, I can see that the job hangs in the pbbwa process and never reaches the postsort step.

The stdout shows the following:

[PB Info 2025-Jan-30 10:11:41] Single-ended recovery mode for batch with 885489664 (both ends) reads before itself
[PB Info 2025-Jan-30 10:11:41] Single-ended recovery mode for batch with 885538816 (both ends) reads before itself
[PB Info 2025-Jan-30 10:11:41] Single-ended recovery mode for batch with 885522432 (both ends) reads before itself
[PB Info 2025-Jan-30 10:11:41] Single-ended recovery mode for batch with 885555200 (both ends) reads before itself
[PB Info 2025-Jan-30 10:11:41] Single-ended recovery mode for batch with 885571584 (both ends) reads before itself
[PB Info 2025-Jan-30 10:11:50] # 100  4 100 59 101   0 pool:  6 132287419383 bases/GPU/minute: 52131312.0 
[PB Info 2025-Jan-30 10:12:00] # 100  4 100 59 101   0 pool:  6 132287419383 bases/GPU/minute: 0.0 
[PB Info 2025-Jan-30 10:12:10] # 100  4 100 59 101   0 pool:  6 132287419383 bases/GPU/minute: 0.0 
[PB Info 2025-Jan-30 10:12:20] # 100  4 100 59 101   0 pool:  6 132287419383 bases/GPU/minute: 0.0

This output continues forever.
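
In case it helps anyone spot the same condition automatically, here is a minimal sketch that scans the stdout for consecutive progress lines reporting 0.0 bases/GPU/minute. It is not part of Parabricks: the function name `find_stall` and the threshold `min_zero_lines` are hypothetical, and the regex assumes the progress lines look exactly like the ones in the log above.

```python
import re

# Assumed format, taken from the progress lines in the log above:
# [PB Info 2025-Jan-30 10:12:00] # 100  4 ... bases/GPU/minute: 0.0
PROGRESS_RE = re.compile(
    r"^\[PB Info (?P<ts>[^\]]+)\].*bases/GPU/minute:\s*(?P<rate>\d+(?:\.\d+)?)"
)


def find_stall(lines, min_zero_lines=3):
    """Return the timestamp of the first progress line in a run of at least
    min_zero_lines consecutive zero-throughput progress lines, or None.

    min_zero_lines is a hypothetical threshold; tune it to the log interval.
    """
    run_start = None
    run_len = 0
    for line in lines:
        match = PROGRESS_RE.match(line)
        if match is None:
            continue  # skip lines that are not progress reports
        if float(match.group("rate")) == 0.0:
            if run_len == 0:
                run_start = match.group("ts")
            run_len += 1
            if run_len >= min_zero_lines:
                return run_start
        else:
            # Throughput recovered, so reset the run
            run_len = 0
            run_start = None
    return None
```

Calling `find_stall(open("fq2bam.log"))` (the log file name is only an example) returns the timestamp at which throughput first dropped to zero, which could then be used to kill and resubmit the job instead of letting it hold the GPU.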

sudo journalctl -k --since "60 minutes ago"

shows:

Jan 30 11:12:09 vs2000.vll.se kernel: audit: type=1400 audit(1738231929.932:14528): apparmor="ALLOWED" operation="open" class="file" profile="/usr/sbin/sssd" name>
Jan 30 11:17:01 vs2000.vll.se kernel: audit: type=1400 audit(1738232221.857:14529): apparmor="ALLOWED" operation="open" class="file" profile="/usr/sbin/sssd" name>
Jan 30 11:17:01 vs2000.vll.se kernel: audit: type=1400 audit(1738232221.859:14530): apparmor="ALLOWED" operation="open" class="file" profile="/usr/sbin/sssd" name>

The process does not release its memory, though: it still occupies ~20% of GPU RAM. Of the two processors, the job uses 1899% CPU on one.

I am running the program on dual H100 GPUs with 384 GB of RAM, on Ubuntu 22.

Any suggestions on how to fix this?

Thanks,
snandi