robin_hood::map overflow in GATK HaplotypeCaller (Parabricks 4.0.0)

Dear Parabricks developers,

I ran haplotypecaller (Parabricks 4.0.0) using the commands below:
pbrun haplotypecaller --gvcf --ref ref/Homo_sapiens_assembly38.fasta \
    --in-bam output/cram/SC501096.cram --out-variants output/gatkhc_pb/SC501096.gatk.g.vcf
bgzip -c output/gatkhc_pb/SC501096.gatk.g.vcf > output/gatkhc_pb/SC501096.gatk.g.vcf.gz
tabix output/gatkhc_pb/SC501096.gatk.g.vcf.gz

And I got a segmentation fault with the error message below:

[PB Info 2023-Nov-03 15:42:46] chrY:11304001 95.5 16680519 174665
[PB Info 2023-Nov-03 15:42:56] chrY:56832001 95.7 16683660 174393
terminate called recursively
terminate called recursively
terminate called after throwing an instance of 'std::overflow_error'
what(): robin_hood::map overflow
[PB Error 2023-Nov-03 15:43:03][-unknown-:0] Received signal: 6
[PB Error 2023-Nov-03 15:43:03][-unknown-:0] [PB Error 2023-Nov-03 15:43:03][-unknown-:0] Received signal: 6
[PB Error 2023-Nov-03 15:43:03][-unknown-:0] Received signal: 11
For technical support visit Help - NVIDIA Docs, exiting.
[PB Error 2023-Nov-03 15:43:03][-unknown-:0] Received signal: 11
For technical support visit Help - NVIDIA Docs, exiting.
Segmentation fault (core dumped)

I repeated the run several times and got similar errors, as below:

[PB Info 2023-Nov-03 17:54:13] chrY:56827201 98.7 16631689 168564
terminate called recursively
terminate called after throwing an instance of 'std::overflow_error'
what(): robin_hood::map overflow
[PB Error 2023-Nov-03 17:54:16][-unknown-:0] Received signal: 6
For technical support visit [PB Error 2023-Nov-03 17:54:16][-unknown-:0] Received signal: 6
For technical support visit Help - NVIDIA Docs, exiting.
[PB Error 2023-Nov-03 17:54:16][-unknown-:0] Received signal: 11
terminate called recursively
[PB Error 2023-Nov-03 17:54:16][-unknown-:0] Received signal: 6
[PB Error 2023-Nov-03 17:54:16][src/likehood_test.cu:654] cudaSafeCall() failed: driver shutting down, exiting.
[PB Warning 2023-Nov-03 17:54:16][src/regions.cpp:2780] Haplotype length 354 < kmerSize 1446944784

[PB Error 2023-Nov-03 17:54:17][src/likehood_test.cu:654] cudaSafeCall() failed: driver shutting down, exiting.
terminate called recursively
[PB Error 2023-Nov-03 17:54:17][-unknown-:0] Received signal: 6
[PB Info 2023-Nov-03 17:54:23] chrY:56827201 98.8 16632898 168292
[PB Error 2023-Nov-03 17:54:28][-unknown-:0] Received signal: 11

For technical support visit Help - NVIDIA Docs, exiting.
[PB Error 2023-Nov-03 17:54:28][-unknown-:0] Received signal: 11
[PB Info 2023-Nov-03 17:54:33] chrY:56827201 99.0 16632898 168009
[PB Info 2023-Nov-03 17:54:43] chrY:56827201 99.2 16632898 167726

[PB Info 2023-Nov-06 14:04:08] chrY:56827201 4248.5 16632898 3915
[PB Info 2023-Nov-06 14:04:18] chrY:56827201 4248.7 16632898 3914
[PB Info 2023-Nov-06 14:04:28] chrY:56827201 4248.8 16632898 3914
[PB Info 2023-Nov-06 14:04:38] chrY:56827201 4249.0 16632898 3914
[PB Info 2023-Nov-06 14:04:48] chrY:56827201 4249.2 16632898 3914

Unlike the previous run, this job kept running forever (without a segmentation fault).

Below is the information about the NVIDIA driver installed on our server:
nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-d0bd9105-731d-58af-2b04-27ca2770e0e2)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-ba02d87e-2d94-d55f-0504-cf980c663070)

nvidia-smi
Mon Nov 6 14:06:40 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                  Off |
| N/A   35C    P0    59W / 300W |  13026MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                  Off |
| N/A   35C    P0    56W / 300W |  13026MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3933025      C   ...bricks/binaries//bin/htvc     13019MiB |
|    1   N/A  N/A   3933025      C   ...bricks/binaries//bin/htvc     13019MiB |
+-----------------------------------------------------------------------------+

Your help is greatly appreciated. Thanks,

Wei

Wei Zhu

Hello @zhuw10,

Thank you for posting about your error. I have a few questions / requests:

  1. Do you need version 4.0.0, or would it be possible to use the latest version, 4.2.0? (See the example pull command after this list.)
  2. You can add the --low-memory flag and see if you get the same error.
  3. Would it be possible for you to run on a newer driver? It looks like you're using 470, and the latest driver version is 525.
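
For reference, pulling and running the newer release could look something like the sketch below (assuming the standard NGC path for the Parabricks container; adjust the tag and mounts to your setup):

# Pull the Parabricks 4.2.0 container from NGC (path/tag assumed; verify on the NGC catalog)
docker pull nvcr.io/nvidia/clara/clara-parabricks:4.2.0-1

# Run haplotypecaller from the container, mounting the current working directory
docker run --rm --gpus all -v $(pwd):/workdir -w /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.2.0-1 \
    pbrun haplotypecaller --gvcf --ref ref/Homo_sapiens_assembly38.fasta \
    --in-bam output/cram/SC501096.cram \
    --out-variants output/gatkhc_pb/SC501096.gatk.g.vcf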

Thank you.

Thank you for the tips and the prompt reply. I cannot update the driver, as our HPC cluster is shared by many different users and I am just one of them. I may test the newer version and/or the --low-memory flag.

I had not run into such a problem before when using BAM files as input. I wonder whether there is some issue related to CRAM input files when using GATK HaplotypeCaller in Parabricks.
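
In case it helps with diagnosis, below is a minimal sketch of how I could sanity-check the CRAM (standard samtools/GATK commands; the paths are the ones from my command above):

# Basic integrity check (exits non-zero if the CRAM is truncated or corrupt)
samtools quickcheck -v output/cram/SC501096.cram

# Inspect the header: @RG read-group lines and @SQ reference sequences
samtools view -H output/cram/SC501096.cram | grep -E '^@(RG|SQ)' | head

# Deeper validation against the same reference used for calling
gatk ValidateSamFile -I output/cram/SC501096.cram \
    -R ref/Homo_sapiens_assembly38.fasta --MODE SUMMARY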

Thanks,
Wei

By the way, I cannot find any option to use a "--low-memory" flag in pbrun haplotypecaller. Could you give some details?

Thanks,
Wei

Hi @zhuw10,

The tool should be able to handle both CRAM and BAM files, unless there is something wrong with the file itself. Can you tell me more about what’s in this CRAM file? Sequencing coverage, size in GB, etc.?

And apologies, the low-memory option is not available for this tool.

You could also try using the --run-partition flag to break up the processing of the file and see if that helps.
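
For example, re-running with that option could look like the sketch below (the flag spelling here is an assumption; please check pbrun haplotypecaller --help for the exact name and any partition-count argument):

pbrun haplotypecaller --gvcf --run-partition \
    --ref ref/Homo_sapiens_assembly38.fasta \
    --in-bam output/cram/SC501096.cram \
    --out-variants output/gatkhc_pb/SC501096.gatk.g.vcf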

I figured out the cause of the issue: "picard AddOrReplaceReadGroups" had been used to replace the read groups and also to save the output as CRAM files, and the resulting CRAM files were problematic. After regenerating the CRAM files with samtools, everything works fine now.
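
For anyone who hits the same problem, here is a minimal sketch of the kind of samtools-based regeneration that worked for me (file names and read-group fields are illustrative, not my exact pipeline):

# Set/replace the read group with samtools rather than writing CRAM from Picard
samtools addreplacerg -r 'ID:SC501096' -r 'SM:SC501096' -r 'PL:ILLUMINA' \
    -o SC501096.rg.bam SC501096.bam

# Convert to CRAM against the same reference used downstream, then index
samtools view -C -T ref/Homo_sapiens_assembly38.fasta \
    -o output/cram/SC501096.cram SC501096.rg.bam
samtools index output/cram/SC501096.cram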

Thanks for your help anyway, and please close this issue.

Thanks,

Wei
