Internal Sanitizer Error: Failed to generate coredump when allocated global memory is in a certain range

Running compute-sanitizer with the memcheck tool, using the following command line:

$ compute-sanitizer --print-level info --launch-timeout 0 --target-processes=all --generate-coredump --coredump-name /tmp/gpudump.nvcudmp --error-exitcode 1 --tool memcheck ./test
[...]
[16:57:09.942306] coredump: Writing out global memory (59403808 bytes)
[16:57:10.039514] coredump: 5%...
[16:57:10.039527] coredump: 10%...
[16:57:10.039530] coredump: 15%...
[16:57:10.039531] coredump: 20%...
[16:57:10.039533] coredump: 25%...
[16:57:10.040535] coredump: 30%...
[16:57:10.052751] coredump: 35%...
[16:57:10.137441] coredump: 40%...
[16:57:10.137451] coredump: 45%...
[16:57:10.137453] coredump: 50%...
[16:57:10.137455] coredump: 55%...
[16:57:10.137456] coredump: 60%...
[16:57:10.161302] coredump: 65%...
[16:57:10.173038] coredump: 70%...
[16:57:10.196793] coredump: 75%...
[16:57:10.208797] coredump: 80%...
========= Internal Sanitizer Error: Failed to generate coredump
=========

Coredump generation fails if the size of the allocated global memory is not a power of 2.
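As a quick sanity check of the power-of-2 observation, here is a small sketch that tests the size printed in the log (59403808 bytes). The helper names `is_power_of_two` and `describe_size` are made up for illustration and are not part of any CUDA tool.

```shell
# Succeeds (exit status 0) only when the byte count is a positive power of 2.
# A power of 2 has a single bit set, so n & (n - 1) is 0 for it.
is_power_of_two() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Print a one-line verdict for a byte count.
describe_size() {
    if is_power_of_two "$1"; then
        echo "$1 is a power of 2"
    else
        echo "$1 is not a power of 2"
    fi
}

# Global memory size reported in the coredump log above.
describe_size 59403808
# Nearest powers of 2 on either side (2^25 and 2^26), for comparison.
describe_size 33554432
describe_size 67108864
```

This only classifies the size; it does not show why the dump fails for such sizes.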

A sample program to reproduce this issue is attached, along with the nvidia-smi output.

I can also reproduce this issue when I set
CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1

CUDA_COREDUMP_SHOW_PROGRESS=1

and then run the sample program's executable outside compute-sanitizer. So the issue appears to be in the coredump generation itself.
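For anyone who wants to try the run outside compute-sanitizer, a minimal sketch follows. The wrapper name `run_with_gpu_coredump` is hypothetical, and `./test` is assumed to be the executable built from the attached sample, as in the command line above.

```shell
# Run a command with driver-side GPU coredump generation enabled.
# Only the two variables named in this post are set; everything else
# (output location, file name) is left at the driver defaults.
run_with_gpu_coredump() {
    env CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 \
        CUDA_COREDUMP_SHOW_PROGRESS=1 \
        "$@"
}

# Usage (not executed here, needs a GPU and the sample binary):
#   run_with_gpu_coredump ./test
```

Using `env` keeps the variables scoped to that one run instead of exporting them into the whole shell session.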

nvidia-smi.txt (2.0 KB)

reproduce_coredump_failure_cu.txt (3.3 KB)

Adding support for CUDA_COREDUMP_GENERATION_FLAGS could be useful here. With this environment variable I could skip dumping global memory.

========= Variable environment CUDA_COREDUMP_GENERATION_FLAGS is not supported by compute-sanitizer, clearing it before target process launch.
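Since compute-sanitizer clears the variable, one thing that might be worth trying is the driver-generated coredump outside compute-sanitizer with global memory skipped. This is an untested sketch: it assumes the `skip_global_memory` value described in the CUDA-GDB GPU coredump documentation is accepted by the installed driver, and the wrapper name `run_skipping_global_memory` is made up for illustration.

```shell
# Run a command with driver-side GPU coredumps enabled, asking the driver
# to leave global memory out of the dump.
# Assumption: the driver honors CUDA_COREDUMP_GENERATION_FLAGS=skip_global_memory.
run_skipping_global_memory() {
    env CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 \
        CUDA_COREDUMP_SHOW_PROGRESS=1 \
        CUDA_COREDUMP_GENERATION_FLAGS=skip_global_memory \
        "$@"
}

# Usage (not executed here, needs a GPU and the sample binary):
#   run_skipping_global_memory ./test
```

If the dump completes this way, that would point at the global memory section as the failing part, at the cost of a coredump without global memory contents.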

Hi, @goutham24693

Is this still an issue for you? Can you clarify your request?

Issue:

The coredump process starts normally (reaches 75-80% completion).

Then fails with “Internal Sanitizer Error: Failed to generate coredump”.

This issue is still present on toolkit version 12.9.

Please look at the attached sample program to reproduce the error. The nvidia-smi output is also attached.

Let me know if you need me to clarify something in particular.

Hi, @goutham24693

I can generate the core dump using your source file + 12.9 sanitizer without any issue.

Here are some suggestions:

  1. I see 2 GPU devices listed in your environment. You can check whether each one has the problem by setting CUDA_VISIBLE_DEVICES.
  2. You can try upgrading to CUDA 13.0 with the 580 driver to see whether this still reproduces.
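The per-device check in the first suggestion could be scripted roughly as follows. This is only a sketch: `run_per_gpu` is a hypothetical helper, and the device indices passed to it are whatever nvidia-smi lists on the machine.

```shell
# Run the given command once per GPU index, exposing only that GPU each time.
# Usage: run_per_gpu "<space-separated indices>" command [args...]
run_per_gpu() {
    gpus=$1
    shift
    for gpu in $gpus; do
        echo "=== CUDA_VISIBLE_DEVICES=$gpu ==="
        # Keep going even if the command fails on one device.
        env CUDA_VISIBLE_DEVICES="$gpu" "$@" || echo "command failed on GPU $gpu"
    done
}
```

For example, `run_per_gpu "0 1" compute-sanitizer --generate-coredump --coredump-name /tmp/gpudump.nvcudmp --tool memcheck ./test` would repeat the original run on each of the two devices, assuming they are indexed 0 and 1. Note that both runs would write to the same coredump file name.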

Hi,

can you please tell me how to resolve this NvEncodeAPICreateInstance creation error? The result is:
NV_ENC_STATUS.ERR_INCOMPATIBLE_CLIENT_KEY
Where can I get this client_key? I am trying to build an NVENC encoder and am testing on a GeForce 1660 Super with the latest NVIDIA drivers.

@ramiz272

Your question seems unrelated to compute-sanitizer. Can you please find the proper forum to ask in?
If it is related to CUDA programming, you can ask there.

Hi,
I tried building and running the executable from the attached file (reproduce_coredump_failure_cu.txt) on a server with a single GPU card. See the attached nvidia-smi output (single_gpu_nvidia_smi.txt).

I can still reproduce the issue. More logs are in the file single_gpu_compute_sanitizer_run.txt.

As of now, we don't have any plans to update the CUDA toolkit, the driver, or the GPU cards on our servers.

It would be great if there were some way to fix this issue. I can reproduce it 100% of the time with a simple null pointer dereference in device code. Let me know if you need any details to debug.

Thanks.

single_gpu_reproduce_coredump_issue.txt (3.4 KB)

single_gpu_nvidia_smi.txt (1.7 KB)

single_gpu_compute_sanitizer_run.txt (28.6 KB)

Hi, @goutham24693

Thanks for the detailed info.
However, we tried internally and still cannot reproduce it. The coredump is generated successfully.
The output is attached.

test-log-R575.txt (31.0 KB)

Hi @veraj,

Thank you for posting the logs from your run. It looks like you are running the program on a Windows machine.
I tried the same on my Windows PC and it did not reproduce.
Can you please try compiling and running the previously attached file on a Linux machine?

The reported issue is only seen on Linux, as indicated by the nvidia-smi results posted in my earlier comments.

Thanks.

Hi, @goutham24693

You are right. We can reproduce it on Linux with CUDA 12.9 and the 575.51.03 driver.
I will involve the dev team to check this.