I’m running the simplest possible CUDA code, which only calls the cudaMemGetInfo function, on an IBM server with 4 GPUs.
When I link the application against the tcmalloc library (-ltcmalloc), it segfaults. If I link the application WITHOUT -ltcmalloc, all goes well.
If I do link with -ltcmalloc but set the CUDA_VISIBLE_DEVICES environment variable (to any value, for example 100), it also runs.

I really have no idea what is happening. Maybe setting CUDA_VISIBLE_DEVICES causes CUDA to load its libraries before the tcmalloc library, and that prevents the crash?

Any ideas?


Any ideas anyone?

The observation about the environment variable is likely a red herring. This is similar to a case in which I could make an app pass or fail by defining a nonsense environment variable FOO=BAR that was not used anywhere. It took me some hours to figure out what was going on.

Hypothesis: The application is reading out-of-bounds (or maybe in-bounds, but uninitialized) memory somewhere. As a consequence, it happens to access memory whose content is influenced by the setting of the environment variable CUDA_VISIBLE_DEVICES, possibly the environment block itself. If the environment variable is defined, the “random” data picked up is benign enough to let the app survive. Otherwise, the data picked up makes the app crash.

Thanks for the prompt answer :)

A red herring was my assumption as well.

Nevertheless the code is just this:

#include <cuda.h>
#include <cuda_runtime_api.h>

int main() {
    size_t avail, total;
    cudaMemGetInfo(&avail, &total);
    return 0;
}

And I’m linking it as follows:
nvcc -L /mypath/ -ltcmalloc

So the question is: what is causing the fault, and how can the CUDA_VISIBLE_DEVICES setting influence such a simple test case?
I was thinking that maybe CUDA_VISIBLE_DEVICES causes the runtime to load before/after the tcmalloc library?

It also fails even if I put a cudaDeviceReset() or cudaFree(0) instead of/before the cudaMemGetInfo call… so it fails even on calls that do not touch my variables at all.

(edit) BTW - using FOO=BAR instead of CUDA_VISIBLE_DEVICES=XX did NOT solve the problem. Could it possibly not be a red herring after all?


Is the segfault reproducible when you run it inside the debugger? If so, you should have the exact instruction and address that causes the segfault and can work backwards from there.

I don’t know anything about libtcmalloc (never used it, never heard of it), so I can’t tell you why it might be causing segfaults, or what interaction (if any) it may have with any CUDA component.


The simple repro test crashes in DEBUG as well. The code just calls cuInit(0) as the first statement in main (when linked against the tcmalloc library). Below is the stack trace; however, I’m not sure I can make anything out of it…

#0 TryPop (rv=, this=0x10090c60) at src/thread_cache.h:220
#1 Allocate (cl=, size=8, this=) at src/thread_cache.h:381
#2 malloc_fast_pathtcmalloc::allocate_full_malloc_oom (size=) at src/
#3 tc_malloc (size=) at src/
#4 0x0000200000329d1c in ?? () from /lib64/
#5 0x0000200000329ea4 in ?? () from /lib64/
#6 0x000020000032a470 in ?? () from /lib64/
#7 0x0000200000379850 in ?? () from /lib64/
#8 0x0000200000322d6c in ?? () from /lib64/
#9 0x0000200000323a78 in ?? () from /lib64/
#10 0x000020000020b484 in ?? () from /lib64/
#11 0x0000200000393120 in cuInit () from /lib64/
#12 0x000000001000107c in main (argc=2, argv=0x7fffffffe748) at

Just as an FYI: this has been confirmed as a bug in NVIDIA’s driver when used with tcmalloc on IBM platforms.


Thanks for closing the loop. Yeah, sometimes when weird things happen it is actually due to a bug in NVIDIA’s software :-)

Hi eyalhir74, did this bug ever get patched to your knowledge? If so, do you know in which driver version? I’m experiencing similar weirdness with IBM + TCmalloc + simple CUDA calls failing.


The issue appears to have been fixed in 418.40.04 and subsequent drivers for the P9 platform.

If you are installing CUDA, 10.1U1 and newer should have fixed drivers bundled in their runfile installers.

Otherwise, the latest Tesla V100 driver for ppc64le should also have the fix.

Thanks for the quick reply. I’m on 418.67, CUDA 10.1, on P9, so maybe this isn’t my issue.

In case it jogs any ideas, I’ll mention that the main error I’m seeing is “CUDA Runtime API error 2 on device 4: out of memory” on the very first CUDA call in the entire code, which is the seemingly innocuous cudaGetDeviceCount(). The GPUs are clean and empty, as confirmed with nvidia-smi (and there are no other users on the node). Unfortunately, the problem is very hard to reproduce reliably. It’s entirely possible, even likely, that it’s a bug on my end and not CUDA’s, but I thought I’d ask in case someone has seen this before.

The other way I’ve seen this bug present is with a TCmalloc “Attempt to free invalid pointer” error from cudaDeviceSetCacheConfig(). That’s what led me here.