Interaction with standard C heap varies

Related to my previous post on bugs in initializing globals, I note a difference which seems like a regression to me, but maybe it is more a matter of being extremely lucky on some platforms and not on others.

On CentOS 7 (bare metal) and with the nvhpc 21.9 Ubuntu 20.04 container, I can build a separate library using g++ and normal linking. This library allocates memory from the system heap. As far as I can tell, it never asks the main calling process to free memory it allocated, nor does it try to free memory allocated by the main binary. In any case, linking to it and using it “just works”. The memory it touches never enters the GPU hot path.

On CentOS 8, on the other hand, linking to this library and calling some trivial functions in it causes heap corruption, and the behavior looks like the library is freeing memory with glibc’s free that was originally allocated from the managed heap. Again, I don’t see where that allocation would take place, but it seems like it does.

I’ve verified with AddressSanitizer, using a pure g++ build (of both the main code and the library), that there are no obvious heap corruption issues that would otherwise just have remained hidden.

One observation is that symbols like “free” are provided by glibc on CentOS 8 when I run the binary, but by ld-linux-x86-64.so.2 when I run in the Ubuntu container.
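
One way I can double-check this at runtime is with a small helper along these lines (just a sketch; the file name and build line are made up, and it assumes glibc and a g++ build):

```cpp
// which_free.cpp -- ask the dynamic linker which object provides the 'free'
// this process actually resolves. Build (for example):
//   g++ which_free.cpp -o which_free -ldl
#include <dlfcn.h>   // dladdr, Dl_info (GNU extension; g++ defines _GNU_SOURCE)
#include <stdlib.h>  // free
#include <cstdio>

int main() {
    Dl_info info{};
    // Take the address of free as resolved in this process and ask where it lives.
    if (dladdr(reinterpret_cast<void*>(&free), &info) && info.dli_fname) {
        std::printf("free resolves to: %s (symbol: %s)\n",
                    info.dli_fname,
                    info.dli_sname ? info.dli_sname : "?");
    } else {
        std::printf("dladdr could not resolve free\n");
    }
    return 0;
}
```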

The library is statically linked, so I could imagine that the Nvidia linker, when managed memory is enabled, redirects some calls, or even manages to do so for shared libraries, similar to how you can swap in tcmalloc and other drop-in malloc replacements with LD_PRELOAD. If this is what’s going on, it seems like some difference on CentOS 8, e.g. a symbol naming change, makes it break, and break badly.
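
To be clear about the kind of interposition I have in mind, here is a hypothetical sketch of an LD_PRELOAD-style drop-in (not something taken from the NVIDIA toolchain; the file name and build/run lines are only illustrative):

```cpp
// interpose_free.cpp -- classic LD_PRELOAD symbol interposition of free().
// Build (for example): g++ -shared -fPIC interpose_free.cpp -o libinterpose.so -ldl
// Run   (for example): LD_PRELOAD=./libinterpose.so ./main_binary
#include <dlfcn.h>   // dlsym, RTLD_NEXT (g++ defines _GNU_SOURCE, which exposes RTLD_NEXT)
#include <unistd.h>  // write

// Our free() shadows glibc's; log, then forward to the next definition
// in the lookup order (normally glibc's real free).
extern "C" void free(void* ptr) {
    static void (*real_free)(void*) =
        reinterpret_cast<void (*)(void*)>(dlsym(RTLD_NEXT, "free"));
    static const char msg[] = "interposed free() called\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);  // raw write(2): no heap use inside free
    if (real_free) real_free(ptr);
}
```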

I should note that I never even rebuilt the static library when trying the Ubuntu container. That is, a static lib built on CentOS 8 links fine and gives no heap corruption crashes when I build and run the main binary linking to it within an Ubuntu 20.04 NVHPC SDK container.

So, two questions:

  1. Am I just lucky that this works on CentOS 7 and Ubuntu, or is it expected that all heap allocations will be redirected?
  2. Are you aware of some regression on CentOS/RHEL 8?

Hi Carl,

You note that you’re building the library with g++, but you also talk about the memory not being touched by the GPU hot path. Given that in your last post you were using nvc++ with OpenMP target offload, did you mean to say that you’re building the library with nvc++?

  1. Am I just lucky that this works on CentOS 7 and Ubuntu, or is it expected that all heap allocations will be redirected?

With managed memory, there’s no runtime redirection of allocation calls. Rather, during compilation, allocators such as “new” or “malloc” are replaced with calls to cudaMallocManaged.
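
Conceptually, the effect for a source file compiled with managed memory enabled is roughly the following sketch (only an illustration of the effect, not the compiler’s literal output; the identifiers are mine):

```cpp
// What a plain heap allocation in a managed-memory translation unit
// effectively turns into. The sketch itself needs a CUDA-aware compiler
// (e.g. nvc++ -cuda).
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;

    // As written in the source:
    //   double* a = static_cast<double*>(malloc(n * sizeof(double)));
    // After the compile-time replacement, roughly:
    double* a = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(double)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    a[0] = 42.0;  // managed memory is directly usable from host code

    // The matching free(a) is redirected as well; handing this pointer to
    // glibc's free instead (e.g. inside a g++-built library) would corrupt
    // the heap, which is consistent with the symptom you describe.
    cudaFree(a);
    return 0;
}
```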

  2. Are you aware of some regression on CentOS/RHEL 8?

No, I’m not aware of anything.

Could the problem be a mismatch between the CUDA version the library was built with and the version installed on CentOS 8?

-Mat

My main binary is built using nvc++. When I first developed this code on CentOS 7, I just happily kept linking to the same static library I had been using, built with g++. It “just worked”, and I didn’t give it much thought.

When running on CentOS 8, I get the global initialization errors I mentioned in another thread if I build the library using nvc++, with the very same version of the NVHPC SDK that I use for the main binary. Other, simple binaries built with that setup using both CUDA and OpenMP work fine.

If I instead build the library using g++ on CentOS 8, I get the heap corruption issue.

If I keep using that very same version of the static library .a file, but build the binary within an Ubuntu or CentOS 7 container running on the CentOS 8 machine, it works fine.

To be specific, is the replacement of calls to malloc (etc.) a link-time or a compile-time thing, or both? Since the library is static, one explanation for why it works would be that any explicit dynamic memory allocations are rerouted simply because the final link is performed by a call to nvc++ with the proper flags.

In a sense, I’m just as surprised that it “just seems to work” on CentOS 7 and Ubuntu 20.04 (bare metal and containerized userspace) as I am that it stops working on CentOS 8. And, again, this even happens when I keep the exact .a file built using the CentOS 8 gcc toolchain and only compile and link the main binary using the container’s NVHPC SDK. Trying to run a CentOS 8 main binary in the other distros causes an error due to glibc version mismatches.

The replacement of the memory allocators with cudaMallocManaged occurs at compile time and only for the particular source file being compiled. At link time, adding the “-gpu=managed” flag only sets the initialization of the OpenMP or OpenACC runtime to check for managed memory when entering a data region or offload region. It does not reroute non-managed allocators to use managed memory.
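
To make the per-source boundary concrete, here’s a hypothetical two-file sketch (file names, function names, and build lines are illustrative, not taken from your project):

```cpp
// lib.cpp  -- compiled with:  g++ -c lib.cpp
//             (malloc/free here stay the glibc ones)
// main.cpp -- compiled with:  nvc++ -gpu=managed -c main.cpp
//             (malloc/new here are replaced with the managed pool allocator)
// Link     -- nvc++ -gpu=managed main.o lib.o

// ---------------- lib.cpp ----------------
#include <cstdlib>

// Safe: allocation and free both happen inside the g++-compiled unit,
// so both resolve to the glibc heap.
void lib_local_work() {
    void* p = std::malloc(1024);
    std::free(p);
}

// Hazardous: frees a pointer it did not allocate. If the caller's malloc was
// replaced, this is a glibc free on a managed-pool allocation.
void lib_release(void* p) {
    std::free(p);
}

// ---------------- main.cpp ----------------
#include <cstdlib>

void lib_local_work();
void lib_release(void* p);

int main() {
    lib_local_work();            // fine: ownership never crosses the boundary
    void* q = std::malloc(256);  // in this unit, redirected to the managed pool
    lib_release(q);              // mismatch: managed allocation, glibc free
    return 0;
}
```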

I should be a bit more specific: the actual replacement is with a managed memory pool allocator, which is used to reduce the overhead cost of cudaMallocManaged. Full details: HPC Compilers User's Guide Version 23.11 for ARM, OpenPower, x86