Potential performance issue with NVSHMEM-allocated memory

Hello,

We seem to have found a performance issue when using NVSHMEM-allocated memory (e.g. from nvshmem_malloc) for memory load operations, such as loading data from global memory to shared memory (say, 4096 x 12288).
The issue appears to be related to the cp.async PTX instructions:
" cp.async.cg.shared.global.L2::128B [%1], [%2], %3;\n" and/or (we actually tested both)
" cp.async.cg.shared.global [%1], [%2], %3;\n"

%3 is 16 for both instructions.
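
For context, here is a minimal sketch of how we issue such a copy from CUDA (illustrative, not our exact kernel; assumes sm_80 or newer):

// Illustrative 16-byte async copy from global to shared memory,
// matching the first PTX instruction above (%3 == 16).
// (Elsewhere we issue cp.async.commit_group / cp.async.wait_group
// to wait for completion.)
__device__ __forceinline__ void cp_async_16B(void *smem_dst, const void *gmem_src) {
    unsigned smem_addr = (unsigned)__cvta_generic_to_shared(smem_dst);
    asm volatile("cp.async.cg.shared.global.L2::128B [%0], [%1], %2;\n"
                 :: "r"(smem_addr), "l"(gmem_src), "n"(16));
}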

So if we use a buffer allocated by nvshmem_malloc vs. a normal torch-allocated one, we observe a ~1 ms difference: with the nvshmem_malloc'ed buffer, the code takes ~1 ms more to finish. The same issue is observed even with the nvshmem_align API.

After some digging, we observed that the difference goes away if we comment out the cp.async part of the code. In other words, if we don't use the above PTX instructions, switching between nvshmem_malloc memory and normal memory makes no difference (the program still does the work and takes time).

Is there anything different about nvshmem_malloc memory that could cause the cp.async PTX instructions to behave differently? Can anyone please help with this?

Many Thanks.

Thanks for reaching out. What GPU are you testing with?

One possible issue could be that the buffer alignment is coming out differently between the two allocators that you used. Can you check that the buffer is 16B aligned in both cases?
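
For a quick check, an assertion like this would do (a sketch; `buffer` stands for the pointer returned by the allocator):

#include <stdint.h>
#include <assert.h>
// Sketch: verify the buffer pointer is 16-byte aligned.
assert(((uintptr_t)buffer & 0xF) == 0);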

Thanks for your reply. My testing is on A100 GPUs.

One possible issue could be that the buffer alignment is coming out differently between the two allocators that you used.

As I mentioned, I have also tested memory allocated via nvshmem_align. It still has the extra overhead, unfortunately.

Can you check that the buffer is 16B aligned in both cases?

Yes. Otherwise, the PTX instruction would complain about a memory alignment issue.

Could you please try setting the environment variable NVSHMEM_DISABLE_CUDA_VMM=1? This disables automatic symmetric heap sizing, so you may also need to increase NVSHMEM_SYMMETRIC_SIZE. More information on these env vars is available here: Environment Variables — NVSHMEM 2.10.1 documentation
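
If it helps, these can be set either in the launch environment or programmatically before nvshmem_init() reads them (a sketch; the 4G value is just a placeholder, pick a size at least as large as your total NVSHMEM allocations):

#include <stdlib.h>
// Sketch: configure the NVSHMEM environment before init.
setenv("NVSHMEM_DISABLE_CUDA_VMM", "1", 1);  // disable the VMM-backed dynamic heap
setenv("NVSHMEM_SYMMETRIC_SIZE", "4G", 1);   // heap must now be sized explicitly
nvshmem_init();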

Thanks for your kind reply! I will test it later and get back to you. Thank you!

Sorry for the late reply. I have verified that with the dynamic heap disabled via NVSHMEM_DISABLE_CUDA_VMM=1, the runtime delay disappears.

I also searched the code a bit, and it says: "Disabling VMM usage (dynamic heap) by setting NVSHMEM_DISABLE_CUDA_VMM=1 could provide better performance."

I am wondering: are there any other side effects of disabling the dynamic heap by default?

Thank you!

BTW, I found another potential issue with an NVSHMEM-allocated buffer:
If I allocate a buffer using nvshmem_malloc, say a buffer of 64 ints, with ptr as the returned pointer,
and then pass that ptr to the cuStreamWriteValue API,

compilation is fine, but when I run it
I get "Got bad cuda status: invalid argument" if I don't export NVSHMEM_DISABLE_CUDA_VMM=1.
It works if I do export that env var.

Is that expected behavior? Or did I just forget something when using nvshmem_malloc?

Thanks!

Regarding your first question:
Disabling CUDA VMM should have no impact other than the fact that you need to set the NVSHMEM_SYMMETRIC_SIZE env var. Using CUDA VMM allows us to dynamically expand the NVSHMEM symmetric heap and hence eliminates the need to set NVSHMEM_SYMMETRIC_SIZE.

Regarding cuStreamWriteValue usage, we are not aware of any issue. However, I see that certain flags for this API are not supported on certain platforms. Can you please check for that?
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gf16864e8693d888f8178067470001b21

Thanks for your reply!
Regarding the NVSHMEM_SYMMETRIC_SIZE env var, could you please explain this sentence from the doc:
"The resulting size is implementation-defined and must be at least as large as the integer ceiling of the product of the numeric prefix and the scaling factor."
I am not quite sure I understand what the prefix and scaling factor mean here. (Maybe an example would be great.)

As for cuStreamWriteValue usage: yes, I am aware of that, and we use only the supported flags. Notably, using shared memory obtained via an IPC handle works.

Thanks.

The format allowed for the symmetric heap size is a numeric prefix followed by a scaling suffix. For example, one can set NVSHMEM_SYMMETRIC_SIZE to 1.1G or 1.2KB, etc. NVSHMEM expands it to an integer number of bytes, rounding up.
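
For instance (assuming the usual convention that G denotes 2^30): setting NVSHMEM_SYMMETRIC_SIZE=1.1G yields a heap of at least ceil(1.1 x 2^30) = ceil(1181116006.4) = 1181116007 bytes.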

Hello,
Sorry to bother you, but I have a follow-up question regarding the NVSHMEM_DISABLE_CUDA_VMM=1 option:
When I set this env var, I hit this issue:
"non-zero status: 1 cudaIpcOpenMemHandle failed with error 1", at nvshmem/src/host/transport/p2p/p2p.cpp:251,
if I initialize a 2-node environment (each node with 4 GPUs) with torch, but it is fine in a 1-node environment with 8 GPUs.

Note that it fails at runtime, and I use CUDA_VISIBLE_DEVICES to control each node's view of the GPUs: either 1 node with 8 GPUs or 2 nodes with 4 GPUs each.

Is this expected? Or maybe I missed something?

Thanks.

When I set this env var, I hit this issue:
"non-zero status: 1 cudaIpcOpenMemHandle failed with error 1", at nvshmem/src/host/transport/p2p/p2p.cpp:251,
if I initialize a 2-node environment (each node with 4 GPUs) with torch, but it is fine in a 1-node environment with 8 GPUs. Note that it fails at runtime, and I use CUDA_VISIBLE_DEVICES to control each node's view of the GPUs: either 1 node with 8 GPUs or 2 nodes with 4 GPUs each.

Can you please share the run/launch command, along with the runtime environment variables that you use to launch these processes on the same and on different nodes? From the log message, it sounds like you are trying to use the P2P transport to share memory between GPUs on different nodes, which is not supported on A100-based platforms.

Hello,
Sorry for the late reply. I had to context-switch to other work.

To summarize the problems that we hit:

  1. If we set NVSHMEM_DISABLE_CUDA_VMM=1, the pointer obtained via nvshmem_ptr() from memory allocated by nvshmem_malloc is treated as an invalid argument by the driver APIs cuStreamWaitValue32/_v2, and the call crashes there (but cuStreamWriteValue32 did not crash).
  2. If NVSHMEM_DISABLE_CUDA_VMM is not set, the pointer from nvshmem_ptr() is not treated as a valid argument for cuStreamWriteValue32/_v2, and the call crashes (cuStreamWaitValue32 did not crash).

These two cases prevent me from using NVSHMEM-allocated memory in my test.
So I am wondering: is there any known issue behind the above observations?

Thanks.

Thanks for sharing the insight @WenleiBao. If possible, please share example code of your implementation so we can better understand the details, along with your build/runtime environment.

For reference, this is example code with 2 PEs that I wrote to test the hypothesis on an A100 GPU with CUDA 11.8 and the 520.61.05 CUDA driver.

#define TEST_CUCHECK(fn)                                                              \
    do {                                                                              \
        CUresult err = fn;                                                            \
        if (err != CUDA_SUCCESS) {                                                    \
            const char *errStr;                                                       \
            cuGetErrorString(err, &errStr);                                           \
            fprintf(stderr, "%s:%d Cuda failure '%s'\n", __FILE__, __LINE__, errStr); \
            return -1;                                                                \
        }                                                                             \
    } while (false)

....
nvshmem_init();
// select a CUDA device
// assuming size is 1 byte
char *buffer = (char *)nvshmem_malloc(size);
if (!buffer) {
    fprintf(stderr, "nvshmem_malloc failed \n");
    // goto error path
}

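// Write 1 to the buffer on the default stream, then make the same
// stream wait until the location reads >= 1 (flags = 0 means GEQ).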
TEST_CUCHECK(cuStreamWriteValue32(0, (CUdeviceptr)buffer, 1, 0));
TEST_CUCHECK(cuStreamWaitValue32(0, (CUdeviceptr)buffer, 1, 0));
nvshmem_free(buffer);
nvshmem_finalize();

  1. If we set NVSHMEM_DISABLE_CUDA_VMM=1, the pointer obtained via nvshmem_ptr() from memory allocated by nvshmem_malloc is treated as an invalid argument by the driver APIs cuStreamWaitValue32/_v2, and the call crashes there (but cuStreamWriteValue32 did not crash).

Can you elaborate, with an example, on how you are using nvshmem_ptr in the above? The example that I shared above works as expected when CUDA VMM is disabled. Can you also share what error string is reported by cuGetErrorString after invoking cuStreamWriteValue32?

Thanks for your reply @arnavg !
Yes. If your example has two PEs, then after nvshmem_malloc
you can get the buffer pointer for the other PE, right?
Like: char *other_buffer = (char *) nvshmem_ptr(buffer, 1);
Then calling cuStreamWaitValue32(0, (CUdeviceptr)other_buffer, 1, 0) should show the problem.
cuGetErrorString reports: “Got bad cuda status: invalid argument at line: xxx”
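
To make the repro concrete, the failing pattern looks roughly like this (a sketch on top of your example; assumes 2 PEs and reuses TEST_CUCHECK and buffer from above):

// Sketch: on PE 0, obtain the P2P-mapped address of PE 1's copy of the
// symmetric buffer and make the default stream wait on it.
char *other_buffer = (char *)nvshmem_ptr(buffer, 1);
if (other_buffer != NULL && nvshmem_my_pe() == 0) {
    // With CUDA VMM enabled, this call fails for us with "invalid argument".
    TEST_CUCHECK(cuStreamWaitValue32(0, (CUdeviceptr)other_buffer, 1, 0));
}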

From nvshmem_ptr:
https://docs.nvidia.com/nvshmem/api/gen/api/setup.html?highlight=nvshmem_ptr#c.nvshmem_ptr
it says: "nvshmem_ptr returns an address that may be used to directly reference dest on the specified PE. This address can be assigned to a pointer. After that, ordinary loads and stores to dest may be performed." and also: "The address returned by nvshmem_ptr is a local address to a remotely accessible data object."

But for cuStreamWaitValue32/_v2, the docs say:
"If the memory was registered via ::cuMemHostRegister(), the device pointer should be obtained with ::cuMemHostGetDevicePointer(). This function cannot be used with managed memory (::cuMemAllocManaged)."

I am not sure whether this applies to our case, based on the nvshmem_ptr description. However, if I use an IPC handle to get the remote PE's pointer and pass that to cuStreamWaitValue32/_v2, it works.
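
For reference, the IPC path that works for us looks roughly like this (a sketch; the buffer is assumed to be cudaMalloc'ed here, and the handle exchange between processes is elided):

// On the owning process: export an IPC handle for the device buffer.
cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, dev_ptr);
// ... send `handle` to the peer process (e.g. via MPI or a socket) ...

// On the peer process: map the buffer, then wait on it.
void *mapped_ptr;
cudaIpcOpenMemHandle(&mapped_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
cuStreamWaitValue32(0, (CUdeviceptr)mapped_ptr, 1, 0);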

Thanks.

Yes. If your example has two PEs, then after nvshmem_malloc
you can get the buffer pointer for the other PE, right?
Like: char *other_buffer = (char *) nvshmem_ptr(buffer, 1);
Then calling cuStreamWaitValue32(0, (CUdeviceptr)other_buffer, 1, 0) should show the problem.
cuGetErrorString reports: “Got bad cuda status: invalid argument at line: xxx”

Thanks for sharing the details. We have been able to reproduce the same issue internally and confirmed that it is an existing limitation of CUDA toolkit 12.3 when CUDA VMM is used. This will be addressed in a future release. You can work around the issue by disabling CUDA VMM in the NVSHMEM symmetric heap using NVSHMEM_DISABLE_CUDA_VMM=1.

But for cuStreamWaitValue32/_v2, the docs say:
"If the memory was registered via ::cuMemHostRegister(), the device pointer should be obtained with ::cuMemHostGetDevicePointer(). This function cannot be used with managed memory (::cuMemAllocManaged)."

I am not sure whether this applies to our case, based on the nvshmem_ptr description.

This doesn’t apply to the nvshmem_ptr case, as the NVSHMEM symmetric heap is resident on the device by default and is not of the managed memory kind.

I saw that a new version was released; does that fix this issue?
Thanks

Unfortunately no - the fix will be in a future CUDA toolkit version, later this year.

Good to know. Thanks.

This will be in the driver that ships with CUDA 12.5 release.