Potential performance issue with NVSHMEM-allocated memory

Hello,

We seem to have found a performance issue when using NVSHMEM-allocated memory (e.g. from nvshmem_malloc) for memory load operations, such as loading data from global memory to shared memory (say, 4096 x 12288).
The issue appears to be related to the cp.async PTX instructions:
" cp.async.cg.shared.global.L2::128B [%1], [%2], %3;\n" and/or (we actually tested both)
" cp.async.cg.shared.global [%1], [%2], %3;\n"

%3 is 16 for both instructions.
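
For context, here is a minimal sketch of how we issue such a copy from CUDA (illustrative, not our exact kernel; assumes sm_80 or newer):

// Illustrative 16-byte async copy from global to shared memory,
// matching the first PTX instruction above (%3 == 16).
// (Elsewhere we issue cp.async.commit_group / cp.async.wait_group
// to wait for completion.)
__device__ __forceinline__ void cp_async_16B(void *smem_dst, const void *gmem_src) {
    unsigned smem_addr = (unsigned)__cvta_generic_to_shared(smem_dst);
    asm volatile("cp.async.cg.shared.global.L2::128B [%0], [%1], %2;\n"
                 :: "r"(smem_addr), "l"(gmem_src), "n"(16));
}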

So if we use a buffer allocated by nvshmem_malloc vs. a normal torch-allocated one, we observe a ~1 ms difference: with the nvshmem_malloc'ed buffer, the code takes ~1 ms more to finish. The same issue is observed even with the nvshmem_align API.

After some digging, we observed that the difference goes away if we comment out the cp.async part of the code. In other words, if we don't use the above PTX instructions, switching between nvshmem_malloc memory and normal memory makes no difference (the program still does the work and takes time).

Is there anything different about nvshmem_malloc memory that could cause the cp.async PTX instructions to behave differently? Can anyone please help with this?

Many Thanks.

Thanks for reaching out. What GPU are you testing with?

One possible issue could be that the buffer alignment is coming out differently between the two allocators that you used. Can you check that the buffer is 16B aligned in both cases?
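
For a quick check, an assertion like this would do (a sketch; `buffer` stands for the pointer returned by the allocator):

#include <stdint.h>
#include <assert.h>
// Sketch: verify the buffer pointer is 16-byte aligned.
assert(((uintptr_t)buffer & 0xF) == 0);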

Thanks for your reply. My testing is on A100 GPUs.

One possible issue could be that the buffer alignment is coming out differently between the two allocators that you used.

As I mentioned, I have also tested memory allocated via nvshmem_align. It still has the extra overhead, unfortunately.

Can you check that the buffer is 16B aligned in both cases?

Yes. Otherwise, the PTX instruction would complain about a memory alignment issue.

Could you please try setting the environment variable NVSHMEM_DISABLE_CUDA_VMM=1? This disables automatic symmetric heap sizing, so you may also need to increase NVSHMEM_SYMMETRIC_SIZE. More information on these env vars is available here: Environment Variables — NVSHMEM 2.10.1 documentation
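
If it helps, these can be set either in the launch environment or programmatically before nvshmem_init() reads them (a sketch; the 4G value is just a placeholder, pick a size at least as large as your total NVSHMEM allocations):

#include <stdlib.h>
// Sketch: configure the NVSHMEM environment before init.
setenv("NVSHMEM_DISABLE_CUDA_VMM", "1", 1);  // disable the VMM-backed dynamic heap
setenv("NVSHMEM_SYMMETRIC_SIZE", "4G", 1);   // heap must now be sized explicitly
nvshmem_init();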

Thanks for your kind reply! I will test it later and get back to you. Thank you!

Sorry for the late reply. I have verified that with the dynamic heap disabled via NVSHMEM_DISABLE_CUDA_VMM=1, the runtime delay disappears.

I also searched the code a bit, and it says: "Disabling VMM usage (dynamic heap) by setting NVSHMEM_DISABLE_CUDA_VMM=1 could provide better performance."

I am wondering: are there any other side effects of disabling the dynamic heap by default?

Thank you!

BTW, I found another potential issue with an NVSHMEM-allocated buffer:
If I allocate a buffer using nvshmem_malloc, say a buffer of 64 ints, with ptr as the returned pointer,
and then pass that ptr to the cuStreamWriteValue API,

compilation is fine, but when I run it
I get "Got bad cuda status: invalid argument" if I don't export NVSHMEM_DISABLE_CUDA_VMM=1.
It works if I do export that env var.

Is that expected behavior? Or did I just forget something when using nvshmem_malloc?

Thanks!

Regarding your first question:
Disabling CUDA VMM should have no impact other than the fact that you need to set the NVSHMEM_SYMMETRIC_SIZE env var. Using CUDA VMM allows us to dynamically expand the NVSHMEM symmetric heap and hence eliminates the need to set NVSHMEM_SYMMETRIC_SIZE.

Regarding cuStreamWriteValue usage, we are not aware of any issue. However, I see that certain flags for this API are not supported on certain platforms. Can you please check for that?
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gf16864e8693d888f8178067470001b21

Thanks for your reply!
Regarding the NVSHMEM_SYMMETRIC_SIZE env var, could you please explain this sentence from the doc:
"The resulting size is implementation-defined and must be at least as large as the integer ceiling of the product of the numeric prefix and the scaling factor."
I am not quite sure I understand what the prefix and scaling factor mean here. (Maybe an example would be great.)

As for cuStreamWriteValue usage: yes, I am aware of that, and we use only the supported flags. Notably, using shared memory obtained via an IPC handle works.

Thanks.

The format allowed for the symmetric heap size is a numeric prefix followed by a scaling suffix. For example, one can set NVSHMEM_SYMMETRIC_SIZE to 1.1G or 1.2KB, etc. NVSHMEM expands it to an integer number of bytes, rounding up.
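
For instance (assuming the usual convention that G denotes 2^30): setting NVSHMEM_SYMMETRIC_SIZE=1.1G yields a heap of at least ceil(1.1 x 2^30) = ceil(1181116006.4) = 1181116007 bytes.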

Hello,
Sorry to bother you, but I have a follow-up question regarding the NVSHMEM_DISABLE_CUDA_VMM=1 option:
When I set this env var, I hit this issue:
"non-zero status: 1 cudaIpcOpenMemHandle failed with error 1", at nvshmem/src/host/transport/p2p/p2p.cpp:251,
if I initialize a 2-node environment (each node with 4 GPUs) with torch, but it is fine in a 1-node environment with 8 GPUs.

Note that it fails at runtime, and I use CUDA_VISIBLE_DEVICES to control each node's view of the GPUs: either 1 node with 8 GPUs or 2 nodes with 4 GPUs each.

Is this expected? Or maybe I missed something?

Thanks.

When I set this env var, I hit this issue:
"non-zero status: 1 cudaIpcOpenMemHandle failed with error 1", at nvshmem/src/host/transport/p2p/p2p.cpp:251,
if I initialize a 2-node environment (each node with 4 GPUs) with torch, but it is fine in a 1-node environment with 8 GPUs. Note that it fails at runtime, and I use CUDA_VISIBLE_DEVICES to control each node's view of the GPUs: either 1 node with 8 GPUs or 2 nodes with 4 GPUs each.

Can you please share the run/launch command, along with the runtime environment variables that you use to launch these processes on the same and on different nodes? From the log message, it sounds like you are trying to use the P2P transport to share memory between GPUs on different nodes, which is not supported on A100-based platforms.

Hello,
Sorry for the late reply. I had to context-switch to other work.

To summarize the problems that we hit:

  1. If we set NVSHMEM_DISABLE_CUDA_VMM=1, the pointer obtained via nvshmem_ptr() from memory allocated by nvshmem_malloc is treated as an invalid argument by the driver APIs cuStreamWaitValue32/_v2, and the call crashes there (but cuStreamWriteValue32 did not crash).
  2. If NVSHMEM_DISABLE_CUDA_VMM is not set, the pointer from nvshmem_ptr() is not treated as a valid argument for cuStreamWriteValue32/_v2, and the call crashes (cuStreamWaitValue32 did not crash).

These two cases prevent me from using NVSHMEM-allocated memory in my test.
So I am wondering: is there any known issue behind the above observations?

Thanks.

Thanks for sharing the insight @WenleiBao. If possible, please share example code of your implementation so we can better understand the details, along with your build/runtime environment.

For reference, this is example code with 2 PEs that I wrote to test the hypothesis on an A100 GPU with CUDA 11.8 and the 520.61.05 CUDA driver.

#define TEST_CUCHECK(fn)                                                              \
    do {                                                                              \
        CUresult err = fn;                                                            \
        if (err != CUDA_SUCCESS) {                                                    \
            const char *errStr;                                                       \
            cuGetErrorString(err, &errStr);                                           \
            fprintf(stderr, "%s:%d Cuda failure '%s'\n", __FILE__, __LINE__, errStr); \
            return -1;                                                                \
        }                                                                             \
    } while (false)

....
nvshmem_init();
// select a CUDA device
// assuming size is 1 byte
char *buffer = (char *)nvshmem_malloc(size);
if (!buffer) {
    fprintf(stderr, "nvshmem_malloc failed \n");
    // goto error path
}

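// Write 1 to the buffer on the default stream, then make the same
// stream wait until the location reads >= 1 (flags = 0 means GEQ).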
TEST_CUCHECK(cuStreamWriteValue32(0, (CUdeviceptr)buffer, 1, 0));
TEST_CUCHECK(cuStreamWaitValue32(0, (CUdeviceptr)buffer, 1, 0));
nvshmem_free(buffer);
nvshmem_finalize();

  1. If we set NVSHMEM_DISABLE_CUDA_VMM=1, the pointer obtained via nvshmem_ptr() from memory allocated by nvshmem_malloc is treated as an invalid argument by the driver APIs cuStreamWaitValue32/_v2, and the call crashes there (but cuStreamWriteValue32 did not crash).

Can you elaborate, with an example, on how you are using nvshmem_ptr in the above? The example that I shared above works as expected when CUDA VMM is disabled. Can you also share what error string is reported by cuGetErrorString after invoking cuStreamWriteValue32?

Thanks for your reply @arnavg !
Yes. If your example has two PEs, then after nvshmem_malloc
you can get the buffer pointer for the other PE, right?
Like: char *other_buffer = (char *) nvshmem_ptr(buffer, 1);
Then calling cuStreamWaitValue32(0, (CUdeviceptr)other_buffer, 1, 0) should show the problem.
cuGetErrorString reports: “Got bad cuda status: invalid argument at line: xxx”
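
To make the repro concrete, the failing pattern looks roughly like this (a sketch on top of your example; assumes 2 PEs and reuses TEST_CUCHECK and buffer from above):

// Sketch: on PE 0, obtain the P2P-mapped address of PE 1's copy of the
// symmetric buffer and make the default stream wait on it.
char *other_buffer = (char *)nvshmem_ptr(buffer, 1);
if (other_buffer != NULL && nvshmem_my_pe() == 0) {
    // With CUDA VMM enabled, this call fails for us with "invalid argument".
    TEST_CUCHECK(cuStreamWaitValue32(0, (CUdeviceptr)other_buffer, 1, 0));
}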

From nvshmem_ptr:
https://docs.nvidia.com/nvshmem/api/gen/api/setup.html?highlight=nvshmem_ptr#c.nvshmem_ptr
it says: "nvshmem_ptr returns an address that may be used to directly reference dest on the specified PE. This address can be assigned to a pointer. After that, ordinary loads and stores to dest may be performed." and also: "The address returned by nvshmem_ptr is a local address to a remotely accessible data object."

But for cuStreamWaitValue32/_v2, the docs say:
"If the memory was registered via ::cuMemHostRegister(), the device pointer should be obtained with ::cuMemHostGetDevicePointer(). This function cannot be used with managed memory (::cuMemAllocManaged)."

I am not sure whether this applies to our case, based on the nvshmem_ptr description. However, if I use an IPC handle to get the remote PE's pointer and pass that to cuStreamWaitValue32/_v2, it works.
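
For reference, the IPC path that works for us looks roughly like this (a sketch; the buffer is assumed to be cudaMalloc'ed here, and the handle exchange between processes is elided):

// On the owning process: export an IPC handle for the device buffer.
cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, dev_ptr);
// ... send `handle` to the peer process (e.g. via MPI or a socket) ...

// On the peer process: map the buffer, then wait on it.
void *mapped_ptr;
cudaIpcOpenMemHandle(&mapped_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
cuStreamWaitValue32(0, (CUdeviceptr)mapped_ptr, 1, 0);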

Thanks.

Yes. If your example has two PEs, then after nvshmem_malloc
you can get the buffer pointer for the other PE, right?
Like: char *other_buffer = (char *) nvshmem_ptr(buffer, 1);
Then calling cuStreamWaitValue32(0, (CUdeviceptr)other_buffer, 1, 0) should show the problem.
cuGetErrorString reports: “Got bad cuda status: invalid argument at line: xxx”

Thanks for sharing the details. We have been able to reproduce the same issue internally and confirmed that it is an existing limitation of CUDA toolkit 12.3 when CUDA VMM is used. This will be addressed in a future release. You can work around the issue by disabling CUDA VMM in the NVSHMEM symmetric heap using NVSHMEM_DISABLE_CUDA_VMM=1.

But for cuStreamWaitValue32/_v2, the docs say:
"If the memory was registered via ::cuMemHostRegister(), the device pointer should be obtained with ::cuMemHostGetDevicePointer(). This function cannot be used with managed memory (::cuMemAllocManaged)."

I am not sure whether this applies to our case, based on the nvshmem_ptr description.

This doesn’t apply to the nvshmem_ptr case, as the NVSHMEM symmetric heap is resident on the device by default and is not of the managed memory kind.

I saw that a new version was released; does that fix this issue?
Thanks

Unfortunately no - the fix will be in a future CUDA toolkit version, later this year.

Good to know. Thanks.

This will be in the driver that ships with CUDA 12.5 release.