NVSHMEM: support team-level nvshmem_malloc

Description

Although NVSHMEM now supports teams, nvshmem_malloc still cannot allocate memory at the team level: it is a collective call that every PE in the global world must make. This significantly limits the usefulness of the team abstraction, because PEs outside a team must still take part in the allocation and reserve symmetric memory they never use. Is there a plan, and a rough timeline, for supporting nvshmem_malloc with team-level granularity?
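
To make the cost concrete, here is a minimal sketch of the arithmetic. It is plain Python, not NVSHMEM code. It assumes a team described by the same (start, stride, size) triple that nvshmem_team_split_strided takes. The function names, the 8-PE job, and the 1 MiB buffer size are hypothetical values chosen for illustration.

```python
def strided_team_members(start, stride, size):
    """World-PE indices of a team defined by (start, stride, size)."""
    return [start + i * stride for i in range(size)]


def unused_symmetric_bytes(n_pes, team, bytes_per_pe):
    """Bytes reserved on PEs outside the team when every PE must allocate."""
    outside = n_pes - len(set(team))
    return outside * bytes_per_pe


if __name__ == "__main__":
    n_pes = 8                              # hypothetical job size
    team = strided_team_members(0, 2, 4)   # even PEs: 0, 2, 4, 6
    # Assumed 1 MiB buffer that only the team needs, but all 8 PEs allocate.
    wasted = unused_symmetric_bytes(n_pes, team, 1 << 20)
    print(team, wasted)                    # prints: [0, 2, 4, 6] 4194304
```

In this hypothetical case half of the PEs reserve 1 MiB each that only the other half uses. With a team-level allocation, that memory would stay free on the PEs outside the team.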
