About team API

Hi,
Could you provide an example of using nvshmemx_team_init and nvshmem_team_destroy? It seems to me that all team members should call nvshmemx_team_init, but only one of the team members needs to call nvshmem_team_destroy. I don’t know if I’m right about this.

In my case, I have 4 PEs. I randomly choose 2 of them to form a team, do some work with the team, and then destroy it. In the meantime, the other 2 PEs simply wait in a call to nvshmem_barrier_all.

I observe that if multiple members call nvshmem_team_destroy, I get this error: non-zero status: 1 cuMulticastUnbind failed for mc_offset 0 on device 1.

It would be very helpful if you could give me some insight into this or provide an example that uses the nvshmem_team APIs.

Thanks!

"only one of the team members needs to call nvshmem_team_destroy. I don’t know if I’m right about this."

From the docs:

The nvshmem_team_destroy routine is a collective operation that destroys the team referenced by the team handle argument team. Upon return, the referenced team is invalid.

Collective means that every member of the team needs to call it.
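Here is a minimal sketch of the collective create/destroy pattern, not a confirmed fix for the error you are seeing. It makes a few assumptions: it uses nvshmem_team_split_strided rather than nvshmemx_team_init (I am not reproducing the latter's signature here), it hard-codes PEs 0 and 1 as the team instead of picking them at random, and it binds one GPU per PE on each node. Note that nvshmem_team_split_strided is collective over the parent team, so all PEs in NVSHMEM_TEAM_WORLD call it, and the PEs that are not selected get NVSHMEM_TEAM_INVALID back.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main() {
    nvshmem_init();

    // Assumption: one GPU per PE on each node.
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    if (npes < 2) {
        if (mype == 0) printf("This sketch needs at least 2 PEs\n");
        nvshmem_finalize();
        return 0;
    }

    // Collective over the parent team: every PE in NVSHMEM_TEAM_WORLD calls this.
    // Illustrative choice: start at PE 0, stride 1, size 2, so PEs 0 and 1 are members.
    nvshmem_team_t team = NVSHMEM_TEAM_INVALID;
    nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, 0, 1, 2, NULL, 0, &team);

    // Membership is decided by the returned handle, not by the PE number.
    if (team != NVSHMEM_TEAM_INVALID) {
        printf("world PE %d is team PE %d of %d\n",
               mype, nvshmem_team_my_pe(team), nvshmem_team_n_pes(team));

        // Placeholder for the real team work.
        nvshmem_team_sync(team);

        // Collective over the team: EVERY member calls destroy, not just one.
        nvshmem_team_destroy(team);
    }

    // Non-members arrive here directly; members arrive after destroying the team.
    nvshmem_barrier_all();

    nvshmem_finalize();
    return 0;
}
```

The important part is that nvshmem_team_destroy sits inside the same membership check on every member PE, so each PE that received a valid handle destroys it exactly once.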

"I observe that if multiple members call nvshmem_team_destroy, I get an error of: non-zero status: 1 cuMulticastUnbind failed for mc_offset 0 on device 1."

This is surprising to me. Can you rerun with NVSHMEM_DEBUG=INFO set and share the logs?

Could you also attach a simple reproducer so I can run it and observe the issue myself?

Thanks