BUG: calling cudaFree(0) before nvshmem_init() makes nvshmem_barrier_all() fail

I couldn't find anything in the docs that says "call cudaSetDevice first, then call nvshmem_init". Did I miss something?

And if cudaSetDevice is necessary, maybe it should live inside the nvshmem_init implementation, so that users don't have to add it manually?

Or a WARNING message would help a lot. It took me a long time to find the problem.