I’m using NVSHMEM to communicate between two GPUs on two H800x1 nodes connected with InfiniBand. After the communication, I call nvshmem_quiet, but I get the error “ibgda_poll_cq failed with error=5”.
I’m confused about why this error occurs and what troubleshooting steps I should take.
Hi hongyu,
Thanks for the feedback. Can I ask that you file this bug report in the NVSHMEM GitHub repository issues section? It will be more visible to the community.
The issue template for the bug also provides guidance on some of the key information we need to be able to properly triage this bug.
Thank you for your reply. I still don’t know exactly why “ibgda_poll_cq failed with error=5” was printed, but I did find a mistake in my use of nvshmem_putmem: the src_ptr I passed was not NVSHMEM symmetric memory, which caused the error.
This issue tripped me up once again: depending on whether P2P is available, different address types are valid as the local buffer in nvshmem_put/get.
This time, instead of an “ibgda_poll_cq failed” error, it caused a hang in my local team’s sync/barrier operations.
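For anyone who hits this later, here is a minimal sketch of the allocation pattern that avoided the error for me. The buffer size and PE target are illustrative; the point is that both the destination and the local source buffer come from the symmetric heap via nvshmem_malloc rather than cudaMalloc, so the put works whether or not a P2P path exists between the GPUs:

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    size_t nbytes = 1024;  // illustrative size

    // Both buffers come from the symmetric heap. Over IBGDA (no P2P path),
    // a plain cudaMalloc'd source buffer is not registered with the NIC,
    // which is how I ended up with the ibgda_poll_cq failure.
    char *src = (char *)nvshmem_malloc(nbytes);
    char *dst = (char *)nvshmem_malloc(nbytes);
    cudaMemset(src, mype, nbytes);
    nvshmem_barrier_all();

    // Put to the next PE; dest and source are both symmetric addresses.
    nvshmem_putmem(dst, src, nbytes, (mype + 1) % npes);
    nvshmem_quiet();        // wait for completion of the outstanding put
    nvshmem_barrier_all();  // ensure every PE's data has landed

    nvshmem_free(dst);
    nvshmem_free(src);
    nvshmem_finalize();
    return 0;
}
```

This is only a sketch of the pattern, not my actual application code; it needs two PEs (e.g. launched via nvshmemrun) to do anything meaningful.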