Thanks for your helpful suggestion.
I have a question about the overhead of blocking versus non-blocking GET. A blocking GET basically has two phases: first sending the request, then waiting for the result to come back. With a non-blocking GET, we still need a
barrier for synchronization whenever we need those non-blocking GETs to have finished.
Are there any statistics on the latency, in cycles, of each of the blocking and non-blocking GET APIs?
If I transfer a batch of data (such as a matrix) each time instead of a single vector, will NVSHMEM's transfer performance improve? I think there may be a tradeoff between transfer performance (throughput) and the overlap of computation with data transfer: coarser-grained bulk transfers would benefit throughput, while finer-grained transfers would improve the overlap of computation and transfer.
Does this claim make sense?
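To make the tradeoff I have in mind concrete, here is a rough back-of-the-envelope sketch in plain C++. It does not call NVSHMEM at all. It assumes a simple latency-plus-bandwidth cost per GET, and that a non-blocking GET for the next chunk can be fully hidden behind the computation on the current chunk (double buffering). `LinkModel`, `alpha_us`, `bytes_per_us`, and all function names are hypothetical, and any numbers plugged in are illustrative, not measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <initializer_list>

// Hypothetical cost model for one GET: fixed overhead plus size / bandwidth.
struct LinkModel {
    double alpha_us;      // assumed fixed per-GET overhead, in microseconds
    double bytes_per_us;  // assumed sustained bandwidth, in bytes per microsecond
};

// Modeled time for a single GET of `bytes` bytes.
double transfer_us(const LinkModel& link, double bytes) {
    return link.alpha_us + bytes / link.bytes_per_us;
}

// Blocking pattern: for each chunk, wait for the GET, then compute on it.
double blocking_total_us(const LinkModel& link, double total_bytes,
                         double total_compute_us, int chunks) {
    assert(chunks >= 1);
    const double t = transfer_us(link, total_bytes / chunks);
    const double c = total_compute_us / chunks;
    return chunks * (t + c);
}

// Non-blocking pattern: the GET for chunk i+1 overlaps the compute on chunk i.
// Two-stage pipeline: fill with one transfer, drain with one compute, and the
// slower of the two stages sets the pace in between.
double overlapped_total_us(const LinkModel& link, double total_bytes,
                           double total_compute_us, int chunks) {
    assert(chunks >= 1);
    const double t = transfer_us(link, total_bytes / chunks);
    const double c = total_compute_us / chunks;
    return t + (chunks - 1) * std::max(t, c) + c;
}

// Print both modeled totals for a few chunk counts.
void print_sweep(const LinkModel& link, double total_bytes,
                 double total_compute_us) {
    for (int chunks : {1, 4, 16, 64}) {
        std::printf("chunks=%3d  blocking=%8.2f us  overlapped=%8.2f us\n",
                    chunks,
                    blocking_total_us(link, total_bytes, total_compute_us, chunks),
                    overlapped_total_us(link, total_bytes, total_compute_us, chunks));
    }
}
```

With made-up inputs such as `LinkModel{2.0, 10000.0}`, 1 MiB of data, and 100 us of compute, this model gives the lowest overlapped time at a moderate chunk count, and the time rises again once the fixed per-GET overhead dominates. Whether real NVSHMEM transfers behave this way is exactly what I am unsure about.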
Also, comparing GET with PUT, does PUT come with lower overhead than GET?
Thanks a lot for your great help!