Problem with cudaMemcpyPeer() - some copies are silently not performed

I have some code running on a 4-GPU system. Each GPU is controlled by its own host thread. At the start of the program, cudaDeviceCanAccessPeer() is used to verify that each GPU can access the memory of every other GPU, and cudaDeviceEnablePeerAccess() is then called for all pairs to enable that access; cudaSetDevice() is also properly called in each host thread.

At the start of the computation, each GPU calculates an array of three integers in its own memory, one integer destined for each of the other three GPUs. After the host threads synchronize on a barrier, each thread calls cudaMemcpyPeer() three times to copy the integers intended for it from the memory of the other GPUs into another integer array in its own memory. However, some of the cudaMemcpyPeer() calls (typically one or two out of the 12 in total) don’t do the work: the function returns cudaSuccess, but the corresponding location in the destination GPU's memory stays unchanged.
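To make the pattern concrete, here is a minimal sketch of the setup described above. It is not the actual code: the kernel fillSendBuf, the helpers producePhase and gatherPhase, the CHECK macro, and the test values 100 * src + dst are all made up for illustration. As simplifications, peer access is enabled from the main thread, each array has one slot per GPU (the GPU's own slot is unused), and joining the host threads stands in for the barrier.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical kernel: slot j of GPU `self` holds the value destined for GPU j.
__global__ void fillSendBuf(int *send, int self, int n)
{
    int j = threadIdx.x;
    if (j < n) send[j] = 100 * self + j;
}

// Runs in the host thread that owns GPU `self`.
static void producePhase(int self, int n, int *send)
{
    CHECK(cudaSetDevice(self));
    fillSendBuf<<<1, n>>>(send, self, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // values must be complete before the barrier
}

// Runs in the host thread that owns GPU `self`, after all producers finished.
static void gatherPhase(int self, int n, int *const *send, int *recv)
{
    CHECK(cudaSetDevice(self));
    for (int src = 0; src < n; ++src) {
        if (src == self) continue;
        // Pull the one integer that GPU `src` computed for GPU `self`.
        CHECK(cudaMemcpyPeer(recv + src, self, send[src] + self, src, sizeof(int)));
    }
    CHECK(cudaDeviceSynchronize());
}

int main()
{
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));
    if (n < 2) { std::printf("need at least 2 GPUs\n"); return 0; }

    std::vector<int *> send(n, nullptr), recv(n, nullptr);
    for (int i = 0; i < n; ++i) {
        CHECK(cudaSetDevice(i));
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            CHECK(cudaDeviceCanAccessPeer(&ok, i, j));
            if (!ok) { std::printf("no peer access %d -> %d\n", i, j); return 0; }
            CHECK(cudaDeviceEnablePeerAccess(j, 0));
        }
        CHECK(cudaMalloc(&send[i], n * sizeof(int)));
        CHECK(cudaMalloc(&recv[i], n * sizeof(int)));
        CHECK(cudaMemset(recv[i], 0xFF, n * sizeof(int)));  // sentinel: every int is -1
    }

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) workers.emplace_back(producePhase, i, n, send[i]);
    for (auto &t : workers) t.join();  // join stands in for the host barrier
    workers.clear();
    for (int i = 0; i < n; ++i) workers.emplace_back(gatherPhase, i, n, send.data(), recv[i]);
    for (auto &t : workers) t.join();

    // Read every receive array back and count slots that were never overwritten.
    int stale = 0;
    for (int i = 0; i < n; ++i) {
        std::vector<int> h(n);
        CHECK(cudaSetDevice(i));
        CHECK(cudaMemcpy(h.data(), recv[i], n * sizeof(int), cudaMemcpyDeviceToHost));
        for (int src = 0; src < n; ++src) {
            if (src != i && h[src] != 100 * src + i) {
                std::printf("GPU %d, slot %d: got %d\n", i, src, h[src]);
                ++stale;
            }
        }
    }
    std::printf("%d stale slot(s)\n", stale);
    return stale != 0;
}
```

This should build with something like nvcc -std=c++14 (on Linux possibly with -Xcompiler -pthread). The sentinel written by cudaMemset() makes a skipped copy show up as -1 in the final check, instead of as whatever happened to be in the buffer.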

I have thoroughly checked everything (all return codes, the memory contents on both the sending and the receiving side, etc.), but to no avail. It’s certainly true that this small exchange could be accomplished through host memory, but the numbers exchanged are actually the sizes of larger arrays, which are to be exchanged through the same mechanism as the next step in the code. The same code seemingly works fine on another test machine, but that one has a 2-GPU configuration. So: any ideas on what the problem may be, or any suggestions on how to proceed in debugging it?
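For reference, the host-memory alternative mentioned above could look roughly like the sketch below. Again, this is an illustration under assumptions, not the actual code: it runs in a single host thread, the CHECK macro and the placeholder values 100 * src + dst are made up, and each array has one slot per GPU with the GPU's own slot unused.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

int main()
{
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));
    if (n < 2) { std::printf("need at least 2 GPUs\n"); return 0; }

    std::vector<int *> send(n, nullptr), recv(n, nullptr);
    for (int i = 0; i < n; ++i) {
        CHECK(cudaSetDevice(i));
        CHECK(cudaMalloc(&send[i], n * sizeof(int)));
        CHECK(cudaMalloc(&recv[i], n * sizeof(int)));
        // Placeholder data; in the real code a kernel would compute these sizes.
        std::vector<int> init(n);
        for (int j = 0; j < n; ++j) init[j] = 100 * i + j;
        CHECK(cudaMemcpy(send[i], init.data(), n * sizeof(int), cudaMemcpyHostToDevice));
    }

    // Stage 1: gather every send array into one host table,
    // where table[src * n + dst] is the value GPU src has for GPU dst.
    std::vector<int> table(n * n, 0);
    for (int src = 0; src < n; ++src) {
        CHECK(cudaSetDevice(src));
        CHECK(cudaMemcpy(&table[src * n], send[src], n * sizeof(int), cudaMemcpyDeviceToHost));
    }

    // Stage 2: each GPU receives its column of the table.
    std::vector<int> column(n);
    for (int dst = 0; dst < n; ++dst) {
        for (int src = 0; src < n; ++src) column[src] = table[src * n + dst];
        CHECK(cudaSetDevice(dst));
        CHECK(cudaMemcpy(recv[dst], column.data(), n * sizeof(int), cudaMemcpyHostToDevice));
    }

    std::printf("exchanged %d values through host memory\n", n * (n - 1));
    return 0;
}
```

This only moves the small size values through the host; the larger arrays would still go directly between GPUs with cudaMemcpyPeer().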