VERY strange cudaMemcpy() behavior

I have run into something very weird, which seems to happen intermittently.

I have a multithreaded app built using openmp (one thread per device), and sometimes memcpys between two devices fail.

Observe the output below:

int n = omp_get_thread_num();
int ngpu = omp_get_num_threads();
int last=(n+ngpu-1)%ngpu;
buffer.reshape(dims0, dims1);
CUDA(cudaDeviceSynchronize());
#pragma omp barrier
this_ptr[n] = buffer.data;
CUDA(cudaDeviceSynchronize());
#pragma omp barrier
float *ptr 		= data;
loat *tmp_last = this_ptr[last];
float *tmp 		= this_ptr[n];
float alpha=1.0f;
float src=0;
float dst=0;
int bytes=16;
MSG("Writing %d bytes from gpu %d to gpu %d, pointers %p to %p", bytes, n, last, ptr + start, tmp_last + start);
CUDA(cudaMemcpy(tmp_last + start, ptr + start, bytes, cudaMemcpyDeviceToDevice));
CUDA(cudaDeviceSynchronize());
#pragma omp barrier

CUDA(cudaMemcpy(&src,ptr+start,4,cudaMemcpyDefault));
CUDA(cudaMemcpy(&dst,tmp_last+start,4,cudaMemcpyDefault));
MSG("%6f,%6f",src,dst);

Produces the following output:

darray_v2.cpp(806):Writing 16 bytes from gpu 4 to gpu 3, pointers 0x2302c02830 to 0x2301e01c30
darray_v2.cpp(806):Writing 16 bytes from gpu 5 to gpu 4, pointers 0x2302a02840 to 0x2302c02a40
darray_v2.cpp(806):Writing 16 bytes from gpu 1 to gpu 0, pointers 0x2302e01a00 to 0x2301f01c00
darray_v2.cpp(806):Writing 16 bytes from gpu 7 to gpu 6, pointers 0x2302802860 to 0x2302002a60
darray_v2.cpp(806):Writing 16 bytes from gpu 0 to gpu 7, pointers 0x2301f01a70 to 0x2302802a70
darray_v2.cpp(806):Writing 16 bytes from gpu 6 to gpu 5, pointers 0x2302002850 to 0x2302a02a50
darray_v2.cpp(806):Writing 16 bytes from gpu 3 to gpu 2, pointers 0x2301e01a20 to 0x2302601c20
darray_v2.cpp(806):Writing 16 bytes from gpu 2 to gpu 1, pointers 0x2302601a10 to 0x2302e01c10
darray_v2.cpp(814):169203.000000,169203.000000
darray_v2.cpp(814):169719.000000,169719.000000
darray_v2.cpp(814):159665.000000,0.000000
darray_v2.cpp(814):166255.000000,166255.000000
darray_v2.cpp(814):168117.000000,168117.000000
darray_v2.cpp(814):145700.000000,145700.000000
darray_v2.cpp(814):170900.000000,170900.000000
darray_v2.cpp(814):169378.000000,169378.000000

So the outputs above are simply the first element of the source and destination copied to host, and should in theory match one-to-one, but for some reason, one of the threads did not copy the data over.

Here’s a rerun

darray_v2.cpp(806):Writing 16 bytes from gpu 5 to gpu 4, pointers 0x2302802840 to 0x2302e02a40
darray_v2.cpp(806):Writing 16 bytes from gpu 2 to gpu 1, pointers 0x2302601a10 to 0x2302101c10
darray_v2.cpp(806):Writing 16 bytes from gpu 4 to gpu 3, pointers 0x2302e02830 to 0x2302c01c30
darray_v2.cpp(806):Writing 16 bytes from gpu 0 to gpu 7, pointers 0x2302401a70 to 0x2301f02a70
darray_v2.cpp(806):Writing 16 bytes from gpu 7 to gpu 6, pointers 0x2301f02860 to 0x2302a02a60
darray_v2.cpp(806):Writing 16 bytes from gpu 3 to gpu 2, pointers 0x2302c01a20 to 0x2302601c20
darray_v2.cpp(806):Writing 16 bytes from gpu 1 to gpu 0, pointers 0x2302101a00 to 0x2302401c00
darray_v2.cpp(806):Writing 16 bytes from gpu 6 to gpu 5, pointers 0x2302a02850 to 0x2302802a50
darray_v2.cpp(814):169719.000000,0.000000
darray_v2.cpp(814):145700.000000,145700.000000
darray_v2.cpp(814):169378.000000,169378.000000
darray_v2.cpp(814):166255.000000,166255.000000
darray_v2.cpp(814):168117.000000,168117.000000
darray_v2.cpp(814):170900.000000,170900.000000
darray_v2.cpp(814):159665.000000,0.000000
darray_v2.cpp(814):169203.000000,169203.000000

HELP!!!

I have appeared to fix the problem by NOT enabling peer to peer between the devices.

Previously,
15 executions yielded 12 with errors

Now,
10 executions for 0 errors.

Can never say for sure with these problems but it LOOKS to have gone away.