Multi-GPU code crashing

I am having serious problems with multi-GPU code - specifically, a reduce + broadcast of a 256 MB array using cudaMemcpyAsync calls spread across multiple streams.

Each run of my code - which at the moment just creates a random 256 MB array on each GPU and reduces + broadcasts it 10 times before exiting - does one of four things (a simplified sketch of the loop is included after the list):

  1. Runs properly
  2. Runs properly but very slowly
  3. Kernels crash and it runs very slowly
  4. Complete system lock on the first iteration
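
For reference, one iteration looks roughly like the sketch below. This is heavily simplified and not my real code: buffer names are placeholders, error checking is dropped, and the real version spreads the copies over several streams rather than serialising everything on one stream owned by GPU 0.

```cpp
#include <cuda_runtime.h>

// One reduce + broadcast iteration: pull every GPU's buffer onto GPU 0,
// accumulate it there, then push the result back out again.
__global__ void addInto(float *dst, const float *src, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

void reduceBroadcast(int ngpu, size_t n,
                     float **buf,          // buf[g] lives on GPU g, n floats
                     float *scratch0,      // scratch buffer on GPU 0, n floats
                     cudaStream_t stream0) // stream owned by GPU 0
{
    cudaSetDevice(0);

    // Reduce: copy each GPU's buffer to GPU 0 and add it into buf[0].
    for (int g = 1; g < ngpu; ++g) {
        cudaMemcpyPeerAsync(scratch0, 0, buf[g], g, n * sizeof(float), stream0);
        addInto<<<(n + 255) / 256, 256, 0, stream0>>>(buf[0], scratch0, n);
    }

    // Broadcast: copy the accumulated result back to the other GPUs.
    for (int g = 1; g < ngpu; ++g) {
        cudaMemcpyPeerAsync(buf[g], g, buf[0], 0, n * sizeof(float), stream0);
    }
    cudaStreamSynchronize(stream0);
}
```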

What’s more, the above behaviour happens on both Windows AND Linux. Because I simply could not get it to run reliably, I contracted another programmer to do some work for me, and he is hitting the same issue on his completely separate Linux box.

As for further specifics: I am creating (ngpu) streams for each of the (ngpu) GPUs, as well as (ngpu)*(ngpu) events which tie the cross-GPU ordering together.
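
In other words, something like this (again a sketch with placeholder names, and error checks omitted):

```cpp
#include <cuda_runtime.h>
#include <vector>

// ngpu streams on each of the ngpu GPUs, plus ngpu*ngpu events so that a
// stream on one GPU can be made to wait on work issued on another GPU.
void setupStreamsAndEvents(int ngpu,
                           std::vector<cudaStream_t> &streams, // ngpu*ngpu
                           std::vector<cudaEvent_t>  &events)  // ngpu*ngpu
{
    streams.resize(ngpu * ngpu);
    events.resize(ngpu * ngpu);
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        for (int s = 0; s < ngpu; ++s) {
            cudaStreamCreate(&streams[g * ngpu + s]);
            cudaEventCreateWithFlags(&events[g * ngpu + s],
                                     cudaEventDisableTiming);
        }
    }
}

// The cross-GPU ordering is then: record an event on the source GPU's
// stream, and have the destination GPU's stream wait on that event, e.g.
//   cudaSetDevice(src);
//   cudaEventRecord(events[src * ngpu + dst], streams[src * ngpu + dst]);
//   cudaSetDevice(dst);
//   cudaStreamWaitEvent(streams[dst * ngpu + src], events[src * ngpu + dst], 0);
```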

Are there known issues with multi-GPU code that I should be aware of?

Update:
The crash seems to happen more often on Linux. Factors that may be linked to the problem:
Using cudaMemcpyAsync(…) in conjunction with cudaDeviceEnablePeerAccess(…)
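
In isolation, that suspect combination looks like the sketch below (placeholder names, not my real code): peer access is enabled once at startup, then the cross-GPU copy is issued with cudaMemcpyAsync on a stream owned by GPU 0.

```cpp
#include <cuda_runtime.h>

void enablePeerAndCopy(float *dst_on_gpu1, const float *src_on_gpu0,
                       size_t bytes, cudaStream_t stream_on_gpu0)
{
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0 address GPU 1's memory
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);  // and the other direction

    cudaSetDevice(0);
    cudaMemcpyAsync(dst_on_gpu1, src_on_gpu0, bytes,
                    cudaMemcpyDeviceToDevice, stream_on_gpu0);
}
```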