I am having serious problems with multi-GPU code - specifically, a reduce + broadcast of a 256 MB array using asynchronous memcpy calls across multiple streams.
Each run of my code, which at the moment only creates a random 256 MB array on each GPU and then reduces + broadcasts it 10 times before exiting, does one of four things:
- Runs properly
- Runs properly but very slowly
- Kernels crash and it runs very slowly
- Complete system lock on the first iteration
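To tell the four outcomes above apart, it helps to check every runtime call and to time and synchronize each device separately. The following is only a minimal sketch of that idea, not my actual code: the kernel `add_inplace` and the macro `CUDA_CHECK` are hypothetical names made up for illustration, and the buffers are zero-filled rather than random.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the reduction kernel: dst[i] += src[i].
__global__ void add_inplace(float* dst, const float* src, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

int main() {
    int ngpu = 0;
    CUDA_CHECK(cudaGetDeviceCount(&ngpu));
    const size_t n = (size_t)64 << 20;  // 64 Mi floats = 256 MB

    for (int d = 0; d < ngpu; ++d) {
        CUDA_CHECK(cudaSetDevice(d));
        float *a = nullptr, *b = nullptr;
        CUDA_CHECK(cudaMalloc(&a, n * sizeof(float)));
        CUDA_CHECK(cudaMalloc(&b, n * sizeof(float)));
        CUDA_CHECK(cudaMemset(a, 0, n * sizeof(float)));
        CUDA_CHECK(cudaMemset(b, 0, n * sizeof(float)));

        cudaEvent_t start, stop;
        CUDA_CHECK(cudaEventCreate(&start));
        CUDA_CHECK(cudaEventCreate(&stop));

        CUDA_CHECK(cudaEventRecord(start));
        add_inplace<<<(unsigned)((n + 255) / 256), 256>>>(a, b, n);
        CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
        CUDA_CHECK(cudaEventRecord(stop));
        CUDA_CHECK(cudaEventSynchronize(stop)); // execution errors surface here

        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("device %d: kernel took %.3f ms\n", d, ms);

        CUDA_CHECK(cudaEventDestroy(start));
        CUDA_CHECK(cudaEventDestroy(stop));
        CUDA_CHECK(cudaFree(a));
        CUDA_CHECK(cudaFree(b));
    }
    return 0;
}
```

With checks like these, a crashed kernel shows up as an error string on a specific device and line, and a slow run shows up in the per-device timings instead of only in the overall wall-clock time.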
What’s more, this behavior happens on both Windows and Linux. Because I simply could not get it to run reliably, I contracted another programmer to do some work for me, and he is hitting the same issue on his completely separate Linux box.
To be more specific, I am creating (ngpu) streams on each of the (ngpu) GPUs, as well as (ngpu)*(ngpu) events which tie the code together.
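A minimal sketch of this kind of setup, not the actual code, might look like the following. It shows only the broadcast half: each GPU pushes its own chunk to every other GPU on a dedicated stream, records an event, and the receiving GPU's stream waits on that event. The indexing scheme (`streams[d*ngpu + s]`, `events[src*ngpu + dst]`), the chunking, and the `CUDA_CHECK` macro are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    int ngpu = 0;
    CUDA_CHECK(cudaGetDeviceCount(&ngpu));
    if (ngpu < 2) { printf("need at least 2 GPUs\n"); return 0; }

    const size_t n = (size_t)64 << 20;  // 64 Mi floats = 256 MB per GPU
    const size_t chunk = n / ngpu;      // remainder elements ignored in this sketch

    std::vector<float*> buf(ngpu, nullptr);
    std::vector<cudaStream_t> streams(ngpu * ngpu);  // streams[d*ngpu + s] lives on device d
    std::vector<cudaEvent_t> events(ngpu * ngpu);    // events[src*ngpu + dst] lives on device src

    // Create (ngpu) streams and (ngpu) events on each of the (ngpu) devices.
    for (int d = 0; d < ngpu; ++d) {
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaMalloc(&buf[d], n * sizeof(float)));
        CUDA_CHECK(cudaMemset(buf[d], 0, n * sizeof(float)));
        for (int s = 0; s < ngpu; ++s) {
            CUDA_CHECK(cudaStreamCreate(&streams[d * ngpu + s]));
            CUDA_CHECK(cudaEventCreateWithFlags(&events[d * ngpu + s],
                                                cudaEventDisableTiming));
        }
        CUDA_CHECK(cudaDeviceSynchronize());  // make sure setup has finished
    }

    // Each source device copies its own chunk to every other device.
    for (int src = 0; src < ngpu; ++src) {
        CUDA_CHECK(cudaSetDevice(src));
        for (int dst = 0; dst < ngpu; ++dst) {
            if (dst == src) continue;
            cudaStream_t st = streams[src * ngpu + dst];
            CUDA_CHECK(cudaMemcpyPeerAsync(buf[dst] + src * chunk, dst,
                                           buf[src] + src * chunk, src,
                                           chunk * sizeof(float), st));
            // Event and stream are both on device src.
            CUDA_CHECK(cudaEventRecord(events[src * ngpu + dst], st));
        }
    }

    // Each destination's own stream waits until all incoming chunks have landed.
    for (int dst = 0; dst < ngpu; ++dst) {
        CUDA_CHECK(cudaSetDevice(dst));
        for (int src = 0; src < ngpu; ++src) {
            if (src == dst) continue;
            CUDA_CHECK(cudaStreamWaitEvent(streams[dst * ngpu + dst],
                                           events[src * ngpu + dst], 0));
        }
    }

    // Drain all devices, then clean up.
    for (int d = 0; d < ngpu; ++d) {
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaDeviceSynchronize());
    }
    for (int d = 0; d < ngpu; ++d) {
        CUDA_CHECK(cudaSetDevice(d));
        for (int s = 0; s < ngpu; ++s) {
            CUDA_CHECK(cudaStreamDestroy(streams[d * ngpu + s]));
            CUDA_CHECK(cudaEventDestroy(events[d * ngpu + s]));
        }
        CUDA_CHECK(cudaFree(buf[d]));
    }
    printf("broadcast step completed on %d GPUs\n", ngpu);
    return 0;
}
```

In this sketch every event is recorded in a stream that belongs to the same device as the event, while the wait is issued from a stream on a different device. Direct peer access is not enabled here, so the copies may be staged through host memory on systems where the GPUs cannot reach each other directly.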
Are there known issues with multi-GPU code that I should be aware of?