My application lost performance after I installed the new driver, so I wrote a smaller program to reproduce the effect. Could you help me investigate the cause?
The program basically performs a reduction on both GPUs, then sends the result from GPU 1 to GPU 0, which reduces it with its own result. I split the result into multiple parts so that computation and communication can overlap, and I synchronize using events and streams.
I have attached the code to this post. I am using 2 x S2050 with ECC off on Red Hat (kernel 2.6.18-128.1.14.el5).
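
For readers without the attachment, the scheme described above looks roughly like the sketch below. This is not the attached program, only a minimal illustration under several assumptions: the "reduction" is a plain element-wise add, the transfer uses `cudaMemcpyPeerAsync` (which requires CUDA 4.0 or later), and there is one stream per GPU and one event per chunk. The names `addInto`, `runPipeline`, `recv` and `arrived` are made up for this sketch. If only one GPU is present, it falls back to using device 0 for both roles.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Element-wise reduction step: acc[i] += src[i] (stand-in for the real reduction).
__global__ void addInto(float *acc, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acc[i] += src[i];
}

// Hypothetical driver: every input element is 1.0f; returns GPU 0's final buffer.
std::vector<float> runPipeline(int nChunks, int chunkElems) {
    int devCount = 0;
    CHECK(cudaGetDeviceCount(&devCount));
    int dev[2] = {0, devCount > 1 ? 1 : 0};  // assumption: single-GPU fallback
    const int n = nChunks * chunkElems;
    const size_t bytes = (size_t)n * sizeof(float);
    const size_t chunkBytes = (size_t)chunkElems * sizeof(float);
    std::vector<float> host(n, 1.0f);

    float *a[2], *b[2], *recv;
    cudaStream_t stream[2];
    for (int g = 0; g < 2; ++g) {
        CHECK(cudaSetDevice(dev[g]));
        CHECK(cudaMalloc(&a[g], bytes));
        CHECK(cudaMalloc(&b[g], bytes));
        CHECK(cudaMemcpy(a[g], host.data(), bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(b[g], host.data(), bytes, cudaMemcpyHostToDevice));
        CHECK(cudaStreamCreate(&stream[g]));
    }
    CHECK(cudaSetDevice(dev[0]));
    CHECK(cudaMalloc(&recv, bytes));  // landing buffer on GPU 0 for GPU 1's chunks

    // One event per chunk, created on GPU 1 because GPU 1's stream records it.
    std::vector<cudaEvent_t> arrived(nChunks);
    CHECK(cudaSetDevice(dev[1]));
    for (auto &e : arrived)
        CHECK(cudaEventCreateWithFlags(&e, cudaEventDisableTiming));

    const int threads = 256;
    const int blocks = (chunkElems + threads - 1) / threads;
    for (int c = 0; c < nChunks; ++c) {
        const size_t off = (size_t)c * chunkElems;
        // GPU 1: reduce the chunk locally, send it to GPU 0, mark its arrival.
        CHECK(cudaSetDevice(dev[1]));
        addInto<<<blocks, threads, 0, stream[1]>>>(a[1] + off, b[1] + off, chunkElems);
        CHECK(cudaMemcpyPeerAsync(recv + off, dev[0], a[1] + off, dev[1],
                                  chunkBytes, stream[1]));
        CHECK(cudaEventRecord(arrived[c], stream[1]));
        // GPU 0: reduce its own chunk, wait for the peer's chunk, then combine.
        CHECK(cudaSetDevice(dev[0]));
        addInto<<<blocks, threads, 0, stream[0]>>>(a[0] + off, b[0] + off, chunkElems);
        CHECK(cudaStreamWaitEvent(stream[0], arrived[c], 0));
        addInto<<<blocks, threads, 0, stream[0]>>>(a[0] + off, recv + off, chunkElems);
    }
    CHECK(cudaGetLastError());

    // Wait for GPU 0's stream, then fetch the combined result.
    CHECK(cudaStreamSynchronize(stream[0]));
    CHECK(cudaMemcpy(host.data(), a[0], bytes, cudaMemcpyDeviceToHost));

    for (int g = 0; g < 2; ++g) {
        CHECK(cudaSetDevice(dev[g]));
        CHECK(cudaStreamSynchronize(stream[g]));
        CHECK(cudaFree(a[g]));
        CHECK(cudaFree(b[g]));
        CHECK(cudaStreamDestroy(stream[g]));
    }
    for (auto &e : arrived) CHECK(cudaEventDestroy(e));
    CHECK(cudaSetDevice(dev[0]));
    CHECK(cudaFree(recv));
    return host;
}
```

Calling `runPipeline(4, 1 << 10)` should return 4096 elements, each equal to 4.0f (1 + 1 on each GPU, then the two partial results added on GPU 0). With a single stream per GPU, the copy of chunk c overlaps with GPU 0's kernel on the same chunk but still serializes with GPU 1's next kernel; a separate copy stream on GPU 1 would be one way to overlap those as well.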