Synchronization problem

Hello,

In my code, I am using 2 streams.

On each stream, the following four steps are performed:
Step 1. Copy data from host to the device using cudaMemcpyAsync
Step 2. 5 kernels are run
Step 3. Some device Memory is cleared using cudaMemsetAsync
Step 4. 7 kernels are run

Step 5: Both streams are synchronized with the host using cudaDeviceSynchronize.

After synchronization,

Step 6: On the 2nd stream, one kernel is run and its result is copied from the device to the host. (This last kernel takes its input from the output of both streams.)
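
To make the sequence above concrete, here is a minimal sketch of it. This is not my actual code: the kernel names (stageKernel, combineKernel), the buffer names and the sizes are placeholders, and the sketch assumes both streams are created with cudaStreamCreate before use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-stream kernels.
__global__ void stageKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical final kernel: consumes the output of both streams.
__global__ void combineKernel(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t stream[2];
    float *hIn[2], *dData[2], *dScratch[2], *dOut, *hOut;

    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);                 // assumption: explicit stream creation
        cudaMallocHost((void **)&hIn[s], bytes);      // pinned host memory for async copies
        cudaMalloc((void **)&dData[s], bytes);
        cudaMalloc((void **)&dScratch[s], bytes);
        for (int i = 0; i < n; ++i) hIn[s][i] = (float)s;
    }
    cudaMalloc((void **)&dOut, bytes);
    cudaMallocHost((void **)&hOut, bytes);

    for (int s = 0; s < 2; ++s) {
        // Step 1: host-to-device copy
        cudaMemcpyAsync(dData[s], hIn[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        // Step 2: 5 kernels
        for (int k = 0; k < 5; ++k)
            stageKernel<<<blocks, threads, 0, stream[s]>>>(dData[s], n);
        // Step 3: clear some device memory
        cudaMemsetAsync(dScratch[s], 0, bytes, stream[s]);
        // Step 4: 7 kernels
        for (int k = 0; k < 7; ++k)
            stageKernel<<<blocks, threads, 0, stream[s]>>>(dData[s], n);
    }

    // Step 5: wait for both streams
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "step 5 failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Step 6: final kernel and device-to-host copy on the 2nd stream
    combineKernel<<<blocks, threads, 0, stream[1]>>>(dData[0], dData[1], dOut, n);
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, stream[1]);
    err = cudaStreamSynchronize(stream[1]);
    if (err != cudaSuccess) {
        fprintf(stderr, "step 6 failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("hOut[0] = %f\n", hOut[0]);

    for (int s = 0; s < 2; ++s) {
        cudaFree(dData[s]);
        cudaFree(dScratch[s]);
        cudaFreeHost(hIn[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFree(dOut);
    cudaFreeHost(hOut);
    return 0;
}
```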

I have two machines:

  1. OS: Windows 7 64-bit, Enterprise
    CUDA Toolkit: 4.2
    GPU: GTX 480
    Visual Studio 2010 Professional

  2. OS: Windows 7 64-bit, Enterprise
    CUDA Toolkit: 4.2
    GPU: GTX 580
    Visual Studio 2010 Professional

On machine 1, the code runs without any problem.

On machine 2, the code gives a synchronization error at step 5. However, if I initialize the two stream variables to 0, the code runs without any problem.
To narrow it down, I placed cudaDeviceSynchronize calls before and after the last kernel of step 4. The synchronization after the kernel is the one that fails.
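
For reference, this is roughly how the two cudaDeviceSynchronize calls around the last kernel can be checked. It is only a sketch: the CUDA_CHECK macro and the kernel name lastKernelOfStep4 are hypothetical, and the stream is assumed to come from cudaStreamCreate. The cudaGetLastError call after the launch reports launch errors, while the return value of the second cudaDeviceSynchronize reports errors that occur while the kernel runs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call and abort on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, \
                    #call, cudaGetErrorString(e_));                       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Hypothetical stand-in for the last kernel of step 4.
__global__ void lastKernelOfStep4(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *d = NULL;
    cudaStream_t stream;

    CUDA_CHECK(cudaStreamCreate(&stream));   // assumption: stream is created explicitly
    CUDA_CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemsetAsync(d, 0, n * sizeof(float), stream));

    // Before the kernel: surfaces errors from all earlier asynchronous work.
    CUDA_CHECK(cudaDeviceSynchronize());

    lastKernelOfStep4<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    CUDA_CHECK(cudaGetLastError());          // errors from the launch itself

    // After the kernel: surfaces errors raised while the kernel executed.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d));
    CUDA_CHECK(cudaStreamDestroy(stream));
    printf("no CUDA errors reported\n");
    return 0;
}
```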

Can anybody explain why this is happening?

Thanks,
Venkateswarlu