cudaMemcpyAsync clarification required & help needed

The CUDA 2.2 guide says the following:

"Two commands from different streams cannot run concurrently if either a page-locked host memory allocation, a device memory allocation, a device memory set, a device ↔ device memory copy, or any CUDA command to stream 0 is called in-between them by the host thread."

Q1. Does the above mean that I cannot use a cudaMemcpy between two cudaMemcpyAsync calls?
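To make Q1 concrete, here is a stripped-down sketch of the pattern I am asking about. The buffer and stream names (h_a, d_a, s1 and so on) are made up for illustration and are not from my real code. As I read the quoted sentence, it talks about whether the two async copies can overlap, not about whether the sequence is allowed at all, but that is exactly what I would like confirmed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);
    float *h_a = NULL, *h_b = NULL, *h_c = NULL;   // hypothetical host buffers
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;   // hypothetical device buffers
    cudaStream_t s1, s2;

    if (cudaStreamCreate(&s1) != cudaSuccess ||
        cudaStreamCreate(&s2) != cudaSuccess) {
        fprintf(stderr, "stream creation failed\n");
        return 1;
    }

    // Page-locked host buffers for the two async copies,
    // an ordinary malloc'ed buffer for the blocking copy.
    cudaMallocHost((void**)&h_a, bytes);
    cudaMallocHost((void**)&h_c, bytes);
    h_b = (float*)malloc(bytes);
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    if (!h_a || !h_b || !h_c || !d_a || !d_b || !d_c) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(h_a, 0, bytes);
    memset(h_b, 0, bytes);
    memset(h_c, 0, bytes);

    // The pattern in question: a blocking copy (stream 0)
    // issued between two async copies on different streams.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_c, h_c, bytes, cudaMemcpyHostToDevice, s2);

    // Wait for both streams before reporting status or freeing anything.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    cudaFreeHost(h_a); cudaFreeHost(h_c);
    free(h_b);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```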

I have the following situation:

function XYZ()
{
    // .....
    // allocate memory for the device variables
    // ........

    // create streams
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // allocate page-locked memory for the host pointers
    cudaMallocHost((void**)&tempFrame, frame_size);
    cudaMallocHost((void**)&h_SamplesData, all_samples_data_size);

    // sync operation
    err = cudaGetLastError();
    cudaMemcpy(d_SampleAttrib, h_SampleAttrib, attrib_size, cudaMemcpyHostToDevice);
    err = cudaGetLastError();

    // sync operation
    err = cudaGetLastError();
    cudaMemcpy(d_ProjMat, proj_mat_data, cDim * iSampleLen * sizeof(float), cudaMemcpyHostToDevice);
    err = cudaGetLastError();

    // async operation
    err = cudaGetLastError();
    cudaMemcpyAsync(d_ImgFrameData, tempFrame, frame_size, cudaMemcpyHostToDevice, stream1);
    err = cudaGetLastError();

    // query whether all operations in stream 1 are done
    err = cudaGetLastError();
    cudaStreamQuery(stream1);
    err = cudaGetLastError();

    // kernel call
    foo<<<BLOCKS, THREADS>>>(.....);

    // destroy stream 1
    cudaStreamDestroy(stream1);

    // copy the processed data back to another host pointer with page-locked memory
    cudaMemcpyAsync(h_SamplesData, d_SamplesData, all_samples_data_size, cudaMemcpyDeviceToHost, stream2);

    // destroy stream 2
    cudaStreamDestroy(stream2);

    // remaining code: free memory etc.
}

The above code segment is called every time a new frame arrives from a video file. The sequence of operations runs successfully only the first time.

The second time, I get an exception at the first cudaMemcpy() operation.

I tried to debug using device emulation mode and I get the error there as well. I am also checking the values in the device variables, viz. d_SampleAttrib, d_ProjMat and d_ImgFrameData.
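To narrow down which call actually fails, each runtime call could be checked on its own return value instead of reading cudaGetLastError() around it. The following is only a minimal sketch: the CHECK macro, the foo kernel body and the buffer names are hypothetical stand-ins, not my real code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call with file and line, then exit.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t e_ = (call);                                     \
        if (e_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,      \
                    __LINE__, #call, cudaGetErrorString(e_));        \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical kernel standing in for foo: doubles every element.
__global__ void foo(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    float *h_data = NULL, *d_data = NULL;

    CHECK(cudaMallocHost((void**)&h_data, bytes));
    CHECK(cudaMalloc((void**)&d_data, bytes));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));

    foo<<<(n + 255) / 256, 256>>>(d_data, n);
    CHECK(cudaGetLastError());        // errors from the launch itself
    CHECK(cudaStreamSynchronize(0));  // errors raised while the kernel ran

    CHECK(cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    printf("all calls succeeded\n");
    return 0;
}
```

With a wrapper like this, the first failing call is reported with its source line, which should show whether the cudaMemcpy itself fails or only reports an earlier error.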

FYI (this may help in spotting the error):

In the first iteration, the values in each of these device variables are different. In the second iteration, when memory is allocated for the device variables again, I find that the values in all three are the same.
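Since the function allocates and releases everything on every call, one thing I want to rule out is a mismatch in that lifecycle. Below is a minimal sketch of what I understand a self-contained per-frame call should look like. processFrame, scale and the buffer names are hypothetical stand-ins, and I am not claiming that this is the cause of the exception.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real processing.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical per-frame function: everything created here is released here.
cudaError_t processFrame(int n) {
    const size_t bytes = n * sizeof(float);
    float *h_in = NULL, *h_out = NULL, *d_buf = NULL;
    cudaStream_t stream;

    cudaError_t err = cudaStreamCreate(&stream);
    if (err != cudaSuccess) return err;

    if (err == cudaSuccess) err = cudaMallocHost((void**)&h_in, bytes);
    if (err == cudaSuccess) err = cudaMallocHost((void**)&h_out, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_buf, bytes);

    if (err == cudaSuccess) {
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

        // Wait for the copies and the kernel before h_out is read
        // or any buffer is released.
        err = cudaStreamSynchronize(stream);
    }

    // Matching release calls: cudaFree for cudaMalloc,
    // cudaFreeHost (not free) for cudaMallocHost.
    if (d_buf) cudaFree(d_buf);
    if (h_in)  cudaFreeHost(h_in);
    if (h_out) cudaFreeHost(h_out);
    cudaStreamDestroy(stream);
    return err;
}

int main() {
    // Call the function several times, as happens once per video frame.
    for (int frame = 0; frame < 3; ++frame) {
        cudaError_t err = processFrame(1 << 16);
        printf("frame %d: %s\n", frame, cudaGetErrorString(err));
    }
    return 0;
}
```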

Q2. Can you point out where I am going wrong?

After reading the guide, I moved my async operation after the cudaMemcpy operations. I am still not successful.

I wrote another simple program in which I copy data from host to device and back using cudaMemcpyAsync, and it works well. But I have no experience mixing cudaMemcpy and cudaMemcpyAsync operations.

If more information is needed, please tell me. Thanks all.