Questions on Streams

Hi, I have some problems using streams. Could someone please help?

I was trying a simple program and found that using a single stream didn’t work (two or more streams did work, though). Here’s my code:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int hstring[4]={1, 2, 3, 4}, nstring[8];
    int size=4*sizeof(int);
    int *dstring, i;
    cudaStream_t stream[1];

    cudaMalloc((void **) &dstring, size);

    /* blocking copy: host -> device */
    cudaMemcpy(dstring, hstring, size, cudaMemcpyHostToDevice);

    /* async copy: device -> first half of nstring, then wait for the stream */
    cudaStreamCreate(stream);
    cudaMemcpyAsync(nstring, dstring, size, cudaMemcpyDeviceToHost, stream[0]);
    cudaError_t ret=cudaStreamSynchronize(stream[0]);

    /* blocking copy: device -> second half of nstring */
    cudaMemcpy(nstring+4, dstring, size, cudaMemcpyDeviceToHost);

    cudaFree(dstring);

    for (i=0; i<8; i++)
        printf("new string[%d]=%d\n", i, nstring[i]);

    return 0;
}

Here’s what I got:
nstring[0]=134515122
nstring[1]=134520512
nstring[2]=10178336
nstring[3]=-1077008584
nstring[4]=1
nstring[5]=2
nstring[6]=3
nstring[7]=4

So the data on the device is correct (since nstring[4] through nstring[7] are correct), but why can’t stream[0] transfer it from device to host?

Given that one stream can complete an independent task, can I create 3 streams in every iteration of a for loop and let them do 3 different tasks?

Thank you for the help!

I don’t have an answer, but your code is clear (great example!). I compiled it on Windows XP with SDK 2.0B2, using a compute capability 1.1 device, and got the same results as you.

Changing the Async call to a regular cudaMemcpy works as expected. I rearranged some lines, even put in a manual delay of many seconds, and the transfer still failed. Since it fails even with a delay, it’s not a race condition.

As for your question about using 3 streams in a loop, sure, that should work. Did you pick 3 because your task has computations in stage N that depend on the results from stage N-2?

You should check for error values returned from cudaMemcpyAsync.
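Something like the following is a minimal sketch of that check, applied to a cut-down version of the program above. The CHECK macro is a hypothetical helper of my own, not part of CUDA; it only uses cudaGetErrorString. Whether the async copy actually reports an error here probably depends on the toolkit version, so treat this as a way to find out rather than a prediction.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): print any CUDA error
// together with the source location of the failing call.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess)                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
    } while (0)

int main()
{
    int hstring[4] = {1, 2, 3, 4}, nstring[4] = {0};
    int size = 4 * sizeof(int);
    int *dstring;
    cudaStream_t stream;

    CHECK(cudaMalloc((void **)&dstring, size));
    CHECK(cudaMemcpy(dstring, hstring, size, cudaMemcpyHostToDevice));
    CHECK(cudaStreamCreate(&stream));

    // nstring is ordinary pageable stack memory, as in the original program.
    CHECK(cudaMemcpyAsync(nstring, dstring, size, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dstring));
    return 0;
}
```

If one of the calls fails, the message on stderr should say which line it was and why.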

I think cudaMemcpyAsync only works with page-locked host memory, i.e. host memory allocated by cudaMallocHost.
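As a rough sketch of what that change could look like, here is the device-to-host part of the program with nstring allocated through cudaMallocHost instead of living on the stack. The variable names follow the original post; the rest is my assumption about how one might restructure it, and error checking is left out for brevity.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int hstring[4] = {1, 2, 3, 4};
    int size = 4 * sizeof(int);
    int *dstring, *nstring, i;
    cudaStream_t stream;

    cudaMalloc((void **)&dstring, size);
    // Page-locked (pinned) host buffer instead of a stack array.
    cudaMallocHost((void **)&nstring, size);

    cudaMemcpy(dstring, hstring, size, cudaMemcpyHostToDevice);

    cudaStreamCreate(&stream);
    cudaMemcpyAsync(nstring, dstring, size, cudaMemcpyDeviceToHost, stream);
    // The host must not read nstring until the stream has finished.
    cudaStreamSynchronize(stream);

    for (i = 0; i < 4; i++)
        printf("nstring[%d]=%d\n", i, nstring[i]);

    cudaStreamDestroy(stream);
    cudaFreeHost(nstring);   // pinned memory is released with cudaFreeHost
    cudaFree(dstring);
    return 0;
}
```

Note that memory from cudaMallocHost has to be released with cudaFreeHost, not free or cudaFree.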

Thank you, SPWorley! :heart:

By adding output before creating the stream and doing the copy:

for (i=0; i<4; i++)
    printf("old nstring[%d]=%d\n", i, nstring[i]);

I found that a cudaMemcpyAsync call with cudaMemcpyDeviceToHost didn’t change the memory on the host at all, no matter how many streams one creates.

However, an async copy from host to device does work, with one stream as well as with multiple streams.

I was planning to do it like this. In each iteration:

stream[0] copies data n+2 from host to device;

stream[1] updates data n+1 on device;

stream[2] copies updated data n from device to host
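The plan above could be sketched roughly as follows. Everything beyond the three-stream idea is an assumption for illustration: the update kernel, the chunk count and size, the three rotating device buffers, and the choice to synchronize all three streams at the end of every iteration so the next iteration never touches a buffer that is still in use. The host buffer is allocated with cudaMallocHost so that the async copies have page-locked memory to work with.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define NCHUNKS 8      // assumed number of data chunks
#define CHUNK   1024   // assumed elements per chunk

// Placeholder "update" step: add 1 to every element of a chunk.
__global__ void update(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    int *h_data, *d_buf[3];
    cudaStream_t stream[3];
    size_t bytes = CHUNK * sizeof(int);
    int n, k, errors = 0;

    // Pinned host memory for the async copies; three device buffers that rotate.
    cudaMallocHost((void **)&h_data, NCHUNKS * bytes);
    for (k = 0; k < 3; k++) {
        cudaMalloc((void **)&d_buf[k], bytes);
        cudaStreamCreate(&stream[k]);
    }
    for (k = 0; k < NCHUNKS * CHUNK; k++) h_data[k] = k;

    // n starts at -2 so the pipeline fills before the first copy back.
    for (n = -2; n < NCHUNKS; n++) {
        int in = n + 2, up = n + 1, out = n;

        if (in < NCHUNKS)               // stream[0]: chunk n+2, host -> device
            cudaMemcpyAsync(d_buf[in % 3], h_data + in * CHUNK, bytes,
                            cudaMemcpyHostToDevice, stream[0]);
        if (up >= 0 && up < NCHUNKS)    // stream[1]: update chunk n+1 on device
            update<<<(CHUNK + 255) / 256, 256, 0, stream[1]>>>(d_buf[up % 3], CHUNK);
        if (out >= 0)                   // stream[2]: chunk n, device -> host
            cudaMemcpyAsync(h_data + out * CHUNK, d_buf[out % 3], bytes,
                            cudaMemcpyDeviceToHost, stream[2]);

        // Wait for all three stages before the buffers rotate.
        for (k = 0; k < 3; k++) cudaStreamSynchronize(stream[k]);
    }

    // Every element should have been incremented exactly once.
    for (k = 0; k < NCHUNKS * CHUNK; k++)
        if (h_data[k] != k + 1) errors++;
    printf("mismatches: %d\n", errors);

    for (k = 0; k < 3; k++) {
        cudaStreamDestroy(stream[k]);
        cudaFree(d_buf[k]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```

The three stages in one iteration always work on three different device buffers, so they have no dependencies on each other inside the iteration; the per-iteration synchronization is the simple (not necessarily the fastest) way to order one iteration after the previous one.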

Now it seems to me that stream[2] would never work… :no:

Thank you so much, MisterAnderson42! :clap:

It does work!!!

I should read the guide more carefully :blush:
