kechn
July 11, 2008, 12:58pm
1
Hi, I have some problems using streams. Could someone please help?
I was trying a simple program and found that a single stream didn’t work (two or more streams did work, though). Here’s my code:
int main()
{
int hstring[4]={1, 2, 3, 4}, nstring[8];
int size=4*sizeof(int);
int *dstring, i;
cudaStream_t stream[1];
cudaMalloc((void **) &dstring, size);
cudaMemcpy(dstring, hstring, size, cudaMemcpyHostToDevice);
cudaStreamCreate(stream);
cudaMemcpyAsync(nstring, dstring, size, cudaMemcpyDeviceToHost, stream[0]);
cudaError_t ret=cudaStreamSynchronize(stream[0]);
cudaMemcpy(nstring+4, dstring, size, cudaMemcpyDeviceToHost);
cudaFree(dstring);
for (i=0; i<8; i++)
printf("nstring[%d]=%d\n", i, nstring[i]);
return 0;
}
Here’s what I got:
nstring[0]=134515122
nstring[1]=134520512
nstring[2]=10178336
nstring[3]=-1077008584
nstring[4]=1
nstring[5]=2
nstring[6]=3
nstring[7]=4
So the data on the device is correct (since nstring[4] to nstring[7] are correct), but why can’t stream[0] transfer it from device to host?
Given that one stream can complete an independent task, can I create 3 streams in every iteration of a for loop and let them do 3 different tasks?
Thank you for the help!
I don’t have an answer, but your code is clear (great example!). I compiled it on Windows XP, SDK 2.0B2, using a compute capability 1.1 device, and got the same results as you.
Changing the Async call to a regular cudaMemcpy works as expected. I rearranged some lines and even put in a manual delay of many seconds, and the transfer still failed. Since it fails even with a delay, it is not a race condition.
As for your question about using 3 streams in a loop, sure, that should work. Did you pick 3 because your task has computations in stage N that depend on the results from stage N-2?
You should check for error values returned from cudaMemcpyAsync.
I think cudaMemcpyAsync only works with page-locked host memory allocated by cudaMallocHost.
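To make the two suggestions above concrete, here is a minimal sketch that checks every return value and allocates the destination host buffer with cudaMallocHost instead of using a stack array. The CHECK macro is a hypothetical helper written for this example, not part of the CUDA runtime, and the variable names simply follow the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): report a failing call and exit main.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            return 1;                                            \
        }                                                        \
    } while (0)

int main()
{
    const int n = 4;
    const size_t size = n * sizeof(int);
    int hstring[n] = {1, 2, 3, 4};
    int *nstring = NULL;  // page-locked host buffer for the async copy
    int *dstring = NULL;  // device buffer
    cudaStream_t stream;

    // Page-locked host memory, as suggested in the reply above.
    CHECK(cudaMallocHost((void **)&nstring, size));
    CHECK(cudaMalloc((void **)&dstring, size));
    CHECK(cudaMemcpy(dstring, hstring, size, cudaMemcpyHostToDevice));

    CHECK(cudaStreamCreate(&stream));
    // The return value is checked here, so a rejected copy is reported
    // instead of silently leaving nstring untouched.
    CHECK(cudaMemcpyAsync(nstring, dstring, size,
                          cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    for (int i = 0; i < n; i++)
        printf("nstring[%d]=%d\n", i, nstring[i]);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dstring));
    CHECK(cudaFreeHost(nstring));
    return 0;
}
```

Memory obtained from cudaMallocHost has to be released with cudaFreeHost rather than free.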
kechn
July 11, 2008, 6:54pm
4
Thank you, SPWorley!
By adding output before creating the stream and doing the copy:
for (i=0; i<4; i++)
printf("old nstring[%d]=%d\n", i, nstring[i]);
I found that cudaMemcpyAsync(___, ___, ___, cudaMemcpyDeviceToHost, ___) didn’t change the host memory at all, no matter how many streams I create.
However, a single stream (and multiple streams) does work for an async copy from host to device.
I was planning to do the following in each iteration:
stream[0] copies data n+2 from host to device;
stream[1] updates data n+1 on device;
stream[2] copies updated data n from device to host
Now it seems to me that stream[2] would never work…
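For reference, a pipeline like this is often arranged differently from the per-stage streams listed above: each chunk of data gets its own stream, and the copy in, the update, and the copy out for that chunk are all issued on that stream, so their order is guaranteed without extra synchronization. The sketch below assumes that arrangement. The update kernel, the CHECK macro, and the chunk sizes are all hypothetical, and the host buffer is page-locked. Whether the copies actually overlap with the kernels depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): report a failing call and exit main.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            return 1;                                            \
        }                                                        \
    } while (0)

// Hypothetical update step: add 1 to every element of one chunk.
__global__ void update(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main()
{
    const int kStreams = 3;      // assumed number of streams
    const int kChunks = 6;       // assumed number of data chunks
    const int kChunkLen = 1024;  // assumed elements per chunk
    const size_t chunkBytes = kChunkLen * sizeof(int);

    int *host = NULL;  // page-locked host buffer holding all chunks
    int *dev = NULL;   // device buffer holding all chunks
    cudaStream_t stream[kStreams];

    CHECK(cudaMallocHost((void **)&host, kChunks * chunkBytes));
    CHECK(cudaMalloc((void **)&dev, kChunks * chunkBytes));
    for (int i = 0; i < kChunks * kChunkLen; i++)
        host[i] = i;
    for (int s = 0; s < kStreams; s++)
        CHECK(cudaStreamCreate(&stream[s]));

    // Chunk c runs entirely on stream c % kStreams: copy in, update, copy out.
    for (int c = 0; c < kChunks; c++) {
        cudaStream_t s = stream[c % kStreams];
        int *h = host + c * kChunkLen;
        int *d = dev + c * kChunkLen;
        CHECK(cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, s));
        update<<<(kChunkLen + 255) / 256, 256, 0, s>>>(d, kChunkLen);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, s));
    }
    for (int s = 0; s < kStreams; s++)
        CHECK(cudaStreamSynchronize(stream[s]));

    // Every element should now be its original value plus one.
    int mismatches = 0;
    for (int i = 0; i < kChunks * kChunkLen; i++)
        if (host[i] != i + 1)
            mismatches++;
    printf("mismatches: %d\n", mismatches);

    for (int s = 0; s < kStreams; s++)
        CHECK(cudaStreamDestroy(stream[s]));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return mismatches != 0;
}
```

Each chunk has its own region of the device buffer here, so no chunk has to wait for another one to free up device memory.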
kechn
July 11, 2008, 7:02pm
5
Thank you so much, MisterAnderson42!
It does work!!!
I should read the guide more carefully :blush: