Problem using streams: can't get more than one stream to work

lchu · October 3, 2008, 3:02am

I’m trying to use streams to run the same kernel on several sets of data, but can’t seem to get it to work when I set the number of streams to greater than one.

I’ve extracted the relevant portions of the code below.

// Setup event handles
cudaEvent_t start_event, stop_event;
CUDA_SAFE_CALL(cudaEventCreate(&start_event));
CUDA_SAFE_CALL(cudaEventCreate(&stop_event));
float total_time;

// allocate and initialize an array of stream handles
cudaStream_t *streams = (cudaStream_t*) malloc(bigBlocks * sizeof(cudaStream_t));
for (int i = 0; i < nstreams; i++)
{
    CUDA_SAFE_CALL(cudaStreamCreate(&(streams[i])));
}

// Copy to device
CUDA_SAFE_CALL(cudaMemcpy(Td,T,data_size*nstreams,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy(d,a,data_size*nstreams,cudaMemcpyHostToDevice));

cudaEventRecord(start_event,0);
for (int i = 0; i < nstreams; i++)
{
    // Process data
    cholesky_kernel<<<nBlocks,blockSize,0,streams[i]>>>(outputd+i*outsize,Td+i*data_size,d+i*data_size,padM);
    CUT_CHECK_ERROR("Kernel execution failed.");
}

CUDA_SAFE_CALL(cudaMemcpy(output,outputd,outsize*nstreams,cudaMemcpyDeviceToHost));
cudaEventRecord(stop_event,0);
cudaEventSynchronize(stop_event);
CUDA_SAFE_CALL(cudaEventElapsedTime(&total_time,start_event,stop_event));

When I run the above with nstreams > 1, the execution time is 0.000000, which seems to indicate that the kernel hasn’t launched. When nstreams = 1, the execution time is ~8 ms.

I can get simpleStreams from the SDK to run fine.

Is there something I’m missing? Are there limits to the number of streams that can run at one time?

lchu · October 3, 2008, 3:21am

I think my previous post was in the wrong forum…

In any case, my bad. I was offsetting the pointers into the matrices incorrectly: incrementing by the size in bytes instead of by the number of elements.

Quoc_Vinh · October 3, 2008, 3:59am

For asynchronous copies, the host data must be allocated as page-locked (pinned) memory with

cudaMallocHost((void**)&T, data_size*nstreams);
cudaMallocHost((void**)&a, data_size*nstreams);

// Copy to device
CUDA_SAFE_CALL(cudaMemcpy(Td,T,data_size*nstreams,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy(d,a,data_size*nstreams,cudaMemcpyHostToDevice));

In this case, use cudaMemcpyAsync() instead so the copies can be issued into streams.

If you want to use only one stream for copying data from host to device:

cudaMemcpyAsync(Td,T,data_size*nstreams,cudaMemcpyHostToDevice, streams[0]);
cudaMemcpyAsync(d,a,data_size*nstreams,cudaMemcpyHostToDevice, streams[0]);

If you use more than one stream:

for (int i = 0; i < nstreams; i++)
{
    cudaMemcpyAsync(Td + i*numOfElementsPerStream, T + i*numOfElementsPerStream, data_size, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(d + i*numOfElementsPerStream, a + i*numOfElementsPerStream, data_size, cudaMemcpyHostToDevice, streams[i]);
}

numOfElementsPerStream is the number of elements handled by one stream. In your case, numOfElementsPerStream = data_size/sizeof(Type), where Type is the element type of the arrays.

good luck to you :)

AFlare1 · October 8, 2008, 5:28pm

I’ve got a similar problem, in that the streams don’t seem to be working.

http://forums.nvidia.com/index.php?showtopic=79052

Can anyone provide any help?

Cheers