Problem using streams: can't get more than one stream to work

I’m trying to use streams to run the same kernel on several sets of data, but can’t seem to get it to work when I set the number of streams to greater than one.

I’ve extracted the relevant portions of the code below.

  // Setup event handles
  cudaEvent_t start_event, stop_event;
  CUDA_SAFE_CALL(cudaEventCreate(&start_event));
  CUDA_SAFE_CALL(cudaEventCreate(&stop_event));
  float total_time;

  // Allocate and initialize an array of stream handles
  cudaStream_t *streams = (cudaStream_t*) malloc(bigBlocks * sizeof(cudaStream_t));
  for (int i = 0; i < nstreams; i++)
  {
    CUDA_SAFE_CALL(cudaStreamCreate(&(streams[i])));
  }

  // Copy to device
  CUDA_SAFE_CALL(cudaMemcpy(Td,T,data_size*nstreams,cudaMemcpyHostToDevice));
  CUDA_SAFE_CALL(cudaMemcpy(d,a,data_size*nstreams,cudaMemcpyHostToDevice));

  cudaEventRecord(start_event,0);

  for (int i = 0; i < nstreams; i++)
  {
    // Process data
    cholesky_kernel<<<nBlocks,blockSize,0,streams[i]>>>(outputd+i*outsize,Td+i*data_size,d+i*data_size,padM);
    CUT_CHECK_ERROR("Kernel execution failed.");
  }

  CUDA_SAFE_CALL(cudaMemcpy(output,outputd,outsize*nstreams,cudaMemcpyDeviceToHost));

  cudaEventRecord(stop_event,0);
  cudaEventSynchronize(stop_event);
  CUDA_SAFE_CALL(cudaEventElapsedTime(&total_time,start_event,stop_event));

When I run the above with nstreams > 1, the reported execution time is 0.000000, which seems to indicate that the kernel hasn’t launched. When nstreams = 1, the execution time is ~8 ms.

I can get simpleStreams from the SDK to run fine.

Is there something I’m missing? Are there limits to the number of streams that can run at one time?

I think my previous post was in the wrong forum…

In any case, my bad. I was offsetting the pointers into the matrices incorrectly: I incremented them by the chunk size in bytes instead of by the number of elements.
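To make the mistake concrete, here is a minimal host-only sketch of the pointer arithmetic involved. The helpers chunk_ptr and byte_distance are hypothetical names made up for illustration; they are not part of the original code or of the CUDA API, and the sketch assumes every stream handles the same number of elements.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original post): pointer to the start of
// chunk `i` when every chunk holds `elems_per_chunk` elements. Arithmetic on
// a typed pointer is already scaled by sizeof(T), so the offset has to be
// expressed in elements, not in bytes.
template <typename T>
T* chunk_ptr(T* base, std::size_t i, std::size_t elems_per_chunk) {
    return base + i * elems_per_chunk;
}

// Distance in bytes between two pointers into the same array.
template <typename T>
std::size_t byte_distance(const T* from, const T* to) {
    return static_cast<std::size_t>(reinterpret_cast<const char*>(to) -
                                    reinterpret_cast<const char*>(from));
}
```

With float data, writing Td + i*data_size (data_size in bytes) moves the pointer sizeof(float) times too far for each stream, whereas chunk_ptr(Td, i, data_size / sizeof(float)) lands exactly data_size bytes further per stream.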

The host data must be allocated as page-locked memory with cudaMallocHost:

cudaMallocHost((void**)&T, data_size*nstreams);

cudaMallocHost((void**)&a, data_size*nstreams);

Instead of the blocking copies

// Copy to device
CUDA_SAFE_CALL(cudaMemcpy(Td,T,data_size*nstreams,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy(d,a,data_size*nstreams,cudaMemcpyHostToDevice));

you must use cudaMemcpyAsync().

If you want to use only one stream for copying data from host to device:

cudaMemcpyAsync(Td,T,data_size*nstreams,cudaMemcpyHostToDevice, streams[0]);

cudaMemcpyAsync(d,a,data_size*nstreams,cudaMemcpyHostToDevice, streams[0]);

If you use several streams:

for (int i = 0; i < nstreams; i++)
{
  cudaMemcpyAsync(Td + i*numOfElementsPerStream, T + i*numOfElementsPerStream, data_size, cudaMemcpyHostToDevice, streams[i]);
  cudaMemcpyAsync(d + i*numOfElementsPerStream, a + i*numOfElementsPerStream, data_size, cudaMemcpyHostToDevice, streams[i]);
}

numOfElementsPerStream is the number of elements processed by one stream.

In your case, numOfElementsPerStream = data_size/sizeof(Type), where Type is the element type of your arrays.
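As a rough sketch of that bookkeeping, the per-stream offsets and byte counts could be computed once on the host before issuing the copies. StreamChunk and plan_chunks are hypothetical names introduced here for illustration; the sketch assumes data_size is the size in bytes of one stream's slice and that it is a whole multiple of the element size.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical description of the slice handled by one stream.
struct StreamChunk {
    std::size_t offset_elems;  // add this to the typed pointers (T, Td, a, d)
    std::size_t bytes;         // pass this as the byte count of the copy
};

// Assumption: every stream gets a slice of data_size bytes.
std::vector<StreamChunk> plan_chunks(std::size_t data_size,
                                     std::size_t elem_size,
                                     std::size_t nstreams) {
    if (elem_size == 0 || data_size % elem_size != 0) {
        throw std::invalid_argument("data_size must be a multiple of elem_size");
    }
    const std::size_t numOfElementsPerStream = data_size / elem_size;
    std::vector<StreamChunk> chunks;
    chunks.reserve(nstreams);
    for (std::size_t i = 0; i < nstreams; ++i) {
        chunks.push_back({i * numOfElementsPerStream, data_size});
    }
    return chunks;
}

// Intended use inside the copy loop (needs the CUDA runtime, shown as a comment):
//   cudaMemcpyAsync(Td + c.offset_elems, T + c.offset_elems, c.bytes,
//                   cudaMemcpyHostToDevice, streams[i]);
```

Keeping the element offset and the byte count in separate fields makes it harder to pass one where the other is expected.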

good luck to you :)

I’ve got a similar problem, in that the streams don’t seem to be working.

http://forums.nvidia.com/index.php?showtopic=79052

Can anyone provide any help?

Cheers