Questions on Stream Management

I need to process a large amount of data, and if I process it all at once the program crashes. So I followed the Stream Management section in the Programming Guide to divide the data into several streams and process a small portion at a time. It works if the number of streams is small, but fails when it becomes large, say 8. :(

Thanks for answering. :D

I think you will have to explain better what you are doing, give some code examples, etc. before anybody can make an educated guess at what is wrong.

I believe what he’s trying to say is that if you increase the number of streams in the CUDA SDK streams example from 2 to 8, it behaves oddly. I’ve noticed this as well. In fact, I’ve yet to find a way to reliably use streams in real-world scenarios and always end up reverting to a custom queuing scheme. Further, cudaStreamQuery and cudaEventQuery are unreliable with more than 2 events or in programs with more than one thread. I’ve submitted numerous reproduction examples to nVidia.
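For reference, the pattern I mean is just the documented query loop; this is a sketch with placeholder buffers (devBuf, hostBuf, bytes are not from any real application):

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaEvent_t done;
cudaEventCreate(&done);

// queue asynchronous work in the stream, then mark the end of it with an event
cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream); // devBuf, hostBuf, bytes: placeholders
cudaEventRecord(done, stream);

// poll; cudaErrorNotReady means the queued work has not finished yet
while (cudaEventQuery(done) == cudaErrorNotReady)
{
    // do other host-side work here
}

cudaEventDestroy(done);
cudaStreamDestroy(stream);

With more than a couple of events in flight, or with multiple host threads, this is where I see the queries misbehave.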

-Patrick

I get the same situation. :(

My program only works with 512 elements * 4 streams.

If the number of elements per stream is greater than 512, my program doesn’t work.

This is my code.

#include <stdio.h>
#include <cutil.h>

#define DATA 512 // only works when DATA is 512 or smaller

extern "C"
__global__ void CudaStreamKernel(unsigned int *inputArray, unsigned int numOfElements)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int totalThreads = gridDim.x * blockDim.x;
    unsigned int numOfProcessElements = numOfElements / totalThreads + 1;

    for (unsigned int i = 0; i < numOfProcessElements; i++)
    {
        unsigned int tempIndex = idx + totalThreads * i;
        if (tempIndex <= numOfElements)
        {
            inputArray[tempIndex] = 123;
        }
    }
}

//----------------------------------------------------------------------------------

void CudaStream(unsigned int multipleOf1MB,
                unsigned int numOfStreams, /* 4 */
                unsigned int numOfThreads, /* 512 */
                unsigned int numOfBLocks)  /* 1024 */
{
    unsigned int numOfElements = DATA;
    unsigned int size = numOfElements * sizeof(unsigned int);
    unsigned int streamSize = size / numOfStreams;

    unsigned int *hostPtr = 0;
    cudaError_t hostAllocate = cudaMallocHost((void**)&hostPtr, size);

    unsigned int *devicePtr;
    cudaError_t deviceAllocate = cudaMalloc((void**)&devicePtr, size);
    if (deviceAllocate == cudaSuccess)
    {
        cudaMemset(devicePtr, 0, size);
    }

    // create streams
    cudaStream_t *streams = (cudaStream_t*) malloc(numOfStreams * sizeof(cudaStream_t));
    cudaError_t streamCreate;
    for (unsigned int i = 0; i < numOfStreams; i++)
    {
        streamCreate = cudaStreamCreate(&(streams[i]));
    }

    // asynchronously copy one chunk per stream to the device
    for (unsigned int i = 0; i < numOfStreams; ++i)
    {
        cudaMemcpyAsync(devicePtr + i * numOfElements / numOfStreams,
                        hostPtr + i * numOfElements / numOfStreams,
                        streamSize,
                        cudaMemcpyHostToDevice,
                        streams[i]);
    }

    // launch one kernel per stream
    dim3 dimBlock(numOfThreads, 1, 1);
    dim3 dimGrid(numOfBLocks, 1, 1);
    if (streamCreate == cudaSuccess)
    {
        for (unsigned int i = 0; i < numOfStreams; i++)
        {
            CudaStreamKernel<<< dimGrid, dimBlock, 0, streams[i] >>>(devicePtr + i * streamSize, numOfElements);
        }
    }

    // copy the results back, one chunk per stream
    if (streamCreate == cudaSuccess)
    {
        for (unsigned int i = 0; i < numOfStreams; i++)
        {
            cudaMemcpyAsync(hostPtr + i * numOfElements / numOfStreams,
                            devicePtr + i * numOfElements / numOfStreams,
                            streamSize,
                            cudaMemcpyDeviceToHost,
                            streams[i]);
        }
    }

    // wait for all streams, then clean up
    cudaThreadSynchronize();
    for (unsigned int i = 0; i < numOfStreams; i++)
    {
        cudaStreamDestroy(streams[i]);
    }
    if (hostAllocate == cudaSuccess)
    {
        cudaFreeHost(hostPtr);
    }
    if (deviceAllocate == cudaSuccess)
    {
        cudaFree(devicePtr);
    }
}

Are you sure you are not trying to use more than 512 threads per block? I see no error-checking for the kernel call whatsoever, so my guess is that you are trying to launch more than 512 threads per block.
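Something along these lines right after the launch would tell you. It is only a debugging sketch (the cudaThreadSynchronize serializes the streams, so you would not leave it in), using nothing beyond cudaGetLastError and cudaGetErrorString:

CudaStreamKernel<<< dimGrid, dimBlock, 0, streams[i] >>>(devicePtr + i * streamSize, numOfElements);

// catches launch-configuration errors, e.g. too many threads per block
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(launchErr));

// catches errors that only show up while the kernel runs, e.g. out-of-bounds accesses
cudaError_t execErr = cudaThreadSynchronize();
if (execErr != cudaSuccess)
    printf("kernel execution failed: %s\n", cudaGetErrorString(execErr));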

Thanks, E.D. Riedijk. :D

I am sure that I was not trying to use more than 512 threads per block.

This program is for my testing.

The thread count increases from 2 to 512,

the block count from 2 to 2048,

and the stream count from 2 to 64. Everything is OK if the number of elements processed by a stream is smaller than 512.

[The numbers of threads, blocks, and streams are powers of 2.]

And when I used more than 512 elements per stream, the kernel could not launch:

“unspecified launch failure”.

And does it work when not using streams? Because an unspecified launch failure usually means that your kernel accesses memory beyond the allocated memory.

Thank you, E.D. Riedijk, for your quick reply. :D

It works with 1 to 64 streams [64 is the largest number I tested],

but the number of elements per stream cannot be larger than 512.

This program was designed for learning about streams, so I have not tested the case without streams.

But I think my code is simple and easy to understand.

Well, you cannot claim you have trouble with streams if in fact your kernel code is buggy. That is something you can check when you skip using streams for a second.

I am not 100% sure, but this line

if(tempIndex <= numOfElements)

Should probably read

if(tempIndex < numOfElements)
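To check both points at once, you could run the same kernel once without any streams. A minimal sketch (assuming the same DATA value, includes, and dimGrid/dimBlock launch configuration as in your code):

unsigned int numOfElements = DATA;
unsigned int size = numOfElements * sizeof(unsigned int);

unsigned int *hostPtr = (unsigned int*) malloc(size);
unsigned int *devicePtr;
cudaMalloc((void**)&devicePtr, size);

// plain synchronous copies in the default stream, no cudaMemcpyAsync involved
cudaMemcpy(devicePtr, hostPtr, size, cudaMemcpyHostToDevice);

CudaStreamKernel<<< dimGrid, dimBlock >>>(devicePtr, numOfElements);
printf("kernel: %s\n", cudaGetErrorString(cudaThreadSynchronize()));

cudaMemcpy(hostPtr, devicePtr, size, cudaMemcpyDeviceToHost);

cudaFree(devicePtr);
free(hostPtr);

If that already reports an unspecified launch failure, the problem is in the kernel, not in the stream handling.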

Thank you, E.D. Riedijk. :D

My kernel function had a problem with the thread index and block index.

I didn’t know how to solve that problem, so I abandoned this program and wrote a new one, and it works. :D
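(The new program was not posted. For anyone hitting the same wall, the indexing fix presumably amounts to something like the following sketch: a plain grid-stride loop with a strict “<” bound, where each stream’s launch is given only the element count of its own chunk, i.e. numOfElements / numOfStreams.)

__global__ void CudaStreamKernelFixed(unsigned int *inputArray, unsigned int numOfElements)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int totalThreads = gridDim.x * blockDim.x;

    // grid-stride loop: strictly less-than, so no thread writes past the end of the chunk
    for (unsigned int i = idx; i < numOfElements; i += totalThreads)
    {
        inputArray[i] = 123;
    }
}

(On the host side, the per-stream offset in the posted code, devicePtr + i * streamSize, mixes a byte count with element-sized pointer arithmetic; an element offset such as devicePtr + i * (numOfElements / numOfStreams), as already used in the cudaMemcpyAsync calls, stays inside the allocation.)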

Getting this thread back on track with what the original poster was asking: when does nVidia plan to address the deficiencies in cudaStream_t and cudaEvent_t? The simplest reproduction case is to take the CUDA SDK streams example and set the number of streams to 8. From debugging the example, it seems to be data mismanagement or a race condition of sorts in the CUDA runtime, since the driver responds correctly to its (the runtime library’s) ioctl calls.

-Patrick

Well, I would say, pm an NVIDIA guy with your modified example, or did they already say it is a bug?

Obviously, getting an issue of this nature resolved requires modification to the CUDA runtime. So, taking the obvious course of action, I emailed two different contacts at nVidia (about two weeks prior to the original post in this thread) to report that I had found said deficiency, along with a reproduction case, a fairly detailed system-call trace, and a fairly detailed step-by-step memory dump of the affected regions (of the runtime and the respective driver variables). This course of action falls under the “Captain Obvious” realm of dealing with product support at any software company that has ever existed on the planet.

You have the official title of “CUDA Forums Captain Obvious”. You post completely obvious comments on everyone’s threads that have absolutely no value in resolving anything or aiding anyone. Further, you sidetrack legitimate threads with banter instead of taking it to a different thread, like this one.

When I first saw this thread, I saw that it lacked certain details and was unaddressed, and I had encountered a similar bug. Therefore, by posting, I thought the following could be accomplished:

  1. Put another user at ease that the issue has been reported and that he’s not doing something wrong, as I’ve encountered this in almost every decently complex CUDA-enabled program.

  2. Further point out to the people I’ve submitted the bug to (they read the forums) that maybe the priority placed on resolving this bug should be raised.

-Patrick

Ah, Captain Unfriendly spreading negativity. Whatever the cause, this is a bad attitude and I don’t like it much.

Christian

Patrick,

I am not sure what your problem is, but this thread started with a somewhat vague bug report. I pointed that out. Afterwards, someone else reported a bug (with code), and I tried to help them.

You may call it obvious, but that is all I can do. I am not someone who makes system-call traces, memory dumps, and such; I would have to learn how to first.

What I do is help when I can.

edit: I just re-read the thread and noticed you posted that you had submitted numerous bug reports. Sorry that I missed/forgot that point. My remark about the PM was therefore stupid.

Dènis