Streams, Host Threads, and the Profiler in CUDA 5.0

My program crashes when I run it in the Visual Profiler if I create more than about 24 CUDA streams. It works fine otherwise. The program has several host-side threads, which are necessary to achieve the desired frame rate, even though I am expecting a big boost from the GPU.

The following code fragment is stripped down to illustrate the point. It just creates and destroys streams. Obviously, I would like to have useful code between the create and destroy, but this code alone makes the program crash. Strangely, it crashes in seemingly unrelated parts of the code. Also strangely, it crashes only when I am running the Visual Profiler. The application is 32-bit; everything is CUDA 5.0. The OS is Windows 7 SP1 with current updates. The driver is 9.18.13.1106.

for( int ii=0; ii < 10; ++ii )
{
	cudaError_t errCode = cudaSuccess;
	errCode = cudaSetDevice( 0 );
	if( errCode != cudaSuccess )
		throw std::runtime_error( "could not cudaSetDevice" );
	cudaStream_t gpuStream=0;
	Trace( "creating cuda stream\n" );
	errCode = cudaStreamCreate( & gpuStream );
	if( errCode != cudaSuccess )
		throw std::runtime_error( "could not cudaStreamCreate" );
	if( gpuStream )
	{
		errCode = cudaStreamSynchronize( gpuStream );
		if( errCode != cudaSuccess )
			throw std::runtime_error( "could not cudaStreamSynchronize" );
		errCode = cudaStreamDestroy( gpuStream );
		if( errCode != cudaSuccess )
			throw std::runtime_error( "could not cudaStreamDestroy" );
		gpuStream = 0;
	}
}

It gets into trouble some time after about 24 calls to cudaStreamCreate. This is true whether all the calls come from the main thread or whether I create several streams in each of 4 worker threads. The error codes are all OK.
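For reference, the worker-thread variant described above could be sketched roughly as follows. This is not the original program: it assumes a C++11 compiler with std::thread (the threading API actually used is not shown here), and the names cycleStreams and stressStreams are made up for illustration. Unless USE_CUDA_RUNTIME is defined, the block uses stand-in stubs rather than the real CUDA runtime, so only the threading logic is exercised.

```cpp
// Sketch: several worker threads, each creating and destroying CUDA streams.
// Define USE_CUDA_RUNTIME (and build against the CUDA toolkit) to call the
// real runtime; otherwise the stand-in stubs below are used.
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <thread>
#include <vector>

#ifdef USE_CUDA_RUNTIME
#include <cuda_runtime.h>
#else
// Hypothetical stand-ins, NOT the real CUDA API; they only mimic its call shapes.
struct FakeStream {};
typedef FakeStream* cudaStream_t;
typedef int cudaError_t;
static const cudaError_t cudaSuccess = 0;
static cudaError_t cudaSetDevice(int) { return cudaSuccess; }
static cudaError_t cudaStreamCreate(cudaStream_t* s) { *s = new FakeStream; return cudaSuccess; }
static cudaError_t cudaStreamSynchronize(cudaStream_t) { return cudaSuccess; }
static cudaError_t cudaStreamDestroy(cudaStream_t s) { delete s; return cudaSuccess; }
#endif

// Creates, synchronizes, and destroys `count` streams on device 0, one at a time.
// Returns the number of streams that completed the full cycle.
int cycleStreams(int count)
{
    int done = 0;
    for (int ii = 0; ii < count; ++ii)
    {
        if (cudaSetDevice(0) != cudaSuccess)
            throw std::runtime_error("could not cudaSetDevice");
        cudaStream_t gpuStream = 0;
        if (cudaStreamCreate(&gpuStream) != cudaSuccess)
            throw std::runtime_error("could not cudaStreamCreate");
        if (cudaStreamSynchronize(gpuStream) != cudaSuccess)
            throw std::runtime_error("could not cudaStreamSynchronize");
        if (cudaStreamDestroy(gpuStream) != cudaSuccess)
            throw std::runtime_error("could not cudaStreamDestroy");
        ++done;
    }
    return done;
}

// Runs cycleStreams in `numThreads` worker threads and returns the total
// number of streams cycled. Throws if any worker saw a CUDA error.
int stressStreams(int numThreads, int streamsPerThread)
{
    std::atomic<int> total(0);
    std::atomic<bool> failed(false);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
    {
        workers.emplace_back([&total, &failed, streamsPerThread]() {
            try { total += cycleStreams(streamsPerThread); }
            catch (const std::exception&) { failed = true; }
        });
    }
    for (auto& w : workers)
        w.join();
    if (failed)
        throw std::runtime_error("a worker thread reported a CUDA error");
    return total;
}
```

Calling stressStreams( 4, 10 ) corresponds to the case of 4 worker threads creating several streams each; stressStreams( 1, 40 ) approximates doing everything from a single thread.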

What is wrong? I don’t see anything in the documentation that says I shouldn’t be able to do this.

I’m using CULA in a 64-bit environment, and CULA creates and destroys tens of streams; the profiler handles that just fine.
So maybe the problem is in another piece of code? Or could you try a 64-bit build?

eyal

We don’t have any known issues with the profiler and a large number of streams, but you may be hitting a corner case that we haven’t seen before. The CUDA 5.5 toolkit is now publicly available. Can you try the 5.5 Visual Profiler to see whether it works for your application?
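When retrying with 5.5, it may help to confirm which runtime and driver versions the application actually reports. The runtime's cudaRuntimeGetVersion and cudaDriverGetVersion return an integer encoded as 1000 * major + 10 * minor (5050 for CUDA 5.5). A minimal sketch of decoding that value is below; the helper names are made up for illustration, and the CUDA calls themselves are left out so the block compiles without the toolkit.

```cpp
// Sketch: decode a CUDA version integer (1000 * major + 10 * minor).
// The value would normally come from cudaRuntimeGetVersion() or
// cudaDriverGetVersion(); those calls are omitted here on purpose.
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper type, not part of the CUDA API.
struct CudaVersion
{
    int majorPart;
    int minorPart;
};

// Splits the encoded integer into its major and minor components.
CudaVersion decodeCudaVersion(int encoded)
{
    CudaVersion v;
    v.majorPart = encoded / 1000;
    v.minorPart = (encoded % 100) / 10;
    return v;
}

// Formats the encoded integer as "major.minor", for example for a log line.
std::string formatCudaVersion(int encoded)
{
    const CudaVersion v = decodeCudaVersion(encoded);
    std::ostringstream out;
    out << v.majorPart << '.' << v.minorPart;
    return out.str();
}
```

If the runtime version printed at startup is still 5.0 after installing the new toolkit, the application is probably still linked against the old runtime.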