cudaStream_t: what is the maximum number of streams that can be CREATED (not run at the same time)?

lehuyduc4 · October 18, 2022, 11:01am

Class with main problem

public:
    AlphaCorrEngine() {
        set_stateless();
        CUDA_CHECK(cudaStreamCreate(&stream_));
        mat.setStream(stream_);
    }

    ~AlphaCorrEngine() {
        // I have to comment out the line below for the CudaMat destructor to work correctly
        CUDA_CHECK(cudaStreamDestroy(stream_));
    }

private:
    CudaMat<float> mat;
    cudaStream_t stream_;

CudaMat class:

CudaMat() {}
~CudaMat() {
    // the line below fails if the AlphaCorrEngine destructor has already destroyed stream_
    cudaFreeAsync(data_, stream_);
    cudaStreamSynchronize(stream_);
}
void setStream(cudaStream_t stream) { stream_ = stream; }
void resize(int N) {
    cudaMallocAsync(&data_, N * sizeof(float), stream_);
}

private:
    cudaStream_t stream_;
    size_t N_;
    float* data_;

I have a class with a private cudaStream_t and a member variable that uses the same stream.
However, the destructor body of AlphaCorrEngine runs BEFORE the destructors of its class members. That body destroys the stream, which causes bugs in the member's destructor.
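The ordering described above can be reproduced without a GPU. What follows is only a minimal sketch with stand-in types: `Owner`, `Member`, `fake_stream_create`, and `fake_stream_destroy` are hypothetical names that play the roles of AlphaCorrEngine, CudaMat, cudaStreamCreate, and cudaStreamDestroy, and the "runtime" is just a set of live handles.

```cpp
#include <cassert>
#include <set>

// Stand-in for the CUDA runtime: the set of stream handles that currently exist.
static std::set<int> g_live_streams;
// Written by ~Member so the outcome can be inspected afterwards.
static bool g_member_saw_live_stream = false;

// Hypothetical replacement for cudaStreamCreate.
int fake_stream_create() {
    static int next = 1;
    g_live_streams.insert(next);
    return next++;
}

// Hypothetical replacement for cudaStreamDestroy.
void fake_stream_destroy(int s) {
    assert(g_live_streams.count(s) == 1);
    g_live_streams.erase(s);
}

// Stand-in for CudaMat: its destructor still wants to use the stream.
struct Member {
    int stream_ = 0;
    ~Member() { g_member_saw_live_stream = g_live_streams.count(stream_) > 0; }
};

// Stand-in for AlphaCorrEngine, with the same member layout as in the post.
struct Owner {
    Member mat;
    int stream_;
    Owner() : stream_(fake_stream_create()) { mat.stream_ = stream_; }
    ~Owner() { fake_stream_destroy(stream_); }  // this body runs before ~Member
};

// Creates and destroys one Owner, then reports what ~Member observed.
bool member_sees_live_stream() {
    { Owner o; }
    return g_member_saw_live_stream;
}
```

In this sketch `member_sees_live_stream()` returns false: by the time the member's destructor runs, the owner's destructor body has already removed the handle.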

These classes must have a cudaStream_t as a member variable, since they will be used as “job” objects, and many jobs are launched in my program.

My solution is to never call cudaStreamDestroy. Therefore, I want to know the maximum number of streams created with cudaStreamCreate that can exist at the same time. If there are too many streams, will it affect performance? Assume the number of streams is less than 10^6.

Thank you.

Robert_Crovella · October 18, 2022, 2:08pm

There is no published maximum; it is a function of resource utilization (eventually you may run out of resources and the stream creation may fail). If you have many streams, those streams will be assigned to a limited number of device connections (HW queues), in something like a round-robin fashion. Also see here.

lehuyduc4 · October 19, 2022, 2:34am

So in my case, how should I handle the destructor so that the stream is only destroyed after every object that uses it has been destroyed? Do I need to write my own smart-pointer-like class?

Robert_Crovella · October 19, 2022, 3:07pm

It’s difficult to answer a question like that with almost no information about the application. Without additional information, I would say that yes, this sounds like any other C++ resource create/destroy issue, and you should use a C++ mechanism that seems good to you. Reference counting may be one approach. Another could be the use of scopes, if you can determine/delineate the scope of the stream(s) usage.
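As one illustration of the reference-counting idea, here is a minimal sketch built on std::shared_ptr with a custom deleter. It is an assumption about how this could be wired up, not something taken from the application: `FakeStream`, `make_shared_stream`, `Owner`, and `Member` are hypothetical stand-ins, and the comments mark where real code would call cudaStreamCreate and cudaStreamDestroy.

```cpp
#include <cassert>
#include <memory>

static int g_live_streams = 0;                 // fake streams that currently exist
static bool g_member_saw_live_stream = false;  // written by ~Member

struct FakeStream {};  // stand-in for the object behind a cudaStream_t

// The stream is destroyed only when the last shared_ptr copy is released.
std::shared_ptr<FakeStream> make_shared_stream() {
    ++g_live_streams;  // real code: cudaStreamCreate
    return std::shared_ptr<FakeStream>(new FakeStream, [](FakeStream* s) {
        assert(g_live_streams > 0);
        --g_live_streams;  // real code: cudaStreamDestroy
        delete s;
    });
}

// Stand-in for CudaMat: holds its own reference to the stream.
struct Member {
    std::shared_ptr<FakeStream> stream_;
    ~Member() {
        // This body runs before stream_ is released, so the stream still exists.
        g_member_saw_live_stream = (stream_ != nullptr) && (g_live_streams > 0);
    }
};

// Stand-in for AlphaCorrEngine: no explicit destroy call in a destructor body.
struct Owner {
    Member mat;
    std::shared_ptr<FakeStream> stream_;
    Owner() : stream_(make_shared_stream()) { mat.stream_ = stream_; }
};

// True if ~Member saw a live stream and nothing leaked afterwards.
bool member_sees_live_stream_and_no_leak() {
    { Owner o; }
    return g_member_saw_live_stream && g_live_streams == 0;
}
```

Whichever holder is destroyed last triggers the deleter, so the destruction order of the owner and its members stops mattering. This only addresses lifetime; it does nothing to bound how many streams exist.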

I don’t see how any of that “solves the problem” without any additional information about the nature of your application. Those mechanisms, by themselves, do not prevent an application from creating an arbitrary number of streams. In the completely unbounded case, you are likely to still run into problems with creation of an arbitrarily large number of streams.

When I am teaching CUDA I usually make the statement that I believe it’s rarely necessary to have more than about 5 streams per thread. When teaching the multi-GPU DLI, I show the class how the problem can be refactored to use 3-4 streams instead of ~50, with no loss of performance. The basic idea here is reuse of streams. But there’s no way I could suggest that that makes any sense based on what you have shown.

lehuyduc4 · October 19, 2022, 3:18pm

In my case, each thread uses exactly 1 stream. But the problem is the application creates 1000s of threads (legacy code base, I can’t change), so I don’t know if it’ll cause any problem.

I found another solution, that is using try/catch like below:

try {
  CUDA_CHECK(cudaFreeAsync(data_ptr, stream_));
  CUDA_CHECK(cudaStreamSynchronize(stream_));
} catch (...) {
  CUDA_CHECK(cudaFree(data_ptr));
}

It’s guaranteed to be correct in my current usage. But is this bad code design?
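One general C++ point about this pattern: destructors are noexcept by default, so if a throwing check fires inside the catch block, the exception escapes the destructor and the program ends in std::terminate. A minimal sketch of a fallback that reports errors by return value instead is shown below. `Status`, `fake_free_async`, `fake_free_sync`, and `Buffer` are hypothetical stand-ins, and the sketch does not establish that the real cudaFreeAsync returns a clean error for a destroyed stream.

```cpp
#include <cassert>

enum class Status { Ok, InvalidStream };

static int g_async_frees = 0;  // how often the stream-ordered path succeeded
static int g_sync_frees = 0;   // how often the fallback path ran

// Stand-in for cudaFreeAsync: fails when the stream no longer exists.
Status fake_free_async(bool stream_alive) {
    if (!stream_alive) return Status::InvalidStream;
    ++g_async_frees;
    return Status::Ok;
}

// Stand-in for cudaFree.
Status fake_free_sync() {
    ++g_sync_frees;
    return Status::Ok;
}

struct Buffer {
    bool stream_alive = true;
    ~Buffer() {
        // No throwing macro here: nothing can escape the destructor.
        if (fake_free_async(stream_alive) != Status::Ok) {
            Status s = fake_free_sync();
            assert(s == Status::Ok);
            (void)s;  // keeps release builds free of unused-variable warnings
        }
    }
};

// Destroys one Buffer whose stream is already gone and counts fallback frees.
int fallback_count_after_dead_stream() {
    g_sync_frees = 0;
    {
        Buffer b;
        b.stream_alive = false;
    }
    return g_sync_frees;
}
```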

Robert_Crovella · October 19, 2022, 3:26pm

Having an unbounded number of stream creations sounds like a bad code design to me. That’s what I was trying to suggest in my previous post.

for (int i = 0; i < num_streams; i++) 
  cudaStreamCreate(&streams[i]);
for (int i = 0; i < num_threads; i++) 
  launch_thread(ptr, data[i], streams[i%num_streams]);
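The round-robin mapping above can be wrapped in a small fixed-size pool. This is a sketch under stated assumptions rather than a drop-in implementation: `StreamPool`, `for_worker`, and `workers_per_stream` are hypothetical names, and `FakeStream` stands in for cudaStream_t so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using FakeStream = int;  // stand-in for cudaStream_t

// Fixed-size pool: the stream count is bounded no matter how many workers exist.
class StreamPool {
public:
    explicit StreamPool(std::size_t num_streams) {
        assert(num_streams > 0);
        for (std::size_t i = 0; i < num_streams; ++i)
            streams_.push_back(static_cast<FakeStream>(i));  // real code: cudaStreamCreate
    }

    // Worker i always maps to the same stream, chosen round-robin.
    FakeStream for_worker(std::size_t i) const { return streams_[i % streams_.size()]; }

    std::size_t size() const { return streams_.size(); }

private:
    std::vector<FakeStream> streams_;
};

// Counts how many of num_workers workers land on each stream.
std::vector<int> workers_per_stream(std::size_t num_streams, std::size_t num_workers) {
    StreamPool pool(num_streams);
    std::vector<int> counts(num_streams, 0);
    for (std::size_t i = 0; i < num_workers; ++i)
        ++counts[static_cast<std::size_t>(pool.for_worker(i))];
    return counts;
}
```

With 4 streams and 1000 workers, each stream is shared by 250 workers. Because the pool outlives the jobs, individual job objects never need to destroy a stream themselves.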

lehuyduc4 · October 19, 2022, 3:31pm

Thank you for your suggestion! streams[i % num_streams] might be a good solution.

The code I’m working with has a lot of constraints that make it kinda hard to add GPU support, so I just have to choose the least bad design.
