cudaStream_t what is the maximum number of streams that can be CREATED (not run at the same time)

Class with main problem

public:
        AlphaCorrEngine() {
            set_stateless();
            CUDA_CHECK(cudaStreamCreate(&stream_));
            mat.setStream(stream_);
        }

        ~AlphaCorrEngine() {            
            // I have to comment out the line below for the CudaMat destructor to work correctly
            CUDA_CHECK(cudaStreamDestroy(stream_));
        }

private:
    CudaMat<float> mat;
    cudaStream_t stream_;

CudaMat class:

CudaMat() {}
~CudaMat() {
    // the line below fails if the AlphaCorrEngine destructor has already destroyed stream_
    cudaFreeAsync(data_, stream_);
    cudaStreamSynchronize(stream_);
}
void setStream(cudaStream_t stream) { stream_ = stream; }
void resize(int N) {
    cudaMallocAsync(&data_, N * sizeof(float), stream_);
}


private:
   cudaStream_t stream_;
   size_t N_;
   float* data_;

I have a class that has a private cudaStream_t, and a member variable that uses the same stream.
However, the body of the AlphaCorrEngine destructor runs BEFORE the destructors of its class members. The stream is destroyed in that destructor body, which causes bugs in the class member.

These classes must have a cudaStream_t as a member variable, since they will be used like “job” objects, and many jobs are launched in my program.

My solution is to never call cudaStreamDestroy. Therefore, I want to know the maximum number of streams created with cudaStreamCreate that can exist at the same time. If there are too many streams, will it affect performance? Assume the number of streams is less than 10^6.

Thank you.

There is no published maximum. It is a function of resource utilization (eventually you may run out of resources and the stream creation may fail). If you have many streams, those streams will be assigned to a limited number of device connections (HW queues), in something like a round-robin fashion. Also see here.

So in my case, how should I handle the destructor such that the stream is only destroyed after every object that uses it has been destroyed? Do I need to make my own smart-pointer-ish class?

It’s difficult to answer a question like that with almost no information about the application. With no additional information, then I would say that yes, this sounds like any other C++ resource create/destroy issue, and you should use a C++ mechanism that seems good to you. Reference counting may be one approach. Another could be use of scopes, if you can determine/delineate the scope of the stream(s) usage.
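As a rough illustration of the reference-counting idea, here is a minimal sketch using std::shared_ptr with a custom deleter. To keep it compilable without the CUDA toolkit it uses hypothetical stand-ins (FakeStream, fake_stream_create, fake_stream_destroy) where real code would use cudaStream_t, cudaStreamCreate and cudaStreamDestroy; the class names Mat and Engine are also just placeholders for the classes in the question. The point is only the ownership pattern: the stream is destroyed when the last owner releases it, so the member's destructor can still use it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for cudaStream_t, cudaStreamCreate and
// cudaStreamDestroy, so that the sketch builds without CUDA.
using FakeStream = int;
static std::vector<std::string> g_log;  // records the order of events

FakeStream* fake_stream_create() { g_log.push_back("create"); return new FakeStream(0); }
void fake_stream_destroy(FakeStream* s) { g_log.push_back("destroy"); delete s; }

// Shared handle: the custom deleter runs once, when the last owner is gone.
using StreamHandle = std::shared_ptr<FakeStream>;
StreamHandle make_stream() {
    return StreamHandle(fake_stream_create(), fake_stream_destroy);
}

class Mat {
public:
    ~Mat() {
        // Real code would call cudaFreeAsync and cudaStreamSynchronize here.
        // The stream is still alive because this object co-owns it.
        if (stream_) g_log.push_back("free");
    }
    void setStream(StreamHandle s) { stream_ = std::move(s); }
private:
    StreamHandle stream_;
};

class Engine {
public:
    Engine() : stream_(make_stream()) { mat_.setStream(stream_); }
    // No user-written destructor: each member releases its own share.
private:
    Mat mat_;
    StreamHandle stream_;
};

// Creates and destroys one Engine, then returns the recorded event order.
std::vector<std::string> run_demo() {
    g_log.clear();
    { Engine e; }  // e is destroyed at the end of this scope
    assert(!g_log.empty());
    return g_log;
}
```

With this arrangement the order of member declarations no longer matters for correctness, because whichever owner is destroyed last triggers the stream destruction. It does not limit how many streams get created, so it addresses the lifetime issue only.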

I don’t see how any of that “solves the problem” without any additional information about the nature of your application. Those mechanisms, by themselves, do not prevent an application from creating an arbitrary number of streams. In the completely unbounded case, you are likely to still run into problems with creation of an arbitrarily large number of streams.

When I am teaching CUDA I usually make the statement that I believe it’s rarely necessary to have more than about 5 streams per thread. When teaching the multi-GPU DLI, I show the class how the problem can be refactored to use 3-4 streams instead of ~50, with no loss of performance. The basic idea here is reuse of streams. But there’s no way I could suggest that that makes any sense based on what you have shown.


In my case, each thread uses exactly 1 stream. But the problem is the application creates 1000s of threads (legacy code base, I can’t change), so I don’t know if it’ll cause any problem.

I found another solution, that is using try/catch like below:

try {
  CUDA_CHECK(cudaFreeAsync(data_ptr, stream_));
  CUDA_CHECK(cudaStreamSynchronize(stream_));
} catch (...) {
  CUDA_CHECK(cudaFree(data_ptr));
}

It’s guaranteed to be correct in my current usage. But is this bad code design?

Having an unbounded number of stream creations sounds like a bad code design to me. That’s what I was trying to suggest in my previous post.

for (int i = 0; i < num_streams; i++) 
  cudaStreamCreate(&streams[i]);
for (int i = 0; i < num_threads; i++) 
  launch_thread(ptr, data[i], streams[i%num_streams]);
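
If the threads are created in many places rather than in one loop, the same reuse idea could be wrapped in a small shared pool that hands out streams round-robin. This is only a sketch under assumptions: StreamPool and StreamId are hypothetical names, and StreamId is an integer stand-in for cudaStream_t so the code builds without CUDA. A real version would call cudaStreamCreate in the constructor and cudaStreamDestroy in a destructor.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cudaStream_t.
using StreamId = int;

class StreamPool {
public:
    explicit StreamPool(std::size_t n) : streams_(n) {
        assert(n > 0);  // next() takes a modulo, so the pool must not be empty
        for (std::size_t i = 0; i < n; ++i) {
            // Real code: cudaStreamCreate(&streams_[i]);
            streams_[i] = static_cast<StreamId>(i);
        }
    }

    // Returns the next stream in round-robin order. The atomic counter makes
    // this safe to call from many threads at once.
    StreamId next() {
        std::size_t i = counter_.fetch_add(1, std::memory_order_relaxed);
        return streams_[i % streams_.size()];
    }

    std::size_t size() const { return streams_.size(); }

private:
    std::vector<StreamId> streams_;
    std::atomic<std::size_t> counter_{0};
};
```

Each thread would call next() once when it starts and keep that stream for its lifetime, so the number of streams stays fixed no matter how many threads the application creates.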

Thank you for your suggestion! streams[i % num_streams] might be a good solution.

The code I’m working with has a lot of constraints that make it kinda hard to add GPU support, so I just have to choose the least bad design.