Why does cudaStreamAddCallback serialize kernel execution and break concurrency?

I have some painful problems when I’m using cudaStreamAddCallback in my code. I managed some stream and use cudaStreamAddCallback to register callback which notify me when tasks assigned to the stream is completed.
My device is GTX 970 and my development environment is Visual Studio 2013 with CUDA 7.

The following code and image is my code without cudaStreamAddCallback and its timeline profiled by nsight. Everything looks good:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_call_check.h"

#define NUM_STREAM  32
#define ARRAY_SIZE  1000000

__global__ void kernel_add_one(float* point, unsigned int num)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

    for (size_t offset = idx; offset < num; offset += gridDim.x * blockDim.x)
    {
        point[offset] += 1;
    }
}

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *arg)
{

}

int main()
{
    float *dev_points[NUM_STREAM];
    float *host_points[NUM_STREAM];
    cudaStream_t streams[NUM_STREAM];

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaMalloc(dev_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaMallocHost(host_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaStreamCreateWithFlags(streams + i, cudaStreamNonBlocking));
        for (size_t j = 0; j < ARRAY_SIZE; ++j)
        {
            host_points[i][j] = static_cast<float>(i + j);
        }
    }

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaMemcpyAsync(dev_points[i], host_points[i], 
                                       ARRAY_SIZE * sizeof(float), 
                                       cudaMemcpyHostToDevice, streams[i]));
        kernel_add_one<<<1, 16, 0, streams[i]>>>(dev_points[i], ARRAY_SIZE);
        CUDA_SAFE_CALL(cudaMemcpyAsync(host_points[i], dev_points[i], 
                                       ARRAY_SIZE * sizeof(float), 
                                       cudaMemcpyDeviceToHost, streams[i]));
    }

    CUDA_SAFE_CALL(cudaDeviceSynchronize());

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaFree(dev_points[i]));
        CUDA_SAFE_CALL(cudaFreeHost(host_points[i]));
        CUDA_SAFE_CALL(cudaStreamDestroy(streams[i]));
    }
}

But if I using following codes, the kernel execution is serialized. The difference of the two code is about line 48.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_call_check.h"

#define NUM_STREAM  32
#define ARRAY_SIZE  1000000

__global__ void kernel_add_one(float* point, unsigned int num)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

    for (size_t offset = idx; offset < num; offset += gridDim.x * blockDim.x)
    {
        point[offset] += 1;
    }
}

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *arg)
{

}

int main()
{
    float *dev_points[NUM_STREAM];
    float *host_points[NUM_STREAM];
    cudaStream_t streams[NUM_STREAM];

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaMalloc(dev_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaMallocHost(host_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaStreamCreateWithFlags(streams + i, cudaStreamNonBlocking));
        for (size_t j = 0; j < ARRAY_SIZE; ++j)
        {
            host_points[i][j] = static_cast<float>(i + j);
        }
    }

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaMemcpyAsync(dev_points[i], host_points[i], 
                                       ARRAY_SIZE * sizeof(float), 
                                       cudaMemcpyHostToDevice, streams[i]));
        kernel_add_one<<<1, 16, 0, streams[i]>>>(dev_points[i], ARRAY_SIZE);
        CUDA_SAFE_CALL(cudaMemcpyAsync(host_points[i], dev_points[i], 
                                       ARRAY_SIZE * sizeof(float), 
                                       cudaMemcpyDeviceToHost, streams[i]));
        CUDA_SAFE_CALL(cudaStreamAddCallback(streams[i], callback, NULL, 0));
    }

    CUDA_SAFE_CALL(cudaDeviceSynchronize());

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaFree(dev_points[i]));
        CUDA_SAFE_CALL(cudaFreeHost(host_points[i]));
        CUDA_SAFE_CALL(cudaStreamDestroy(streams[i]));
    }
}

Timeline is following:

Does some one know the reason of this problem? It has bothered me for a long time.

“Callbacks without a mandated order (in independent streams) execute in
undefined order and may be serialized.”

“Adding device work to any stream does not have the effect of making the stream
active until all preceding callbacks have executed. Thus, for example, a callback
might use global attached memory even if work has been added to another stream,
if it has been properly ordered with an event.”

i am not sure how to interpret the quotes, particularly the latter one
why do you intend to use callbacks - any particular reason?
in my opinion, callbacks are limiting and restrictive, given what you are allowed and not allowed within callback functions
why can’t you make do with events?

perhaps you can verify that all work have been issued in all streams, prior to the first callback executing; if this is the case, i would seriously revisit the latter quote and see it as meaning multiple (outstanding/ pending) callbacks serialize streams

Yes, I’ve some particular reason. I want to be notified while some tasks which include kernel executing and data transferring between host and device is complete, so that host can assign some idle streams to do some new tasks which base on the completed tasks with automatic load balancing.

Your suggestion stated in the last part is correct and I’ve experimented it before, although this idea can’t meet my requirement.

Moreover, I’ve tried to implement the callback system by myself. I’ve created 32 work threads, every work thread master one stream. The communication between work stream and main stream uses message queue. For every work thread, they get their assignment from main thread using message queue and transfer data or launch kernel with their mastered stream synchronously, then the work threads notify main thread while the task in their stream is complete. I used these way for work threads to know whether the tasks in their stream is complete or not:

  1. use cudaStreamSynchronize to block the work thread until all tasks in stream is complete;
  2. use cudaStreamQuery and while(true) to poll the return value until it returns cudaSuccess which means the tasks in the stream are all complete;
  3. for every assignment from main thread, I insert a cudaEvent into the stream after other tasks in the assignment are inserted into the stream, and then I called cudaEventSynchronize in work thread to wait for the cudaEvent which means the tasks which inserted before it are all complete.
    All of these 3 idea failed and they are all make the kernel execution in different stream serialized(but the condition is better than only use one thread because of multi-thread execution order is undefined and synchronous task may be inserted later than inserted serially).
    I’ve also used --default-stream per-thread flag as compile flag which is state here:http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/, but nothing changed.

Anyway, thanks for your reply and advise, little_jimmy!

perhaps you should be careful: if an idea fails, it does not necessarily mean the idea does not work
this is particularly true with code, given that code almost always must be tested/ debugged first
poor/ imperfect implementation can sink any solid idea; should one then blame the idea, or the implementation?
i do not see why any of your 3 ideas can not work in principle; perhaps you just need to debug scrupulously

personally, the mechanism i prefer is to forward issue a string of work (x work blocks) (to and fro memory copies and kernel launch), issue events, forward issue another string of work, again with events, and then come back to the first string of work, which should start to conclude around that point
one then simply rolls over and continue with this process, such that a string of work is always pending on the device

hence, i really think a personalized, asynchronous callback system is possible

Here is my callback system:
main.cpp:

#include "cuda_call_check.h"
#include "test_kernel.h"
#include "thread_pool.h"

static const unsigned int ARRAY_SIZE = 1000000;
static const unsigned int NUM_STREAM = 32;

ThreadPool tp;
cudaStream_t *streams;
cudaEvent_t *events;

struct Instance
{
    unsigned int    idx_;
    float           *dev_point_;
    float           *host_point_;
    unsigned int    step_;
};

void run_thread(void *arg)
{
    Instance *ins = (Instance*)arg;

    //CUDA_SAFE_CALL(cudaStreamCreateWithFlags(streams + ins->idx_, cudaStreamNonBlocking));
    //CUDA_SAFE_CALL(cudaStreamCreate(streams + ins->idx_));

    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->dev_point_, ins->host_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice, streams[ins->idx_]));
    add_one(ins->dev_point_, ARRAY_SIZE, streams[ins->idx_]);
    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->host_point_, ins->dev_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyDeviceToHost, streams[ins->idx_]));

    CUDA_SAFE_CALL(cudaStreamSynchronize(streams[ins->idx_]));

    //while (cudaSuccess != cudaStreamQuery(streams[ins->idx_]));

    //CUDA_SAFE_CALL(cudaEventRecord(events[ins->idx_], streams[ins->idx_]));
    //CUDA_SAFE_CALL(cudaEventSynchronize(events[ins->idx_]));
}

int main(int argc, char **argv)
{
    float **dev_points, **host_points;
    Instance *instances;

    dev_points = new float*[NUM_STREAM];
    host_points = new float*[NUM_STREAM];
    streams = new cudaStream_t[NUM_STREAM];
    instances = new Instance[NUM_STREAM];
    events = new cudaEvent_t[NUM_STREAM];
    
    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaStreamCreateWithFlags(streams + i, cudaStreamNonBlocking));
        //CUDA_SAFE_CALL(cudaStreamCreate(streams + i));
        CUDA_SAFE_CALL(cudaMalloc(dev_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaMallocHost(host_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaEventCreate(events + i));
        for (size_t j = 0; j < ARRAY_SIZE; ++j)
        {
            host_points[i][j] = static_cast<float>(i + j);
        }
        instances[i].idx_ = i;
        instances[i].dev_point_ = dev_points[i];
        instances[i].host_point_ = host_points[i];
        instances[i].step_ = 0;
    }

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        tp.send_task(&run_thread, instances + i);
    }

    tp.destory();

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaStreamDestroy(streams[i]));
        CUDA_SAFE_CALL(cudaFree(dev_points[i]));
        CUDA_SAFE_CALL(cudaFreeHost(host_points[i]));
    }
    delete[]dev_points;
    delete[]host_points;
    delete[]streams;
    delete[]instances;

    return 0;
}

test_kernel.cu:

#include "test_kernel.h"
#include <device_launch_parameters.h>
void add_one(float *dev_point, unsigned int num, cudaStream_t stream)
{
    unsigned int num_blocks, num_threads;

    kernel_add_one <<<1, 4, 0, stream>>>(dev_point, num);
}

And the timeline without --default-stream per-thread flag is:

You can add the compile flag, use different stream creating function and put them in main() or in run_thread(), use different stream synchronous function(such as cudaStreamSynchronize ,cudaStreamQuery or cudaEventSynchronize ) and see the result. Different usage may lead to different result and most of them looks terrible like the image above.

ThreadPool is implemented by C++11 std::thread. ThreadPool::destroy wait for every thread in pool returns.

line 72:

tp.destory();

i am sure it is merely a type error; nonetheless, ‘destory’ is rather amusing…

but lets debug, shall we:

there is hardly a pattern in the kernel launch timelines; or, perhaps the pattern is a random one, which in itself is suggestive

i am not conversant with the specifics of the 970 - the device you note to be using
there is an upper limit on the number of kernels that may be resident per sm at any time; from this you can determine an upper limit on the number of kernels resident for the device at any time
just check the upper limit

also, how many cores does your cpu have?
you are possibly creating more threads than the number your cpu can seat at any given time
i do not know how std::thread threadpool would approach such a scenario - are all threads allowed execution time from the start, or must threads wait in line?
the threads self would busy wait on synchronization calls; i also am not sure what this would imply from a os scheduling perspective, and threads in the pool trying to get execution time
and with your holistic thread count, you may be putting enormous pressure on the host, from the perspective of the host itself attempting to create worker threads as part of executing your host code

thus, you may wish to try one or more of the following:
a) set the synchronization call flags to yield, rather than busy wait
b) restrict the number of threads to a number less than the number of threads on the core
c) have the same threads issue multiple work tasks

note the time again after this, to note whether this has any effect

also, i must really check up on the use of cudaStreamCreateWithFlags and cudaStreamNonBlocking on the host side; i am more familiar with the mentioned apis being used on the device side as part of dynamic parallelism

I’m really thank you for your concerns, little_jimmy!

Some of properties of GTX 970:

  • asyncEngineCount = 2, means the device can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page-locked host memory.
  • kernelExecTimeoutEnabled = 1, kernel has runtime limit.
  • streamPrioritiesSupported = 0, do not support stream priorities.

I’ve also set CUDA_DEVICE_MAX_CONNECTIONS = 32 in environment variable. But I’m not sure this setting is working.
BTW, I’ve test my program on GTX 750TI and GT 755M and I got the same result.

My CPU is Intel Core2 Q9650 which has 4 cores, and I created 32 worker thread. Then I set the number of threads in thread pool as 8, the timeline is as follow:

4 threads:

Threads in thread pool do not wait in line and are blocked by std::condition_variable::wait() while there is no work to do. If a work comes, work request thread will call std::condition_variable::notify_one() to wake one of threads in pool up. It is unspecified which of the threads is selected. I believe there is no busy waiting.

“have the same threads issue multiple work tasks”? Execute kernel function twice as following ?

void run_thread(void *arg)
{
    Instance *ins = (Instance*)arg;

    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->dev_point_, ins->host_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice, streams[ins->idx_]));
    add_one(ins->dev_point_, ARRAY_SIZE, streams[ins->idx_]);
    add_one(ins->dev_point_, ARRAY_SIZE, streams[ins->idx_]);
    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->host_point_, ins->dev_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyDeviceToHost, streams[ins->idx_]));

    CUDA_SAFE_CALL(cudaStreamSynchronize(streams[ins->idx_]));
}

Result with 32 threads in thread pool:

Or like this:

void run_thread(void *arg)
{
    Instance *ins = (Instance*)arg;

    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->dev_point_, ins->host_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice, streams[ins->idx_]));
    add_one(ins->dev_point_, ARRAY_SIZE, streams[ins->idx_]);
    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->host_point_, ins->dev_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyDeviceToHost, streams[ins->idx_]));
    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->dev_point_, ins->host_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice, streams[ins->idx_]));
    add_one(ins->dev_point_, ARRAY_SIZE, streams[ins->idx_]);
    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->host_point_, ins->dev_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyDeviceToHost, streams[ins->idx_]));

    CUDA_SAFE_CALL(cudaStreamSynchronize(streams[ins->idx_]));
}

I’ve heared of dynamic parallelism. It is very interesting and powerful, but I have not used it yet.

Here is my question on stackoverflow: http://stackoverflow.com/questions/29433938/why-does-cudastreamaddcallback-serialize-kernel-execution-and-break-concurrency?noredirect=1#comment47041622_29433938
Is it possible that CUDA has some defect on Windows?

I had to solve a similiar issue before in linux. I’d guess your problem is the order that commands are issued is essentially randomized, so not ideal. Check this is, in particular the stream scheduling part. Though that’s for a much older architecture it may still be accurate.

My solution was to have all of the memcpy’s and kernels launched from a single thread with an event to determine when they’ve finished. After issuing the event it would place the event and other relevant data on a work queue to be processed by a lower priority worker thread. Since all my kernels take the same amount of time the events should finish in similar order that they were issued, a basic queue was enough. I had each worker poll the event with a 10 microsecond sleep then start processing once the event had completed. The events were created with cudaEventDisableTiming. I had to poll with sleep since event synchronize is a spin lock and would slow down the rest of CUDA, blocking sync slowed my short lived kernels way too much and Callback synchronized everything. The worker threads were set to a lower priority so they didn’t slow down the issuing thread.

this was more what i was referring to:

Maximum number of resident blocks per multiprocessor (cc 5) 32

“The maximum number of kernel launches that a device can execute concurrently is 16
on devices of compute capability 2.0 through 3.0; the maximum is 32 concurrent kernel
launches on devices of compute capability 3.5 and higher. Devices of compute capability
3.2 are limited to 4 concurrent kernel launches.”

this is from the programming guide
so, clearly you are within bounds in terms of the upper limit

“have the same threads issue multiple work tasks”? Execute kernel function twice as following ?

no, not exactly. but already that particular test or test code of yours is significant in that it manages to increase the number of kernels running concurrently from 4 to 6
what i meant is, for x tasks and y worker threads, have (x / y) > 1, such that you manage to issue the same amount of tasks with fewer threads
in other words, pass multiple work task ‘packets’ or ‘orders’ to the same thread, instead of 1

with this:

tp.send_task(&run_thread, instances + i);

can you send multiple instances to a thread? i think the threadpool is restricting you, to the extent that it consumes too many threads, and to the extent that it prevents you from getting all work (kernels) out, before starting to wait for their completion
perhaps a very rudimentary work around is to issue 2 sets of work via tp.send_task
the 1 round issue all kernels with events, but without any synchronization - no waiting
the second round links to the events, and does the synchronization - the waiting
hence, you would essentially take run_thread and split it around the point of synchronization

for (cases)
{
tp.send_taskA(&run_thread, instances + i); // issue work without waiting
}

for (cases)
{
tp.send_taskB(&run_thread, instances + i); // do the waiting via events
}

“I believe there is no busy waiting”

CUDA_SAFE_CALL(cudaStreamSynchronize(streams[ins->idx_]));

the above and other synchronization type calls may very well imply busy waiting
some of these calls have flags that determine whether it would yield or busy wait
with others, busy waiting may be implicit, and without a flag to prevent it
the prudent thing is to be cognizant of whether an api is true-form synchronous or asynchronous, and whether it would busy wait in the case of the former

To mwilkinson:
Your advice was a great help to me. I’ve used lots of time to try again using cudaEvent in many combinations. Most of them result in bad performance, but some of them are really good with unsteadiness. cudaEventSynchronize and cudaEventQuery can both get good performance and they are really better than cudaStreamSynchronize. Thank you very much.
I’ll post my code and timeline at the bottom of this post.

To little_jimmy:
I think your point about upper limit may be right. The performance of the program became stable by changing the number of stream from 32 to 16 and it is no much use setting the number as less than 16. But I’m still not sure about the mechanism because we should be able to use 32 concurrent kernel which you’ve quoted.

“have the same threads issue multiple work tasks”. This is lazy task creation(LTC) strategy, isn’t it? I’ve used it while I was constructing an application server 1 year ago and this is my old profession. I’m confused with the reason why my program can not handle 33 threads because my server used “serve one client with each server thread with blocking I/O” strategy(use proactor pattern later for higher efficiency) and it is easy for it to handle more than 100 concurrent client requests with low latency.
Using LTC is a good idea. Device would not need to communicate with host while its task is completed on every time. But it is quite difficult to design a algorithm about task combining strategy. I think this will depend on the actual situation.
Using busy waiting to implement synchronization type calls which is inevitable to cost plenty of CPU time-slice by a corporation like nvidia is terrible! I can hardly believe it!
Anyway, your ideas are great inspirations to me. Thank you!

Here is my codes:

#include "cuda_call_check.h"
#include "test_kernel.h"
#include "thread_pool.h"

static const unsigned int ARRAY_SIZE = 1000000;
static const unsigned int NUM_STREAM = 16;

static std::chrono::system_clock::time_point START_TIME;

ThreadPool tp;
cudaStream_t *streams;
cudaEvent_t *events;

struct Instance
{
    unsigned int    idx_;
    float           *dev_point_;
    float           *host_point_;
    unsigned int    step_;
};

void run_thread(void *arg)
{
    Instance *ins = (Instance*)arg;

    CUDA_SAFE_CALL(cudaStreamCreateWithFlags(streams + ins->idx_, cudaStreamNonBlocking));

    std::this_thread::sleep_until(START_TIME + std::chrono::milliseconds(600));

    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->dev_point_, ins->host_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice, streams[ins->idx_]));
    add_one(ins->dev_point_, ARRAY_SIZE, streams[ins->idx_]);
    CUDA_SAFE_CALL(cudaMemcpyAsync(ins->host_point_, ins->dev_point_, ARRAY_SIZE * sizeof(float), cudaMemcpyDeviceToHost, streams[ins->idx_]));

    CUDA_SAFE_CALL(cudaEventRecord(events[ins->idx_], streams[ins->idx_]));
    CUDA_SAFE_CALL(cudaEventSynchronize(events[ins->idx_]));
}

int main(int argc, char **argv)
{
    START_TIME = std::chrono::system_clock::now();

    float **dev_points, **host_points;
    Instance *instances;

    dev_points = new float*[NUM_STREAM];
    host_points = new float*[NUM_STREAM];
    streams = new cudaStream_t[NUM_STREAM];
    instances = new Instance[NUM_STREAM];
    events = new cudaEvent_t[NUM_STREAM];
    
    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaMalloc(dev_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaMallocHost(host_points + i, ARRAY_SIZE * sizeof(float)));
        CUDA_SAFE_CALL(cudaEventCreateWithFlags(events + i, cudaEventBlockingSync));
        for (size_t j = 0; j < ARRAY_SIZE; ++j)
        {
            host_points[i][j] = static_cast<float>(i + j);
        }
        instances[i].idx_ = i;
        instances[i].dev_point_ = dev_points[i];
        instances[i].host_point_ = host_points[i];
        instances[i].step_ = 0;
    }

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        tp.send_task(&run_thread, instances + i);
    }

    tp.destory();

    for (size_t i = 0; i < NUM_STREAM; ++i)
    {
        CUDA_SAFE_CALL(cudaStreamDestroy(streams[i]));
        CUDA_SAFE_CALL(cudaFree(dev_points[i]));
        CUDA_SAFE_CALL(cudaFreeHost(host_points[i]));
    }
    delete[]dev_points;
    delete[]host_points;
    delete[]streams;
    delete[]instances;

    return 0;
}

The performance: timeline of runtime apis and streams

But the performance is unstable. like this:

Tips:

  • Compile flag --default-stream per-thread may make condition worse in some case. I don't use it now.
  • Stream should be created in worker thread. I've no idea about effects of cudaStreamNonBlocking
  • cudaEventBlockingSync + cudaEventSynchronize can get good performance, so as cudaEventDisableTiming + cudaEventQuery
  • cudaStreamSynchronize, cudaStreamQuery are worse than cudaEventSynchronize and cudaEventQuery.

Tips above are merely my experience.

again, if you take a step back, you would note that you now manage to run 8 - 12 kernels concurrently
i think at this point you are much the wiser than when you started, and you should now be more clear as to possible factors at play, why your initial design did perhaps not work as intended, and factors you likely overlooked with your very initial design

“I’m confused with the reason why my program can not handle 33 threads”

i do not think that your program can not handle 33 threads; it is more a case of your program struggling to get by with that many threads, given the factors at play, and how your program adapts to that

“But it is quite difficult to design a algorithm about task combining strategy”

come now.
you seem set on using threadpools. i am not going to question this
if you had more work (kernels), i think you would not face the difficulties you are facing now, as the work would have naturally hidden the device execution latency
you may be struggling because your work (kernel) :: worker-threads ratio is (too) low
clearly, it is not a good idea to have too many threads wait on the device, particularly when you commence with waiting very soon after having issued kernels, and particularly when you busy-wait (you can set whether the apis must busy wait or not)
thus, if your are going to use threadpools, you should perhaps more seriously consider splitting the issuing work part, and the catching work (waiting on work) part, in order to better adapt to the known latency of the device processing work
if you split issuing of work and waiting on work, you may retain ease of work, whilst improving performance
perhaps you could use 2 threadpools - the first is used to issue (all) kernels - a single task per work-thread; the second is used to wait on kernels
but you likely need to have a single thread wait on multiple tasks, to improve your waiting
i do not see a problem with this, as such a thread really needs little information to know what work (event) to wait on
and the thread can wait in sequence: for the x tasks to wait on, starting at y, while i < x, wait on i, and move on to waiting on i + 1, when i is done
and set the api flags so that the threads do not busy wait when calling cudaEventSynchronize, etc

you need to be mindful of the amount of useful work :: amount of work done by host threads
improving your method of waiting can help with this

Your experience in programming is good for me and I’ll try your idea later.

I’m always reflecting on what I’ve done these days.