Each query takes much longer when run in a batch than when run alone

I have a data graph and 1000 query graphs. If I put all the query graphs together in a list and run them sequentially on the GPU, almost every query takes longer than it does when executed separately. For example, query No. 500 may take 40 ms when executed alone, but 500 ms when executed together with the other queries.
Here is the pseudocode for the case where all queries are put together:

for (const auto& queryGraph : queryList) {
    device_func(dataGraph, queryGraph);   // runs the matching kernel(s) for this query
    cudaDeviceSynchronize();              // wait for this query to finish before the next
}

So, is there any feature of the GPU that may lead to this result?

GPUs aren’t about fast execution of a single task. They are about executing many tasks simultaneously. If each query becomes 10x slower, but you can run 1000 of them simultaneously, your overall speedup is still 100x.

If all you need is the fastest response time for a single query, a CPU will be much better.
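For illustration, running queries simultaneously might look like the sketch below using CUDA streams. This is only a sketch under assumptions: query_kernel, grid, and block are placeholders, since I’m assuming each query boils down to one kernel launch.

#include <vector>

// Sketch: give each query its own stream so kernels can overlap,
// instead of serializing every launch with cudaDeviceSynchronize().
std::vector<cudaStream_t> streams(queryList.size());
for (size_t i = 0; i < queryList.size(); ++i) {
    cudaStreamCreate(&streams[i]);
    query_kernel<<<grid, block, 0, streams[i]>>>(dataGraph, queryList[i]);
}
cudaDeviceSynchronize();   // wait once, after all queries are in flight
for (auto& s : streams) { cudaStreamDestroy(s); }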

Thank you, but I have divided each query task into many small tasks and distributed them across many blocks, so there should be enough parallelism in my code.
And I want to know why a query may take longer when executed sequentially with other tasks on the GPU than when executed alone on the GPU. In both cases it runs on the GPU. Do you have any idea?

Because the resources of the GPU are divided between all the tasks executing at a given moment.

So, the more tasks you run simultaneously, the slower each individual one becomes. But overall throughput is increased.

Maybe I didn’t make myself clear. What you answered isn’t what I wanted to ask. I mean the queries are executed on the GPU one after another, not simultaneously. But thank you anyway :)

My bad, you have indeed shown what “executed together” is. Can you show what “executed separately” is? In particular, at which points do you measure the time? Maybe it’s about loading data to the GPU? As usual, a full reproducible test case is better than hunting around.

There is no difference in the code skeletons of the two cases. They are as follows:

for (const auto& queryGraph : queryList) {
    device_func(dataGraph, queryGraph);   // runs the matching kernel(s) for this query
    cudaDeviceSynchronize();              // wait for this query to finish before the next
}

The difference is the input file. I run my code from the command line as follows.
Case 1:

./run dataFile.txt query_1.txt

This is what I call “run alone”.

Case 2:

./run dataFile.txt querys.txt

This is what I call “together with other queries”. The file “querys.txt” contains many queries, and the content of “query_1.txt” is one of them. That is to say, in case 1 the queryList has only one query, while in case 2 it has many. I find that query_1 takes more time when run “together with other queries” than when run alone.

Sorry, I still have to guess what the difference between the two cases is. I.e., if your first program is

case1();
print(time);

and the second program is

case1();
print(time);
case2();

and the time printed was different, I would be very surprised. And if there were other differences, you haven’t explained them.

How exactly are you taking your timing measurements? Using CUDA events would be one of the most accurate ways to measure the exact kernel run time.
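For example, here is a minimal sketch of event-based timing around one call, assuming device_func launches its kernels on the default stream (dataGraph and queryGraph are the variables from your code above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
device_func(dataGraph, queryGraph);      // the work being timed
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("query time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);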

Are we sure that the CUDA context is already created before you run the first query? Creating a CUDA context may take hundreds of milliseconds, so it could throw off any timing measurements you are trying to make.
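A common warm-up idiom, as a generic sketch (not code from this thread), is to trigger context creation with a throwaway runtime call before anything you time:

// Force lazy context creation up front so it doesn't land inside a timed region.
cudaFree(0);               // harmless runtime call that initializes the context
cudaDeviceSynchronize();   // make sure initialization has completed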

I didn’t use CUDA events, and I’m going to take your advice. Thanks.

I have considered the creation of the CUDA context, and I have warmed it up.

What you said is exactly what I ran into, except that in the second program case1() may run after case2(), or there may be other cases before case1(). So I’m very surprised.