Strange behavior of execution time in concurrent kernels

Hello everyone,

I have a Quadro K2200 and, according to the “Technical Specifications per Compute Capability” table, it can run up to 32 concurrent kernels. Based on this, I ran 32 instances of BlackScholes simultaneously. I modified the code to create another stream to use instead of stream0, and I split the resources (threads and blocks) equally among the 32 BlackScholes instances in order to fill the GPU. The total execution time was about 16 seconds. However, if I run a single BlackScholes kernel with the same number of threads and blocks that I assigned to each of the 32, the execution time is about 3 seconds.

Is this execution time for 32 kernels running in parallel reasonable? Does it indicate concurrency? I expected it to take 3 seconds, since the threads and blocks are exactly the same and all the kernels run in parallel. Could this be an indication of interference?
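
For reference, the capabilities mentioned above can also be checked at run time instead of from the table. The following is only a minimal sketch, assuming the GPU of interest is device 0. Note that the `concurrentKernels` field only reports whether concurrent kernel execution is supported at all; it does not report the maximum number of concurrent kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Device index 0 is an assumption; change it on multi-GPU systems.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    // Non-zero means kernels from different streams may overlap.
    printf("Concurrent kernels supported: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Max resident threads per multiprocessor: %d\n",
           prop.maxThreadsPerMultiProcessor);
    return 0;
}
```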

How many threads and blocks do you use in each case? Could it be the time spent copying data? (Just measure how much time the second case takes with the data copies included.)

If you only created one stream, that would not be enough to run 32 kernels concurrently. You would need 32 streams.

Rather than ask this:

You might want to learn to use a profiler.

I tried it with enough streams but the result was the same: the total execution time is smaller than if the kernels ran serially, but it is not the execution time I expected (3 seconds, as for one kernel). I also know how to use a profiler, but I haven’t used it for this situation because I have a script that runs ./BlackScholes 32 times. However, I thought that the execution time would be indicative of what is happening.

Thank you both for your replies

I created a different stream inside each copy of the BlackScholes code and then ran all 32 executables together from the script. For example, one ./BlackScholes was the executable built from the code using stream1, another ./BlackScholes was the executable built from the code using stream2, and so on.
Maybe the problem is that all 32 streams must be created together inside the same program in order to be different?

CUDA streams only work within the same CUDA context.

Different executables run in different contexts.
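
To illustrate the point about contexts, here is a rough sketch of launching 32 kernels in 32 streams from a single process, so that they all share one context. This is not the BlackScholes sample code: `dummyKernel`, the block size, and the grid size are hypothetical placeholders chosen only so that each launch is small enough to leave room for the others. Whether the launches actually overlap depends on the device and on the resources each kernel uses, so it is worth confirming in a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: some arithmetic per element.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) {
            x = x * 1.0001f + 0.5f;
        }
        data[i] = x;
    }
}

int main()
{
    const int kNumStreams = 32;   // the 32 kernels discussed in the thread
    const int kThreads = 128;     // assumed block size
    const int kBlocks = 4;        // assumed small grid, leaves room for overlap
    const int n = kThreads * kBlocks;

    cudaStream_t streams[kNumStreams];
    float *buffers[kNumStreams];

    // All streams are created in this one process, hence in one context.
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemset(buffers[s], 0, n * sizeof(float));
    }

    // One launch per stream; launches in different streams are allowed
    // (but not guaranteed) to run concurrently.
    for (int s = 0; s < kNumStreams; ++s) {
        dummyKernel<<<kBlocks, kThreads, 0, streams[s]>>>(buffers[s], n);
    }

    // Wait for every stream and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < kNumStreams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

Running 32 separate executables, by contrast, gives each process its own context, so the stream used inside each one has no relation to the streams in the others.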

okkk then! thank you very much!

I created 32 streams in the same program and launched the kernel 32 times, each time in a different stream. I used the Visual Profiler and the kernels were not concurrent, yet the total execution time was just 3 seconds (the execution time of one kernel). As far as I could see, the streams were different and everything executed as it should. How can the execution time be just 3 seconds if the kernels were not concurrent?

Each kernel executes in 3/32 seconds :)

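In general, a GPU has hundreds to thousands of cores, and it keeps many more threads resident than it has cores (roughly 16 per core on this class of hardware) so that it can switch between them to hide latency.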

A kernel launch may contain thousands or millions of threads, which are started on the GPU as resources allow. So each of your smaller kernels has to create 32x fewer threads, and hence it executes 32x faster than the large kernel.
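
One way to check this scaling yourself is to time a large launch and a launch with 32x fewer blocks using CUDA events. The sketch below is only an illustration under stated assumptions: `busyKernel`, the block size, and the grid sizes are made up, and it is not the BlackScholes kernel. The ratio should only come out near 32 if even the small grid is large enough to fill the GPU; for very small grids the two times will be much closer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: the same amount of arithmetic for every thread.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 2000; ++k) {
            x = x * 1.0001f + 0.5f;
        }
        data[i] = x;
    }
}

// Time a single launch of the given size, in milliseconds.
static float timeLaunch(float *d, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busyKernel<<<blocks, threads>>>(d, blocks * threads);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int threads = 256;                   // assumed block size
    const int smallBlocks = 1024;              // assumed "1/32" grid
    const int largeBlocks = 32 * smallBlocks;  // the full-size grid

    float *d = nullptr;
    size_t bytes = (size_t)largeBlocks * threads * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, bytes);

    timeLaunch(d, smallBlocks, threads);       // warm-up, result discarded

    float largeMs = timeLaunch(d, largeBlocks, threads);
    float smallMs = timeLaunch(d, smallBlocks, threads);

    printf("large grid: %.3f ms\n", largeMs);
    printf("small grid: %.3f ms\n", smallMs);
    printf("ratio large/small: %.2f\n", largeMs / smallMs);

    cudaFree(d);
    return 0;
}
```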