Strange behavior of execution time in concurrent kernels

mariapap95 · March 28, 2018, 11:09pm

Hello everyone,

I have a Quadro K2200 and, as I saw in the “Technical Specifications per Compute Capability” table, it can run up to 32 concurrent kernels. Based on this, I ran 32 BlackScholes instances simultaneously. I modified the code to create another stream instead of using stream0, and I split the resources (threads and blocks) equally among the 32 BlackScholes instances in order to fill the GPU. The total execution time was about 16 seconds. However, if I run one BlackScholes kernel with the same number of threads and blocks as those I assigned to each instance when I split into 32, the execution time is about 3 seconds.

Is this execution time for 32 kernels running in parallel reasonable? Does it indicate concurrency? I expected it to take 3 seconds, since the threads and blocks are exactly the same and all the kernels run in parallel. Is this maybe an indication of interference?

BulatZiganshin · March 29, 2018, 1:47am

how many threads and blocks do you use in both cases? could it be the time for copying data? (just compute how much time you spend in the second case with data copying included)
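A minimal sketch of the measurement suggested above: time the host-to-device copy, the kernel, and the device-to-host copy separately with CUDA events. The kernel `scaleKernel`, the array size, and the launch configuration are placeholders chosen for illustration, not the BlackScholes sample's actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the BlackScholes kernel.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;                 // assumed problem size
    const size_t bytes = n * sizeof(float);

    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);             // pinned host buffer
    cudaMalloc(&d, bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0);
    cudaEventCreate(&e1);
    cudaEventCreate(&e2);
    cudaEventCreate(&e3);

    // Record an event between each phase so the phases can be timed separately.
    cudaEventRecord(e0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e1);
    scaleKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(e2);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);              // wait until all phases are done

    float h2dMs = 0.0f, kernelMs = 0.0f, d2hMs = 0.0f;
    cudaEventElapsedTime(&h2dMs, e0, e1);
    cudaEventElapsedTime(&kernelMs, e1, e2);
    cudaEventElapsedTime(&d2hMs, e2, e3);
    printf("H2D copy: %.3f ms, kernel: %.3f ms, D2H copy: %.3f ms\n",
           h2dMs, kernelMs, d2hMs);

    cudaEventDestroy(e0);
    cudaEventDestroy(e1);
    cudaEventDestroy(e2);
    cudaEventDestroy(e3);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

If the copies dominate, the wall-clock time of the whole program says little about whether the kernels themselves overlapped.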

Robert_Crovella · March 29, 2018, 4:50am

If you only created one stream, that would not be enough to run 32 kernels concurrently. You would need 32 streams.

Rather than asking this, you might want to learn to use a profiler.

mariapap95 · March 29, 2018, 9:20am

I tried it using enough streams but the result was the same: the total execution time is smaller than if the kernels ran serially, but it is not the execution time I expected (3 seconds, as for one kernel). I also know how to use a profiler, but I haven’t used it for this situation, as I have a script that runs 32 ./BlackScholes instances. However, I thought that the execution time would be indicative of what is happening.

Thank you both for your replies

mariapap95 · March 29, 2018, 9:34am

I created a different stream each time inside the BlackScholes code and then ran all 32 executables together in the script. For example, one ./BlackScholes was the executable of the code with stream1, another ./BlackScholes was the executable of the code with stream2, etc.
Maybe the problem is that all the 32 streams must be created together inside the same code in order to be different?

cbuchner1 · March 29, 2018, 10:58am

CUDA streams only work within the same CUDA context.

Different executables run in different contexts.
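As a rough illustration of the point above, here is a sketch that creates all 32 streams inside one process (and therefore one context) and launches one kernel per stream. The kernel `busyKernel`, the grid size, and the iteration count are assumptions made for the example; the grids are kept deliberately small so that several kernels can fit on the GPU at once.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-bound kernel standing in for BlackScholes.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

int main() {
    const int kStreams = 32;   // concurrent-kernel limit quoted in the thread
    const int n = 256;         // assumed: small per-kernel workload
    const int threads = 64;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    // All streams live in the same process, hence the same CUDA context.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaMemsetAsync(buf[s], 0, n * sizeof(float), streams[s]);
    }

    // One launch per stream; the hardware may overlap them if resources allow.
    for (int s = 0; s < kStreams; ++s) {
        busyKernel<<<blocks, threads, 0, streams[s]>>>(buf[s], n, 100000);
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

Whether the launches actually overlap is best checked in the profiler timeline; separate streams only make concurrency possible, they do not guarantee it.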

mariapap95 · March 29, 2018, 11:34am

okkk then! thank you very much!

mariapap95 · March 30, 2018, 3:22pm

I created 32 streams in the same code and called the kernel 32 times, each time in a different stream. I used the Visual Profiler, and the kernels were not concurrent, yet the total execution time was just 3 seconds (the execution time of one kernel). As far as I could see, the streams were different and everything executed as it should. How can the execution time be just 3 seconds if the kernels were not concurrent?

BulatZiganshin · March 30, 2018, 4:11pm

each kernel executes in about 3/32 of a second :)

overall, a GPU has hundreds or thousands of cores, and each core can have up to 16 threads resident at a time

a kernel call may contain thousands or millions of threads, which are started on the GPU as resources allow. so each of your smaller kernels launches 32x fewer threads, and hence it executes 32x faster than the large kernel. and if one kernel already fills the GPU, the next one can only start as blocks finish, so the profiler shows little or no overlap
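One way to check this reasoning is to ask the runtime how many blocks of a kernel can be resident at once and compare that with the grid size. The sketch below does so with `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; the kernel `dummyKernel`, the block size of 128, and the grid of 480 blocks are assumed values for illustration, not figures from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; occupancy depends on the real kernel's resource usage.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threadsPerBlock = 128;   // assumed launch configuration
    const int blocksInGrid = 480;      // assumed launch configuration

    // How many blocks of this kernel fit on one SM at the same time.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, dummyKernel, threadsPerBlock, 0);
    if (blocksPerSM <= 0) {
        printf("occupancy query failed or kernel cannot launch\n");
        return 1;
    }

    // Blocks the whole GPU can hold at once, and how many "waves" the grid needs.
    int residentBlocks = blocksPerSM * prop.multiProcessorCount;
    int waves = (blocksInGrid + residentBlocks - 1) / residentBlocks;

    printf("SMs: %d, max resident threads per SM: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
    printf("resident blocks on whole GPU: %d, grid blocks: %d, waves: %d\n",
           residentBlocks, blocksInGrid, waves);
    printf("device reports concurrent kernel support: %d\n",
           prop.concurrentKernels);
    return 0;
}
```

If the grid needs one or more full waves on its own, a single kernel occupies the whole GPU, and kernels in other streams mostly run back to back, each taking roughly its share of the total time.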