I have a Quadro K2200, and according to the “Technical Specifications per Compute Capability” table it supports up to 32 concurrent kernels. Based on this, I ran 32 BlackScholes kernels simultaneously. I modified the code to create another stream to use instead of stream 0, and I split the resources (threads and blocks) equally among the 32 BlackScholes kernels in order to fill the GPU. The total execution time was about 16 seconds. However, if I run a single BlackScholes kernel with the same number of threads and blocks that each kernel received in the 32-way split, the execution time is about 3 seconds.
Is this execution time for 32 kernels running in parallel reasonable? Does it indicate any concurrency? I expected the run to take about 3 seconds, since each kernel has exactly the same number of threads and blocks as in the single-kernel run and all the kernels should run in parallel. Could this be a sign of interference between the kernels?
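For reference, the comparison described above can be sketched roughly as follows. This is only an illustrative sketch, not the actual modified BlackScholes sample. It assumes a single process that launches each kernel into its own non-default stream. `busyKernel`, `timeLaunches`, and the values of `blocks`, `threads`, and `iters` are hypothetical placeholders, not taken from the sample or from the measurements above.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the BlackScholes kernel: every thread runs a
// dependent arithmetic loop so the kernel lasts long enough to be timed.
__global__ void busyKernel(float *out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = 1.0f + 1e-6f * i;
    for (int k = 0; k < iters; ++k)
        x = x * 0.999999f + 0.5f;
    out[i] = x;
}

// Launch numKernels copies of busyKernel, one per stream, each with the same
// grid and block size. Returns the wall-clock time in milliseconds from the
// first launch until all kernels have finished.
static double timeLaunches(int numKernels, int blocks, int threads, int iters)
{
    const int n = blocks * threads;
    std::vector<cudaStream_t> streams(numKernels);
    std::vector<float *> bufs(numKernels);
    for (int s = 0; s < numKernels; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc(&bufs[s], n * sizeof(float)));
    }

    CHECK(cudaDeviceSynchronize());  // make sure the GPU is idle before timing
    auto t0 = std::chrono::steady_clock::now();
    for (int s = 0; s < numKernels; ++s)
        busyKernel<<<blocks, threads, 0, streams[s]>>>(bufs[s], n, iters);
    CHECK(cudaGetLastError());       // catch launch configuration errors
    CHECK(cudaDeviceSynchronize());  // wait for every stream to drain
    auto t1 = std::chrono::steady_clock::now();

    for (int s = 0; s < numKernels; ++s) {
        CHECK(cudaStreamDestroy(streams[s]));
        CHECK(cudaFree(bufs[s]));
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    // Illustrative values only (assumptions, not the configuration used above).
    const int numKernels = 32;
    const int blocks = 4, threads = 128, iters = 1 << 22;

    timeLaunches(1, blocks, threads, iters);  // warm-up, result discarded
    double one  = timeLaunches(1, blocks, threads, iters);
    double many = timeLaunches(numKernels, blocks, threads, iters);

    printf("1 kernel:   %.1f ms\n", one);
    printf("%d kernels: %.1f ms (ratio %.2f)\n", numKernels, many, many / one);
    return 0;
}
```

With a setup like this, a ratio close to 1 would suggest the kernels overlap almost completely, while a ratio close to 32 would suggest they run one after another. By the same reasoning, 16 seconds against 3 seconds is a ratio of about 5, which sits between those two extremes.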