Max Pascal Kernel Concurrency = 32?

I am experimenting with concurrency on a GT 1030, and I found that I am not able to run more than 32 kernels simultaneously. I am using launch parameters of <<<1, 1>>> with a dedicated stream for each launch, and it does not matter how simple I make the kernel code.
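For reference, the stripped-down launch pattern I am using looks roughly like this (a minimal sketch; the kernel count and the busy-wait kernel body are just placeholders for my actual test):

```cuda
#include <cuda_runtime.h>

// Trivial kernel: spins long enough to be visible in the profiler timeline.
__global__ void busyKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    const int nKernels = 33;               // one past the suspected limit
    cudaStream_t streams[nKernels];

    for (int i = 0; i < nKernels; ++i)
        cudaStreamCreate(&streams[i]);

    // Each launch is a separate 1-block, 1-thread grid on its own stream,
    // so nothing but the device's concurrency limit should serialize them.
    for (int i = 0; i < nKernels; ++i)
        busyKernel<<<1, 1, 0, streams[i]>>>(100000000LL);

    cudaDeviceSynchronize();

    for (int i = 0; i < nKernels; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```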

To use some code that everyone has access to, this is what I see when I run the concurrentKernels example project that comes with the CUDA Toolkit:
31 kernels
Measured time for sample = 0.012s
32 kernels
Measured time for sample = 0.012s
33 kernels
Measured time for sample = 0.024s
34 kernels
Measured time for sample = 0.024s

There is a clear doubling in runtime once the magic number of 32 is exceeded. When running 33 kernels, the profiler timeline shows that one of the 33 is clearly serialized relative to the rest; for 32 kernels, it shows all 32 running in parallel.

The CUDA Programming Guide table of Technical Specifications per Compute Capability shows that there is a maximum of 32 blocks per SM. As far as I knew, each block could be a different kernel. That is clearly not working out in practice, or else I should be able to run 96 simultaneous kernels (32 blocks per SM times 3 SMs on the GT 1030).

Does this limit of 32 apply to all Pascal cards, or is it specific to the GT 1030? Is the kernel concurrency limit for Pascal documented anywhere?

The limit varies by compute capability. It is documented in the first line of table 14 in the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability (the same table you already referenced; note that its first line specifically refers to concurrent kernels, not blocks per SM).

Your GPU is of compute capability 6.1 (which you can confirm with deviceQuery). It has a limit of 32 "resident grids", i.e. 32 concurrent kernels.
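If it helps, the compute capability (and whether the device supports concurrent kernels at all) can also be read programmatically; a minimal sketch using the runtime API:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Concurrent kernels supported: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    // Note: the resident-grid count itself (32 for CC 6.1) is not exposed
    // as a device property; it comes from the programming guide table.
    return 0;
}
```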

Thanks! I have looked at that table countless times, but somehow I never noticed that first line and the associated link.