This can be a complicated question to answer. It is closely related to the idea of occupancy, which is widely discussed on these forums and in the CUDA documentation.
Briefly, occupancy considers how much of a GPU's resources you are using, and this is the key to answering your question. A GPU has various resource limits: the number of SMs (which bounds the maximum number of blocks that can be simultaneously resident, i.e. running "concurrently"), registers, shared memory, and others.
To determine occupancy statically or theoretically (as opposed to measuring it), you must assess all the resources used by a kernel and compare them against the relevant limits of the GPU you are running on.
For a single kernel, we can do this using several methods, including the Occupancy Calculator API from the previously linked discussion.
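As a minimal sketch of what that looks like with the runtime API's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (the kernel `my_kernel` and the block size of 256 are just placeholders for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel; substitute your own
__global__ void my_kernel(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;
}

int main() {
    int maxActiveBlocksPerSM = 0;
    const int blockSize = 256;     // threads per block (assumed)
    const size_t dynamicSmem = 0;  // dynamic shared memory per block

    // theoretical occupancy: how many blocks of this kernel fit on one SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActiveBlocksPerSM, my_kernel, blockSize, dynamicSmem);

    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    printf("max active blocks per SM: %d, total across %d SMs: %d\n",
           maxActiveBlocksPerSM, numSMs, maxActiveBlocksPerSM * numSMs);
    return 0;
}
```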
I don’t know of a tool to do it for multiple kernels, but to a first order approximation you would simply aggregate or sum up the requirements of the kernels, in each category, and again compare them to relevant limits. If you stay within all relevant limits, then theoretically that group of kernels could run concurrently.
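As a sketch of that first-order aggregation idea, you can query each kernel's per-block requirements with `cudaFuncGetAttributes`, sum them under assumed launch configurations, and compare against GPU-wide totals (the kernels and launch sizes here are placeholders):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernels standing in for the group you hope to run concurrently
__global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }

int main() {
    cudaFuncAttributes a, b;
    cudaFuncGetAttributes(&a, kernelA);
    cudaFuncGetAttributes(&b, kernelB);

    // assumed launch configurations, for illustration only
    const int blocksA = 10, blocksB = 10, threads = 256;

    // first-order approximation: sum each resource category across kernels
    size_t smemTotal = blocksA * a.sharedSizeBytes + blocksB * b.sharedSizeBytes;
    long   regsTotal = (long)blocksA * threads * a.numRegs
                     + (long)blocksB * threads * b.numRegs;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("aggregate: %zu bytes smem, %ld registers\n", smemTotal, regsTotal);
    printf("GPU-wide:  %zu bytes smem, %ld registers\n",
           prop.sharedMemPerMultiprocessor * prop.multiProcessorCount,
           (long)prop.regsPerMultiprocessor * prop.multiProcessorCount);
    return 0;
}
```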
AFAIK CUDA makes no guarantee that two kernels will run concurrently, even if they satisfy the relevant limits and various requirements. Therefore a code design that depends on or requires kernel concurrency for correct behavior is, by definition, broken. That is something I would keep in mind before building a large use-case around kernel concurrency.
AFAIK, the primary objective of kernel concurrency is to allow, in some situations, increased utilization of the GPU (and therefore higher work efficiency) when the exposed work is limited, i.e. when individual kernels are "too small" to fill the GPU or achieve the highest parallel efficiency.
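To make that concrete, the usual mechanism is to launch the small kernels into separate non-default streams, which gives the scheduler the *opportunity* (not a guarantee) to overlap them. A minimal sketch, where `small_kernel` and the launch sizes are assumptions for illustration:

```cpp
#include <cuda_runtime.h>

// deliberately small kernel that cannot fill the GPU by itself
__global__ void small_kernel(float *d, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d[idx] += 1.0f;
}

int main() {
    const int nStreams = 4, n = 1024;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], n * sizeof(float));
    }

    // each launch (4 blocks of 256 threads) is "too small" on its own;
    // separate streams remove the false serialization of the default stream
    for (int i = 0; i < nStreams; ++i)
        small_kernel<<<4, 256, 0, streams[i]>>>(buf[i], n);

    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(buf[i]);
    }
    return 0;
}
```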
In your example, when you say you increased the grid to 100, I assume this means the number of blocks. Every GPU has a limit on the number of blocks that can run concurrently: there is a theoretical hardware limit, and possibly a lower limit due to the kernel code design and the occupancy considerations it entails. In any event, there is a limit to the number of blocks that can be simultaneously resident on SMs, i.e. executing "concurrently". I would presume it is possible that in your test, considering all the kernel launches in aggregate, you exceeded this limit (which is also a function of the GPU you are running on) and therefore did not witness full, or any, concurrency.
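As a sketch of estimating that aggregate block ceiling for your device (the block count of 100 mirrors your example; `cudaDevAttrMaxBlocksPerMultiprocessor` requires CUDA 11 or newer):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int numSMs = 0, maxBlocksPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    // hardware ceiling; actual occupancy may be lower for a given kernel
    cudaDeviceGetAttribute(&maxBlocksPerSM,
                           cudaDevAttrMaxBlocksPerMultiprocessor, 0);

    int hwMaxResidentBlocks = numSMs * maxBlocksPerSM;
    printf("hardware ceiling: %d resident blocks\n", hwMaxResidentBlocks);

    // e.g. several grids of 100 blocks each quickly approach this ceiling
    const int blocksPerKernel = 100;
    printf("at most ~%d such grids could be fully resident at once\n",
           hwMaxResidentBlocks / blocksPerKernel);
    return 0;
}
```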
Yes, there is a hardware limit on the number of concurrent kernels ("grids") that can run. But this number is, I think, 16 or higher (it may vary by GPU), so it is not likely to explain why you can run only 1 or 2 kernels concurrently.
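You can at least confirm that your device supports concurrent kernels at all via the device properties; the actual maximum number of concurrent grids is listed per compute capability in the programming guide rather than being directly queryable:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // concurrentKernels is 1 if the device can execute multiple kernels
    // concurrently, 0 otherwise
    printf("concurrentKernels: %d (compute capability %d.%d)\n",
           prop.concurrentKernels, prop.major, prop.minor);
    return 0;
}
```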