The first time you launch those kernels, they will run sequentially because of CUDA lazy initialization/lazy module loading. This sort of question comes up from time to time, so you can find various examples. Here is one. Also see here.
Lazy module loading is in effect by default on Windows on CUDA versions 12.3 and newer.
Even after you address that, WDDM behavior can sometimes make the desired concurrency difficult to witness. You might also try both settings of Windows Hardware Accelerated GPU Scheduling. But if it were me, I would address the lazy loading topic first.
A simple way to do that could be to add some code to each kernel that checks a flag and just exits if it is set. Call each kernel once with that exit flag set. That will get the kernel module loaded. Then later, when you call it with the flag not set, you can/may witness “normal” behavior.
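
Here is a minimal sketch of that warm-up idea. The kernel names (`kernelA`, `kernelB`), the `warmup` parameter, and the sizes are hypothetical placeholders, not your code. The grids are kept small on purpose so that one kernel does not fill the GPU by itself. Whether the two real launches actually overlap still depends on your GPU, the resources each kernel uses, and the WDDM behavior mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels. When warmup is true, the kernel returns immediately;
// the only purpose of that launch is to force the kernel's module to load.
__global__ void kernelA(float *data, int n, bool warmup)
{
    if (warmup) return;
    // Grid-stride loop so a small grid can cover the whole array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] = data[i] * 2.0f + 1.0f;
}

__global__ void kernelB(float *data, int n, bool warmup)
{
    if (warmup) return;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] = data[i] * 0.5f - 1.0f;
}

int main()
{
    const int n = 1 << 22;        // assumed problem size
    const int threads = 256;
    const int blocks = 4;         // small grid: leaves room for the other kernel

    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Warm-up launches with the exit flag set, so lazy loading happens here.
    kernelA<<<1, 1>>>(a, n, true);
    kernelB<<<1, 1>>>(b, n, true);
    cudaDeviceSynchronize();

    // Real launches in separate non-default streams, where overlap is possible.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<blocks, threads, 0, s1>>>(a, n, false);
    kernelB<<<blocks, threads, 0, s2>>>(b, n, false);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Use a profiler timeline (for example Nsight Systems) to check whether the two real launches overlap. As an alternative that needs no kernel changes, setting the environment variable `CUDA_MODULE_LOADING=EAGER` before running the application should turn lazy module loading off.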