CUDA Dynamic Parallelism: one active thread only

I’m trying to use the CUDA dynamic parallelism feature, but I find that sometimes a child grid has only 1 active thread, even though I assign 256 threads to it.
To test this, I launch one child grid and have every thread print its own thread ID. Sometimes I get:

worker 5 at sm 0 tid 0 is child grid
worker 5 at sm 0 tid 1 is child grid
worker 5 at sm 0 tid 2 is child grid
worker 5 at sm 0 tid 3 is child grid
worker 5 at sm 0 tid 4 is child grid
worker 5 at sm 0 tid 5 is child grid
worker 5 at sm 0 tid 6 is child grid
worker 5 at sm 0 tid 7 is child grid
worker 5 at sm 0 tid 8 is child grid
worker 5 at sm 0 tid 9 is child grid
worker 5 at sm 0 tid 10 is child grid
......

which indicates that all threads print their tid, but sometimes I get:

worker 5 at sm 1 tid 0 is child grid
worker 6 at sm 4 tid 0 is child grid
worker 7 at sm 2 tid 0 is child grid
worker 5 at sm 5 tid 0 is child grid
....

which means every block of the child grid has only one active thread. I’m pretty sure both tests run the same code and that all of the output has been read. What is wrong here? Is this a bug in CDP? My GPU is a GTX 1050 Ti, and the nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   58C    P0    N/A /  N/A |    670MiB /  4040MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

System: Ubuntu 18.04 with kernel 5.4.0-65-generic

If you’re not seeing the output you expect, my first guess would be that you are not using in-kernel printf properly, for example by overrunning the buffer. You might want to read all of this section of the programming guide carefully. I really can’t say anything else without seeing a complete code example.
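For anyone who wants to rule out the buffer-overrun possibility, here is a minimal sketch of how the in-kernel printf buffer can be queried and enlarged from the host. The kernel name `print_tids`, the launch configuration, and the 16 MiB size are arbitrary choices for illustration, not taken from the original code. The printf buffer is circular, so if a kernel produces more output than fits, older output is overwritten.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread prints its own thread ID.
__global__ void print_tids()
{
    printf("block %d tid %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    // Query the current size of the device-side printf buffer.
    size_t fifo_bytes = 0;
    cudaDeviceGetLimit(&fifo_bytes, cudaLimitPrintfFifoSize);
    printf("printf FIFO before: %zu bytes\n", fifo_bytes);

    // Enlarge the buffer. This has to happen before launching any
    // kernel that calls printf. 16 MiB is an arbitrary example value.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize,
                                         (size_t)16 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    print_tids<<<4, 256>>>();

    // Device-side printf output is flushed at synchronization points.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

If the missing lines come back after raising the limit, the buffer was the problem; if not, the cause is somewhere in the kernel logic.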

Thanks for replying, that was my fault. I use shared memory to record some state, and tid 0 is supposed to initialize it. Sometimes the uninitialized shared memory happens to hold the value -1, which causes all the other threads to exit, and that’s why this weird behavior happened only occasionally… Anyway, I’ve solved my problem, and thanks for your reply.
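For later readers, a minimal sketch of the pattern described above. The original code was never posted, so this assumes the problem was a missing barrier between the initialization by tid 0 and the reads by the other threads. The names `child`, `parent`, and `state`, and the use of -1 as an exit sentinel, are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel. `state` stands in for the shared-memory
// record from the post; -1 is assumed to mean "exit early".
__global__ void child(int worker)
{
    __shared__ int state;

    if (threadIdx.x == 0) {
        state = 0;        // only thread 0 initializes the shared value
    }
    // Without this barrier, other threads may read `state` before
    // thread 0 has written it and see whatever was left in shared memory.
    __syncthreads();

    if (state == -1) {
        return;           // now taken by all threads of the block or by none
    }
    printf("worker %d tid %d is child grid\n", worker, threadIdx.x);
}

// Parent kernel: one thread per block launches a 256-thread child grid.
__global__ void parent()
{
    if (threadIdx.x == 0) {
        child<<<1, 256>>>(blockIdx.x);
    }
}

int main()
{
    parent<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Dynamic parallelism needs relocatable device code and the device runtime library, so on a GTX 1050 Ti (compute capability 6.1) this would be built with something like `nvcc -arch=sm_61 -rdc=true example.cu -lcudadevrt`, where the file name is a placeholder.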