This looks like a compiler code-generation defect to me. On CUDA 12.0, the code does not hang if I compile with -G. Replacing size = cnt_such(); with size = 1995840; also avoids the hang. Since -G disables most device-code optimizations, both observations point at optimized code generation rather than at a logic error in your source.
My suggestions:
- retest on the latest available CUDA version if you are not already on that version.
- if the issue is still reproducible there, file a bug.