Your observations are consistent with the presence of out-of-bounds memory accesses in your code.
Not every out-of-bounds access triggers an illegal address exception. Only those accesses that happen to fall outside the memory allocations made by the program do, and those allocations typically have page granularity. This is very much analogous to what happens on CPUs, where one needs a tight, byte-accurate checker like valgrind to find all out-of-bounds accesses, because most such accesses are undetectable by standard OS mechanisms.
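To make this concrete, here is a minimal sketch of an off-by-one write that will often go unnoticed when run normally. The kernel name `fill_buggy` and the helper `run_buggy` are hypothetical and not taken from your code; whether the runtime reports an error depends on where the allocation happens to land, which is exactly the point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with an off-by-one bug: the guard uses <= instead of <,
// so the thread with idx == n writes one element past the end of the buffer.
__global__ void fill_buggy(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx <= n) {              // BUG: should be idx < n
        data[idx] = idx;
    }
}

// Launches the buggy kernel on an n-element buffer and returns the status
// reported by the runtime once the kernel has finished.
cudaError_t run_buggy(int n)
{
    int *d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(int));
    if (err != cudaSuccess) return err;

    int threads = 128;
    int blocks = (n + 1 + threads - 1) / threads;  // enough threads to reach idx == n
    fill_buggy<<<blocks, threads>>>(d_data, n);

    err = cudaDeviceSynchronize();   // an illegal address, if detected, surfaces here
    cudaFree(d_data);
    return err;
}
```

With a small `n`, the stray write lands only a few bytes past the requested size and so is likely still inside the memory actually reserved for the allocation. In that case no exception is raised, and only a byte-accurate tool such as cuda-memcheck would be expected to flag it.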
When running under the control of cuda-memcheck or cuda-gdb, memory allocations likely occur at different absolute addresses than when running without these tools, which can cause the different symptoms you observe: either the out-of-bounds accesses fall outside the memory allocations, triggering an exception, or they fall inside the memory allocations (rounded up to the next page boundary), causing the code to malfunction (“gets stuck”).
As to why the code works fine on a K40 but not on an RTX 2080, there are several plausible hypotheses I can offer off the top of my head. One is that there has always been a latent out-of-bounds issue in this code, but things “happened to work” / “you got lucky” on the K40 because architectural differences relative to Turing (different instruction sequences, potential differences in memory allocation) made the effects benign there. A second hypothesis is that there is an architecture-specific code-generation issue that negatively affects Turing, i.e. a compiler bug. A third hypothesis is that the code uses some idiom that does not strictly conform to the CUDA programming model, one that “happened to work” in the past but no longer does due to changes in the latest GPU architectures, and that the observable out-of-bounds accesses are a consequence (follow-on error) of that non-conformity.
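As an illustration of the third hypothesis, one well-known idiom of this kind is implicit warp-synchronous programming, which may or may not be present in your code. It assumes the threads of a warp execute in lockstep, an assumption that Volta and later architectures (including Turing) no longer satisfy because of independent thread scheduling. The sketch below contrasts the legacy form with a version that synchronizes explicitly via `__syncwarp()`; all function names and the fixed block size of 64 are assumptions made for this example.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64   // assumed block size for this sketch

// Legacy idiom: relies on lockstep execution of the warp, with no explicit
// synchronization between steps. Shown for comparison only; not called below.
__device__ void warp_reduce_legacy(volatile float *s, unsigned tid)
{
    s[tid] += s[tid + 32];
    s[tid] += s[tid + 16];
    s[tid] += s[tid + 8];
    s[tid] += s[tid + 4];
    s[tid] += s[tid + 2];
    s[tid] += s[tid + 1];
}

// Conforming variant: reads and writes are separated by __syncwarp(), so no
// thread overwrites a value that another thread has yet to read.
__device__ void warp_reduce(float *s, unsigned tid)
{
    float v = s[tid];
    for (unsigned offset = 32; offset > 0; offset >>= 1) {
        v += s[tid + offset];   // read the partner's partial sum
        __syncwarp();
        s[tid] = v;             // publish the updated partial sum
        __syncwarp();
    }
}

// Sums BLOCK floats with one thread block; the result ends up in *out.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float s[BLOCK];
    unsigned tid = threadIdx.x;
    s[tid] = in[tid];
    __syncthreads();
    if (tid < 32) warp_reduce(s, tid);   // the whole first warp participates
    if (tid == 0) *out = s[0];
}

// Host helper: copies BLOCK inputs to the device and returns their sum.
float sum64(const float *h_in)
{
    float *d_in = nullptr, *d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_in, BLOCK * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, BLOCK * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<1, BLOCK>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

If the legacy form is used to compute something that later serves as an index, stale values could plausibly lead to out-of-bounds accesses as a follow-on error. This is only one candidate idiom to look for during review, not a diagnosis.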
If this were my code, I would do a detailed code review of the kernel in question, looking for non-conformance with the CUDA programming model, machine-specific tricks, off-by-one indexing errors, and generally anything that looks suspicious, questionable, or “fishy”. I would use code bisection and code simplification to try to narrow down the proximate source of the out-of-bounds accesses. To look into the hypothesis of a compiler bug, I would try reducing the machine-specific optimizations with -Xptxas -O{3|2|1|0}, where the compiler defaults to -Xptxas -O3. Note that even if the issue disappears at lower PTXAS optimization levels, this is not yet proof of a compiler bug, since a change in optimization level can just as well hide a latent bug in the source code. One would have to do a detailed back-annotation and review of the generated machine code to have a definite indication of a bug, which is a task that requires skill, practice, and patience.
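One way to support the bisection is to temporarily add device-side assertions on computed indices, so that the first bad index is reported with file and line rather than showing up later as a hang or an exception. The following is a minimal sketch under assumed names (`gather`, `gather_checked`, and the index array `map` are hypothetical); it is not meant as a replacement for cuda-memcheck.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Example build lines for experimenting with PTXAS optimization levels
// (the file name is a placeholder):
//   nvcc -lineinfo -Xptxas -O3 gather.cu    (default PTXAS level)
//   nvcc -lineinfo -Xptxas -O0 gather.cu    (machine-specific optimizations off)
// Compiling with -DNDEBUG disables the device-side assert below.

// Hypothetical gather kernel: out[i] = in[map[i]], with the index checked.
__global__ void gather(const float *in, const int *map, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int src = map[i];
    assert(src >= 0 && src < n);   // fires before the out-of-bounds read
    out[i] = in[src];
}

// Host helper: runs the kernel and returns the first error encountered.
cudaError_t gather_checked(const float *h_in, const int *h_map,
                           float *h_out, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    int *d_map = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_map, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_map, h_map, n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    gather<<<blocks, threads>>>(d_in, d_map, d_out, n);

    cudaError_t err = cudaGetLastError();                   // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "gather failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_map);
    return err;
}
```

A failed device-side assertion is reported by the runtime as `cudaErrorAssert` at the next synchronization point, which makes it straightforward to move the check around while simplifying the kernel.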