I have also run into this issue: kernel replay fails to profile when the application uses a large amount of device memory. I had to fall back to application replay, but it is too slow, and the runs can mismatch from one replay to the next. I hope NCU can address this in the future so that kernel replay works properly in scenarios with large memory usage.
Thanks.