How to get OS kernel call stack using Nsight Systems?

Thank you for your proposal. We looked into this feature previously, and although it seems relatively straightforward to implement, we currently have no plans for it. At the moment, we show only one level of the kernel-space stack: the exact function hit by the sampling mechanism.

Can you please explain why you need to see the kernel-space backtrace in this case?

Linux perf should be able to collect both kernel-space and user-space call stacks. Beware, though, that NSys and Linux perf use different sampling logic: our algorithm is more biased toward sched-out events. If the profiled application goes through many wake-and-sleep cycles (common when code uses sched_yield), the sampling results will attribute a higher percentage of samples to the functions that cause the thread to block.
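To see this difference in practice, one option is to profile a deliberately yield-heavy workload with both tools and compare where the samples land. The script below is a hypothetical example of such a workload (the script and function names are illustrative assumptions, not part of either tool). It could be recorded with, for example, `perf record -g python3 yield_demo.py` followed by `perf report`, where `-g` asks perf to record call graphs. Whether kernel frames actually show up can depend on system settings such as `kernel.perf_event_paranoid`.

```python
# Hypothetical demo workload (illustration only, not from the Nsight Systems docs):
# several threads call os.sched_yield() in a tight loop, so the scheduler sees
# many sched-out events. os.sched_yield is available on Linux/Unix only.
import os
import threading


def yield_loop(iterations: int, counts: list, index: int) -> None:
    """Yield the CPU `iterations` times and record how many yields were done."""
    done = 0
    for _ in range(iterations):
        os.sched_yield()  # ask the kernel to run another runnable thread, if any
        done += 1
    counts[index] = done


def run_demo(num_threads: int = 4, iterations: int = 100_000) -> int:
    """Start the yielding threads, wait for them, and return the total yield count."""
    counts = [0] * num_threads
    threads = [
        threading.Thread(target=yield_loop, args=(iterations, counts, i))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)


if __name__ == "__main__":
    print("total yields:", run_demo())
```

Because each thread gives up the CPU on every iteration, a sampler biased toward sched-out events would be expected to attribute a noticeably larger share of samples to the yield path than a purely time-based sampler would.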