Kernel order changes when using different metrics

Hi,
I see that kernel IDs change when I use different metrics. The version I am using is 2022.2. The program is based on Torch and I can attach the code, but let's first see whether this kind of behavior is normal.

I used two commands as below:

~/NVIDIA-Nsight-Compute-2022.2/nv-nsight-cu-cli --kill on -c 300 python3 main.py
~/NVIDIA-Nsight-Compute-2022.2/nv-nsight-cu-cli --kill on -c 300 --metrics smsp__inst_executed.sum python3 main.py

I have attached the outputs, and the odd thing is that with the first command I see this order:

==PROF== Profiling "GRU_elementWise_fp" - 200 (201/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 201 (202/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 202 (203/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 203 (204/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 204 (205/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 205 (206/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 206 (207/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 207 (208/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 208 (209/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 209 (210/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 210 (211/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 211 (212/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 212 (213/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 213 (214/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 214 (215/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 215 (216/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 216 (217/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 217 (218/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 218 (219/300): 0%....50%....100% - 9 passes
==PROF== Profiling "Kernel" - 219 (220/300): 0%....50%....100% - 9 passes
==PROF== Profiling "GRU_elementWise_fp" - 220 (221/300): 0%....50%....100% - 9 passes
==PROF== Profiling "CatArrayBatchedCopy" - 221 (222/300): 0%....50%....100% - 9 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 222 (223/300): 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_sgemm_32x32_sliced1x4_tn" - 223 (224/300): 0%....50%....100% - 9 passes


and the second run:

==PROF== Profiling "GRU_elementWise_fp" - 200 (201/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 201 (202/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 202 (203/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 203 (204/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 204 (205/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 205 (206/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 206 (207/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 207 (208/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 208 (209/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 209 (210/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 210 (211/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 211 (212/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 212 (213/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 213 (214/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 214 (215/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 215 (216/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 216 (217/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 217 (218/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 218 (219/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 219 (220/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 220 (221/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 221 (222/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 222 (223/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 223 (224/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 224 (225/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 225 (226/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 226 (227/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 227 (228/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 228 (229/300): 0%....50%....100% - 1 pass
==PROF== Profiling "Kernel" - 229 (230/300): 0%....50%....100% - 1 pass
==PROF== Profiling "GRU_elementWise_fp" - 230 (231/300): 0%....50%....100% - 1 pass
==PROF== Profiling "CatArrayBatchedCopy" - 231 (232/300): 0%....50%....100% - 1 pass
==PROF== Profiling "unrolled_elementwise_kernel" - 232 (233/300): 0%....50%....100% - 1 pass
==PROF== Profiling "ampere_sgemm_32x32_sliced1x4_tn" - 233 (234/300): 0%....50%....100% - 1 pass

As you can see, kernel 220 is GRU_elementWise_fp in both outputs, but after that the order starts to diverge: CatArrayBatchedCopy is the 221st kernel when no metrics are specified (even though that run uses 9 passes per kernel), while the same kernel is the 231st with the second command.

What do you think about that?
nsight_22.2.txt (1.8 MB)
nsight_22.2_inst.txt (245.2 KB)

I can see in the results that there are multiple streams in the application. For example, see here:

void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor, at::detail::Array<char *, (int)1>>(int, T2, T3), 2023-Apr-06 16:56:13, Context 1, Stream 7

void transpose_readWrite_alignment_kernel<float, float, (int)1, (bool)0, (int)6, (int)5, (int)3>(cublasTransposeParams, const T1 *, T1 *, const T2 *), 2023-Apr-06 16:56:27, Context 1, Stream 25

Within a stream, kernel order is guaranteed, but between streams there is no guarantee without explicit synchronization. It's very possible that multiple streams cause the overall order of executed kernels to change between runs, and when the profiling overhead changes, for example with more passes per kernel, the scheduling can shift even further.
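To make that concrete, here is a minimal PyTorch sketch (not taken from your code; the streams and tensors are placeholders) of two kernels enqueued on independent streams with no ordering between them:

```python
import torch

# Two independent CUDA streams: work submitted to s1 and s2 has no
# ordering relationship unless explicit synchronization is added.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(s1):
    c = a @ b      # matmul kernel enqueued on s1

with torch.cuda.stream(s2):
    d = a + b      # elementwise kernel enqueued on s2

# Either kernel may start first, so a profiler that numbers kernels in
# the order it sees them can assign different IDs in different runs.
torch.cuda.synchronize()
```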

What do you mean by "explicit synchronization"? Is that a command-line option or something inside the program's code?
This behavior causes some of the problems I asked about earlier in this post: the kernel ID that Nsight Compute assigns may not match the one from NVBit, or even from Nsight Compute itself with a different set of metrics.

There are various ways to synchronize, for example with APIs like cudaDeviceSynchronize or with events. See slide 12 here: https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
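Since your program is Torch-based, the same ideas map onto the torch.cuda APIs. A rough sketch (the stream and tensor names are placeholders, not from your code):

```python
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
a = torch.randn(1024, 1024, device="cuda")

# Option 1: device-wide barrier, like cudaDeviceSynchronize.
# The host blocks until all streams on the device have finished.
with torch.cuda.stream(s1):
    b = a @ a
torch.cuda.synchronize()

# Option 2: event-based ordering between two streams, similar to
# cudaEventRecord + cudaStreamWaitEvent.
done = torch.cuda.Event()
with torch.cuda.stream(s1):
    c = a @ a
    done.record(s1)        # mark the point where s1's work completes
with torch.cuda.stream(s2):
    s2.wait_event(done)    # s2 will not run the next kernel until s1 is done
    d = c + 1
```

Keep in mind that adding synchronization like this constrains the concurrency the application originally had, which can change its performance.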

Nsight Compute can only assign Kernel IDs as a monotonically increasing number, in the order kernels are encountered. If you use a single stream, the IDs should be repeatable as long as the execution is deterministic. Once multiple streams are used without explicit synchronization, the kernels can occur in different orders, which can change the IDs. Using different metrics with only a single deterministic stream shouldn't change the IDs, in theory. Let me know if you see otherwise in practice and we can investigate further.

Hi,
I would like to ask how feasible it is to make profiling deterministic. I am sure you have encountered this behavior as well and agree that determinism is crucial. Speaking of Nsight Compute, if the kernel order across streams changes from 1,2,3 to 1,3,2 between two profiling runs, analyzing the results becomes much more challenging.

Using explicit synchronization inside the code is one approach, but that may change the program's original performance (even if the logic is unaffected). I am specifically wondering about this problem with the MLPerf codes. Since they are benchmarks, modifying them is generally not simple, and modifying the code is not widely accepted in any case.

I understand the issue with adding synchronization and modifying the code. That can definitely change the behavior and impact performance.

I think the bigger question is: if the application is not deterministic, how can we expect profiling to be deterministic? The profile represents the behavior of that particular run, and if the behavior changes between runs, the profile will change. For which specific characteristic(s) are you looking for determinism? Stream scheduling won't be deterministic without synchronization; that's a characteristic of CUDA.

Having said that, if what you're looking for is a way to compare "the same" kernel across runs, even if they have different IDs, that should be doable using the Stream ID plus the Global ID (see image). Each stream will have the same order of kernels as long as they were enqueued in the same order, so you can sort by Global ID within a stream to get its 1st, 2nd, 3rd, etc. kernel. Then you could, for example, compare the 3rd kernel from Stream 7 in one run with the 3rd kernel from Stream 7 (or a different stream, if stream IDs change) in another run. Is that what you are trying to do? Would having kernel names in the format "[StreamID, ID]kernelName" solve the problem? Or some local ID per stream?

image
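If it helps, here is a post-processing sketch of that idea: export the raw page of both reports to CSV and align kernels by stream plus per-stream position. The column names ID, Stream, and Kernel Name below are assumptions; adjust them to whatever your export actually uses.

```python
import csv
from collections import defaultdict

def per_stream_order(csv_path):
    """Map (stream, local index in that stream) -> (global ID, kernel name)."""
    per_stream = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Column names are assumptions; adjust to your export.
            per_stream[row["Stream"]].append((int(row["ID"]), row["Kernel Name"]))
    keyed = {}
    for stream, launches in per_stream.items():
        launches.sort()  # order by global ID within the stream
        for local_id, (global_id, name) in enumerate(launches):
            keyed[(stream, local_id)] = (global_id, name)
    return keyed

run1 = per_stream_order("run1_raw.csv")
run2 = per_stream_order("run2_raw.csv")

# The same (stream, local index) should name the same kernel in both runs,
# even when the global IDs differ.
for key in sorted(set(run1) & set(run2)):
    id1, name1 = run1[key]
    id2, name2 = run2[key]
    print(key, id1, id2, name1 == name2, name1)
```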

Thanks for the idea.

In the raw file, I see the following information:
ID from 0 to N
Context = launch__context_id
Stream = launch__stream_id

That means the numbers in the Stream column are the same as launch__stream_id.
I was wondering which one you are referring to by "Stream ID plus the Global ID"? In the figure, you highlighted launch__stream_id.

In the figure below, ID numbers 0-66 belong to stream #7 while IDs 67 and 71 belong to stream #25, and so on.
image

As an example, with the proposed method, a tuple like [stream, ID, name] gives [7,66,at::native] and [25,67,transpose]. Suppose IDs 66 and 67 are swapped in the second run, but I don't know that.

Now, if I use [7,66,at::native] (based on the first run) to analyze at::native with more metrics, I am not able to catch that kernel in the second run.

So I think a "local ID per stream" is what I am looking for. I may be able to do the post-processing manually, reading the raw file and adding another column for my own purpose, for example something like the sketch below.
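(Just a sketch; I'm assuming the CSV export of the raw file has ID and Stream columns as shown above, so the column names may need adjusting.)

```python
import csv
from collections import defaultdict

def add_local_ids(in_path, out_path):
    """Copy the raw CSV, appending a per-stream 'Local ID' column."""
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))

    rows.sort(key=lambda r: int(r["ID"]))   # global launch order; column names assumed
    counters = defaultdict(int)
    for row in rows:
        row["Local ID"] = counters[row["Stream"]]
        counters[row["Stream"]] += 1

    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

add_local_ids("run1_raw.csv", "run1_with_local_ids.csv")
```

The tuple [Stream, Local ID, Kernel Name] should then stay stable across runs, as long as each stream enqueues its kernels in the same order.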

If you know a better approach, I would appreciate it if you could share it.