How to access (or compute) block durations and warp durations from raw data?


I noticed that there are histograms of block and warp durations under the Details View, Launch Statistics in Nsight Compute. I’m hoping to access the raw data of these durations for my research. But I didn’t find it in the Raw View. Is there a way to get such data or to compute these results?



Can anyone from NVIDIA please answer this question? Thanks!

Hi Ming,

to answer your question in short: there is currently no way to retrieve this data other than parsing the report file yourself; there is some information on how to do this here. Using the Python rule system to read this data would be an alternative, easier approach, but we found that there is a bug with respect to these metrics, due to which that is currently not possible.

We will look into fixing this problem in a future release of Nsight Compute.

As background info:

The metrics you are looking for are called sass__block_histogram and sass__warp_histogram. This can be seen in the file LaunchStatistics.section within the sections directory of the Nsight Compute installation. .section files define what is collected for each kernel launch and how the data is shown in the report. You can refer to the Nsight Compute documentation for more details on this.
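As an illustration of where these names live, a .section file is a text file that declares the metrics a section collects. Here is a minimal sketch of pulling metric names out of such a file with a regex; note that the embedded snippet is a hypothetical excerpt in the assumed protobuf-text style, not the literal contents of LaunchStatistics.section:

```python
import re

# Hypothetical excerpt in the protobuf text format used by .section files;
# the real LaunchStatistics.section may differ in detail.
section_text = '''
Metrics {
  Metrics {
    Label: "Block Duration Histogram"
    Name: "sass__block_histogram"
  }
  Metrics {
    Label: "Warp Duration Histogram"
    Name: "sass__warp_histogram"
  }
}
'''

# Collect every metric name declared in the section.
metric_names = re.findall(r'Name:\s*"([^"]+)"', section_text)
print(metric_names)
```

Scanning the installed .section files this way is a quick method to discover which metric names a given report section is built from.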

Those metrics are called “instanced metrics”, since they contain values for multiple instances of the represented domain (in this case warp/block runtime bins). Since they also happen to contain a non-instanced value in this case (the sum of all per-bin counts), the instanced values are not shown on the Raw page.
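To make the "instanced metric" idea concrete, here is a small standalone sketch with made-up bin data (plain Python lists, not real NvRules objects): the metric carries one non-instanced value, the sum of all per-bin counts, plus one instanced value per runtime bin, and useful statistics can be derived from the bins:

```python
# Hypothetical per-bin counts for a warp-runtime histogram: bin i counts
# how many warps fell into the i-th duration range (made-up numbers).
bin_counts = [0, 3, 12, 20, 9, 4, 0, 0]

# The non-instanced value reported alongside the instances is the sum of
# all per-bin counts; the Raw page shows only this sum.
total = sum(bin_counts)

# With assumed bin edges (in cycles) you could estimate a mean duration
# from the bin midpoints.
bin_edges = [0, 100, 200, 300, 400, 500, 600, 700, 800]
midpoints = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
mean_duration = sum(c * m for c, m in zip(bin_counts, midpoints)) / total

print(total)  # 48
print(mean_duration)
```

The bin edges here are invented purely for illustration; the actual ranges behind each histogram bin come from the profiled kernel.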

Thanks for your detailed answer. That’s very helpful! To clarify, is it possible to get the “instanced values” of the per-bin counts by parsing the report file at this moment? Or is it because of the bug you referred to that this currently cannot be done?

I found that sass__block_histogram and sass__warp_histogram are not available on GV100. This is actually also documented in the known issues section of the Nsight Compute documentation. Will these metrics be provided in the future?

I really look forward to Nsight Compute catching up with nvprof, which reports instanced metrics nicely. Unfortunately, I don’t think nvprof provides block/warp histograms.


I took your suggestion and attempted to read the data through the rules system. It would be great if the following issue could be fixed; it’s probably the bug you referred to? I really need these metrics for my research. Please let me know if I can help test or anything.

The instanced values all seem to have been overwritten with zero. I added the following code to the apply() function.

block_hist = action.metric_by_name("sass__block_histogram")
num_instances = block_hist.num_instances()
values = [block_hist.as_double(i) for i in range(num_instances)]
print(num_instances, values)

The output is 47 [0.0, 0.0, …, 0.0] for a given profile, while the histogram in the UI displays non-zero data. The number of bins equals num_instances, but the contents are all wiped to zero.


Yes, that is the bug I was referring to, and it will be fixed in a future release. We are also looking into providing more details on how to parse the report file itself, e.g. by means of sample code.

Hi Felix,

I managed to parse the report file and got the block/warp histograms. But Nsight Compute profiling does not seem to work with MPS on Pascal; it works on Volta.

Also, will sass__block(warp)_histogram be supported on NVIDIA in a future release?


I assume your question is if those metrics will be supported on GV100 and newer architectures in a future release? The answer to that is yes, we are planning on supporting those.

What exact Pascal GPU are you using?

Oh right, that is a typo. I meant GV100. Thanks for answering that!

The Pascal GPU I used is a GTX 1070. I read that GP10x is supported. By the way, Nsight Compute profiles fine without MPS; it’s when MPS is turned on that Nsight Compute reports that profiling is not supported.

It’s GTX 1070. Any clue why Nsight Compute doesn’t work with MPS turned on for this GPU?

I do not currently know why exactly it wouldn’t work with MPS, but we will check this internally and update here once we have more information.

Thanks for looking at this and please let me know if any information is needed to reproduce this issue.

Meanwhile, a quick question: does the Nsight Compute profiler work with concurrent kernel execution? I tried it with two kernels that should be able to execute concurrently, but it seems that Nsight Compute only profiles one kernel at a time. In other words, is concurrency disabled during profiling?

If so, is there any other way to get block durations (other than reading the global clock) for concurrent kernel execution?
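For the global-clock fallback mentioned above, the usual approach is to have thread 0 of each block record clock64() at block entry and exit into a device buffer, copy the buffer back, and compute durations on the host. A minimal sketch of the host-side post-processing, with made-up clock samples standing in for the copied-back buffers:

```python
# Made-up (start, end) clock64() samples per block, as copied back from a
# device buffer the kernel filled (hypothetical values, in cycles).
block_start = [1000, 1020, 1500, 1510]
block_end   = [4000, 3900, 5200, 5000]

# Per-block duration in cycles. Note: clock64() counts per-SM cycles, so
# comparing timestamps across blocks on different SMs is only approximate.
durations = [end - start for start, end in zip(block_start, block_end)]
print(durations)  # [3000, 2880, 3700, 3490]
```

Dividing each duration by the SM clock rate would convert cycles to time; the per-SM caveat is why the profiler's own histogram metrics are preferable when they are available.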



Yes, the current version of Nsight Compute serializes the kernels and profiles them one after the other.

Hi Ming,


Note that the issue of reading the data using the python rule system has been fixed in Nsight Compute version 2019.5 (which is part of the CUDA Toolkit 10.2).

That’s great! Thanks for the reply. Does Nsight Compute work with concurrent kernel execution and MPS now?