nvtxNameCudaStreamA off-by-one bug

It seems like nvtxNameCudaStreamA in CUDA 11.3.0 has an off-by-one bug of sorts. I have an application with three “compute” streams and one “data transfer” stream. I name the three compute streams “compute 1”, “compute 2”, and “compute 3”, and the data transfer stream “data stream”. When I profile the app with nsys and open the result in Nsight, I see the kernel activity on the three compute streams and the data transfer on the data transfer stream as expected, but compute stream 1 is labeled with a default name, compute stream 2 is labeled “compute 1”, compute stream 3 is labeled “compute 2”, and the data transfer stream is labeled “compute 3”. No stream is labeled “data stream”.

When I export to HDF5, I can see that the stream names are all present in the “TARGET_INFO_NVTX_CUDA_STREAM” dataset, but the corresponding “streamId” fields are all one higher than the values used in the “CUPTI_ACTIVITY_KIND_KERNEL” and “CUPTI_ACTIVITY_KIND_MEMCPY” datasets. For example, the kernels on the compute streams have streamId values 14, 15, and 16, and the memcpy on the data transfer stream has streamId value 17, but the four named streams in TARGET_INFO_NVTX_CUDA_STREAM have stream IDs 15 (compute 1), 16 (compute 2), 17 (compute 3), and 18 (data stream).

I don’t see how I could possibly manage to get things “off-by-one” since I’m only working with cudaStream_t types. Has anyone else noticed this problem?
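To make the mismatch concrete, here is a minimal C++ sketch that joins the two sets of stream IDs reported above the way a viewer presumably would. It only models the lookup. The helper names (`labelFor`, `labelsForActivity`) and the `"(default name)"` placeholder are hypothetical, and nothing here calls NVTX, CUPTI, or Nsight Systems.

```cpp
#include <map>
#include <string>
#include <vector>

// Stream IDs exactly as reported in the post above.
// Names recorded for the NVTX-named streams (TARGET_INFO_NVTX_CUDA_STREAM).
const std::map<int, std::string> kNvtxNames = {
    {15, "compute 1"}, {16, "compute 2"}, {17, "compute 3"}, {18, "data stream"}};

// IDs attached to the kernel (14, 15, 16) and memcpy (17) activity records.
const std::vector<int> kActivityStreamIds = {14, 15, 16, 17};

// Hypothetical helper: the label shown for a stream ID, falling back to a
// placeholder when no name was registered under that ID.
std::string labelFor(int streamId, const std::map<int, std::string>& names) {
    auto it = names.find(streamId);
    return it != names.end() ? it->second : std::string("(default name)");
}

// Labels for each activity stream, in the order of kActivityStreamIds.
std::vector<std::string> labelsForActivity() {
    std::vector<std::string> labels;
    for (int id : kActivityStreamIds) {
        labels.push_back(labelFor(id, kNvtxNames));
    }
    return labels;
}
```

With the reported IDs, `labelsForActivity()` gives a placeholder for stream 14, then “compute 1”, “compute 2”, and “compute 3” for streams 15 to 17, and ID 18 (“data stream”) never matches any activity. That is the same labeling seen in the timeline.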


Hi David, I just encountered the same problem with CUDA 11.2.
Have you managed to fix this?

I was thinking of using a script to decrease the stream IDs in TARGET_INFO_NVTX_CUDA_STREAM by one as a temporary fix, but maybe you found a better solution?
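As a rough illustration of that correction, the sketch below shifts the stream IDs of an in-memory copy of the name table by a fixed offset. It makes several assumptions: the `NamedStream` struct and both function names are hypothetical, the offset of -1 rests only on the IDs reported in this thread, and reading or writing the rows of an actual export is left out entirely.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory copy of one row of the exported name table.
struct NamedStream {
    int streamId;
    std::string name;
};

// Return a copy of the table with every stream ID moved by `offset`
// (-1 for the correction proposed above).
std::vector<NamedStream> shiftStreamIds(std::vector<NamedStream> rows, int offset) {
    for (auto& row : rows) {
        row.streamId += offset;
    }
    return rows;
}

// Rows as reported in the first post of this thread.
std::vector<NamedStream> reportedRows() {
    return {{15, "compute 1"}, {16, "compute 2"}, {17, "compute 3"}, {18, "data stream"}};
}
```

Calling `shiftStreamIds(reportedRows(), -1)` moves the names to IDs 14 through 17, which are the IDs the kernel and memcpy records use in the first post.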

So after a few tries, it seems that modifying the qdrep file directly corrupts it, with an error message complaining about the function QuadDCommon::StreamSectionsManager::readSectionTableOffset.

I then tried patching this function in libStreamSections.so at the assembly level, hoping to reduce the stream IDs by one, but without success so far. I do not have time to explore this further right now, but this could be a solution…

Btw, I also checked nvvp, and nvtxNameCudaStreamA works as expected with the deprecated profiler.

I didn’t find a fix or workaround so I basically abandoned the concept of naming streams. I suppose you might be able to create a “dummy” stream and use it when naming streams so that when they get off-by-one the names will be on the “real” streams as you really want (but I haven’t actually tried this).

Thanks for your reply! I had the same idea, but I would prefer not to modify my CUDA application, to avoid relying on buggy behaviour. I think I will also give up on this for now, until a fix comes in a future release, but if I do have time to explore a little more I’ll keep this thread updated.
I also filed a bug report against Nsight Systems (since the feature works in NVVP), so hopefully this will be fixed soon.

Hi David and Julien,

Sorry for the slow response – this was just brought to my attention. I tried to reproduce this with Nsight Systems, and it seems to work fine with the most recent build. I get this result:

…from this code:

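    cudaStream_t stream1, stream2, stream3;

    // Create the streams before naming them (creation was not shown in the posted snippet).
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);

    // Attach an NVTX name to each stream handle.
    nvtxNameCudaStreamA(stream1, "Stream 1");
    nvtxNameCudaStreamA(stream2, "Stream 2");
    nvtxNameCudaStreamA(stream3, "Stream 3");

    const int threadsPerBlock = 1;
    const int blocks = 1;

    // Launch each kernel on its own stream, ten times (Kernel1..Kernel3 are defined elsewhere).
    for (int i = 0; i < 10; ++i) {
        Kernel1<<<blocks, threadsPerBlock, 0, stream1>>>(nullptr);
        Kernel2<<<blocks, threadsPerBlock, 0, stream2>>>(nullptr);
        Kernel3<<<blocks, threadsPerBlock, 0, stream3>>>(nullptr);
    }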


Can you try with the newest Nsight Systems and let me know if you’re still seeing this problem?

Hi Jason,

Thanks for your reply. I do not currently have time to go through the update process, but I’ll let you know if I do in the future. If I remember correctly, I had the same problem on an MWE, so the update should be enough to fix the issue.

Best regards,