Using Nsight Systems to profile a CUDA program in a Kubernetes pod

Last week I tried to use Nsight Systems to profile my CUDA program in a Kubernetes pod, but nsys did not generate a report when my program finished. Instead, nsys logged this error:

Connection to Agent lost. This is most likely a bug. Internal reason: ‘End of file’. Please refer to the troubleshooting section of the docs: User Guide — nsight-systems

I’m using CUDA driver 570 and CUDA Toolkit 12.8. I downloaded the Nsight Systems CLI 2025.5.1.121 and installed it in the pod manually. I asked an LLM how to fix the problem, and here are the solutions I tried.

  1. Add pod capabilities: SYS_PTRACE, SYS_ADMIN, IPC_LOCK. Didn’t work.

            securityContext:
              runAsUser: 0
              runAsGroup: 0
              allowPrivilegeEscalation: true
              capabilities:
                add:
                  - SYS_PTRACE
                  - SYS_ADMIN
                  - IPC_LOCK
    
  2. Enlarge /dev/shm to 16 GiB. Didn’t work.

            - mountPath: /dev/shm
              name: dshm
    
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 16Gi
    
  3. Enable hostPID. This fixed the “End of file” error, but nsys then logged a new error about processing the QDSTRM file. The QDSTRM file was generated, but the .nsys-rep file was not.

          hostPID: true          # Allow access to the host PID namespace
    

    Importer error status: An unknown error occurred. Unable to retrieve the importer version: skipping importation of the QDSTRM file.

  4. According to User Guide — nsight-systems , if the .nsys-rep file is not generated, we can use QdstrmImporter to convert the QDSTRM file to an .nsys-rep file. QdstrmImporter is in the Host-x86_64 directory. When I ran it, it reported that libcap2.so could not be found, so I copied libcap2.so from Target-x86_64 to Host-x86_64. After that, the nsys profile command ran successfully.
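
For anyone repeating step 4, here is a minimal sketch of the manual workaround as a shell function. The function name `import_qdstrm` and its two arguments are hypothetical, and the install root is whatever directory contains Host-x86_64 and Target-x86_64 on your system. It copies any libcap file found in the target directory into the host directory (without overwriting existing files) and then runs QdstrmImporter with `--input-file`. Treat it as an illustration of the steps described above, not as an official fix.

```shell
# Hypothetical helper: import_qdstrm <nsys-install-root> <file.qdstrm>
# Assumes the install root contains Host-x86_64 and Target-x86_64,
# as described in step 4; adjust the names if your layout differs.
import_qdstrm() {
  nsys_root="$1"
  qdstrm="$2"
  host_dir="$nsys_root/Host-x86_64"
  target_dir="$nsys_root/Target-x86_64"

  # Copy libcap files that exist in the target directory but are
  # missing from the host directory; never overwrite existing files.
  for lib in "$target_dir"/libcap*; do
    [ -e "$lib" ] || continue
    dest="$host_dir/$(basename "$lib")"
    [ -e "$dest" ] || cp -a "$lib" "$dest"
  done

  # Convert the QDSTRM file to an .nsys-rep report.
  "$host_dir/QdstrmImporter" --input-file "$qdstrm"
}
```

Example call, with placeholder paths: `import_qdstrm /path/to/nsight-systems-cli /tmp/report1.qdstrm`.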

I’m recording my debugging experience here in case someone else runs into the same problem.

@mhallock can you help with this?

Hi @jianglei1212 , thanks for sharing your experience. I have a few questions so we can figure out what is happening in your environment.

How have you installed nsys into the container? The file libcap2.so should have already been present in both the host and target directories. Can nsys operate correctly in the container on its own? Here is a simple demonstration using the Nsight Systems container on NGC:

$ kubectl run -i nsys-test --rm --image nvcr.io/nvidia/devtools/nsight-systems-cli:2025.3.1-base-ubuntu22.04 -- nsys profile sleep 5

If you don't see a command prompt, try pressing enter.
Collecting data...
Generating '/tmp/nsys-report-43e7.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:
	/report1.nsys-rep
pod "nsys-test" deleted

It would be interesting to test that using your container and the NGC one, as I did. That should help us identify if there is anything amiss about nsys installed within your container.

If both of these fail for you because hostPID is needed to make things work, you should still be able to verify your container’s basic functioning by using docker to run the same command on another machine outside of the k8s cluster.
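
A rough sketch of that docker check, wrapped in a hypothetical function `nsys_smoke_test` that takes the image name as its only argument. It assumes docker is installed on the other machine and that the image accepts `nsys profile sleep 5` as its command, as in the kubectl example above.

```shell
# Hypothetical helper: nsys_smoke_test <image>
# Runs the same short profiling session as the kubectl example,
# but directly under docker, outside the k8s cluster.
nsys_smoke_test() {
  image="$1"
  docker run --rm "$image" nsys profile sleep 5
}
```

For example, `nsys_smoke_test nvcr.io/nvidia/devtools/nsight-systems-cli:2025.3.1-base-ubuntu22.04` tests the NGC image, and passing your own image name tests your container the same way.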

With respect to the agent “end of file” issue, I am unsure of exactly how that might occur. Which container runtime and CNI are you using in your k8s cluster?
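
If it helps, something like the following should surface both answers. The function name `show_runtime_and_cni` is hypothetical. The first command prints each node's reported container runtime version; the second lists DaemonSets in all namespaces, which is where CNI plugins are commonly (though not always) deployed, so the CNI may need to be identified by name from that list.

```shell
# Hypothetical helper: print each node's container runtime and list
# DaemonSets, which often (not always) include the CNI plugin.
show_runtime_and_cni() {
  # containerRuntimeVersion is part of each node's status.nodeInfo.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

  # Look for names that identify the CNI in this list.
  kubectl get daemonsets --all-namespaces
}
```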