'cuda HW' field is missing

Hi,
I’ve been using Nsight Systems to profile TensorRT and ONNX Python scripts in offline cases. It worked fine and showed “CUDA HW” and all its subfields.

But when I use the CLI to profile a vLLM Python script (in no-eager mode, where models should be captured as CUDA graphs) against the OpenAI-API backend, the output file looks like this: the “CUDA HW” row and all its subfields are missing. Also, under the “Threads” field there is no “CUDA API” subfield.

The offline and online tests use exactly the same nsys CLI command and parameters:

    sudo nsys profile \
        --gpu-metrics-devices=0 \
        --trace="cuda,nvtx" \
        --cuda-graph-trace="graph" \
        --cuda-memory-usage="true" \
        --output="path/to/myProfile" \
        --force-overwrite true \
        path/to/my/python/execution \
        "path/to/my/python/script.py"

The target info is: Rocky Linux | NVIDIA A100-SXM4-80GB | CUDA driver version >= 545 | CUDA version >= 12.4 | Nsight Systems 2024.7.1.

Thank you for your help in advance!

@liuyis can you respond to this.

Hi @0-0, was the application exiting gracefully on Linux? The CUDA trace feature holds a buffer within the application’s process(es), and if the application was forcibly killed, the buffer might not be flushed and CUDA trace data can be missing.

One thing you can try is adding the --duration=&lt;seconds&gt; option, set a little shorter than the application’s execution time. That lets the collection finish earlier and ensures the buffer is flushed.

Hi, thank you for your reply. I’ve tried the --duration= option, but the missing fields still don’t show.

My Python script (the script path in the nsys command) is as follows:

    import os
    from multiprocessing import Process

    def run_stress_test_client(port, script_path):
        python_executable = "path/to/my/python"
        client_command = [
            # script_path points to ONLINE_SCRIPT_PATH (see below)
            python_executable, script_path,
            "--backend", "openai-chat",
            "--base-url", f"http://localhost:{port}",
            "--endpoint", "/v1/chat/completions",
            ...(other params)
        ]
        # execv replaces the current (forked) process image with the client
        os.execv(python_executable, client_command)
        
    if __name__ == "__main__":
        processes = []
        # Each CUDA device corresponds to an online port; we send requests
        # to all of these ports at the same time
        for port in ["8000", "8001", "8002"...]:
            print(f"\n\t Running process for port: {port}\n")
            process = Process(
                target=run_stress_test_client,
                args=(port, script_path),
            )
            process.start()  
            processes.append(process)
        
        # wait until all processes have finished
        for process in processes:
            process.join()
        
        print("All processes completed.")
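
One detail worth noting in the script above: `os.execv` replaces the forked multiprocessing child with the client process, which skips multiprocessing’s normal cleanup on exit. A minimal alternative sketch using `subprocess` instead, which keeps the child alive as a supervising parent (the `-c` one-liner stands in for the real benchmark invocation):

```python
import subprocess
import sys

def run_client(port: int) -> str:
    # subprocess.run keeps this process alive as the client's parent,
    # so the client's exit status propagates back through Process.join().
    result = subprocess.run(
        [sys.executable, "-c", f"print('client on port {port}')"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Whether this affects the trace-buffer flushing is an open question here; it is simply a variant that preserves the normal parent/child exit path.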

The core content of “ONLINE_SCRIPT_PATH” in the above run_stress_test_client() function is as follows:

    import asyncio
    from typing import List

    async def send_requests(
        backend, url, port, model_id, input_requests, max_concurrency...
    ):
        tasks: List[asyncio.Task] = []
        async for request in request_list(...):
            request_func_input = RequestFuncInput(model, prompt, url, port...)
            tasks.append(
                asyncio.create_task(_request_func(...))
            )
        outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
        # process outputs...

    if __name__ == "__main__":
        benchmark_result = asyncio.run(
            send_requests(...)
        )
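
For reference, the fan-out pattern above boils down to something like this self-contained sketch, where `send_one` stands in for the real per-request coroutine (`_request_func`):

```python
import asyncio

async def send_one(request_id: int) -> str:
    # Stand-in for the real HTTP request coroutine
    await asyncio.sleep(0)
    return f"response-{request_id}"

async def send_requests(n: int) -> list:
    # Create all tasks up front, then wait for every response;
    # gather preserves the order in which tasks were created
    tasks = [asyncio.create_task(send_one(i)) for i in range(n)]
    return await asyncio.gather(*tasks)

results = asyncio.run(send_requests(3))
```

Note that the client process itself only does network I/O here; it never touches CUDA.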

I’ve found that if I don’t use asyncio, and instead modify the script called via the ‘multiprocessing’ library into a totally offline script that uses a local model, the problem is solved and all data fields show fine.

But I want to use Nsight on online serving cases. Could you give some suggestions on why this happens, and how to use Nsight correctly for online serving?

(I think my previous post may have caused some misunderstanding, since I wasn’t very clear about the source of the problem when I first wrote it, so I’ve edited the original post. Sorry for the trouble.)

Thank you again for your help.

By “online serving cases”, do you mean the model is running in a remote system or process? If that’s the case, Nsys cannot get CUDA activities from the remote system or process. The CUDA activities have to be generated by the target application you launched through Nsys, or any child process of it, in order for Nsys to collect them.

The suggestion is to use Nsys to profile the actual process (whether it is on a remote system or a different process on the same system) that runs the model and therefore makes the CUDA calls.
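
For example, assuming the server is started with vLLM’s OpenAI-compatible entrypoint, the launch could be wrapped roughly like this (the module path, flags, and output path are illustrative assumptions; check your vLLM version):

```python
import subprocess

# Hypothetical sketch: wrap the vLLM server launch in nsys so that the
# process actually issuing CUDA calls is the one being traced.
nsys_cmd = [
    "sudo", "nsys", "profile",
    "--trace=cuda,nvtx",
    "--cuda-graph-trace=graph",
    "--output=path/to/serverProfile",
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--port", "8000",
    # ... other server args ...
]
# subprocess.run(nsys_cmd, check=True)  # requires nsys and vLLM installed
```

The benchmark clients can then run unprofiled; they only generate HTTP traffic, while all CUDA activity lives in the server process tree.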

Hi, the model is indeed running in a separate process, on the GPU specified by --gpu-metrics-devices, within the same system. I’ll use Nsight to profile the server process. Thank you!