Allowing multiple processes to watch DCGM profiling fields

Hello.

I would like to allow multiple processes to query the DCGM host engine concurrently for the newly added profiling fields (e.g. sm_active, fp16_active, etc.). Currently nv-hostengine does not allow this because these fields must be exclusively watched, and it returns an error such as:
“Unable to watch profiling metrics for clientId 2588. Already watched by clientId 2246 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:892] [DcgmModuleProfiling::ProcessWatchFields]”

This happens even though each query targets a different GPU on a multi-GPU system. In my case, timing delay is acceptable, so even serialized handling of the queries would work.

My current workaround is a “try-wait-retry” scheme across the multiple processes: each process continuously watches and unwatches the fields so that several processes can connect to the host engine and query the metrics. When multiple processes try to connect and query at the same time, only one succeeds and the others retry after a small delay. This works for a while, but eventually it drives nv-hostengine into a deadlock state that I have not been able to debug.

I was wondering if someone could help with the following questions:

  1. Will concurrent queries to hostengine for the profiling fields be supported anytime soon?
  2. What is the proper/recommended workaround, considering that the concurrent queries are at very low frequency and can be serialized when contention happens?
  3. Is there an obvious bug with my current “try-wait-retry” approach, or could it be something the DCGM implementation did not consider?

Thank you.

Will concurrent queries to hostengine for the profiling fields be supported anytime soon?

We are aware of the issue. We are still prioritizing this against other DCGM features.

What is the proper/recommended workaround, considering that the concurrent queries are at very low frequency and can be serialized when contention happens?

What we normally recommend is that one client watch the metrics for all GPUs and all other clients observe them. Are you using DCGMI or the bindings?

In the dcgmi case, you could watch the metrics in one client like:
dcgmi dmon -e 1001,1002,1003,1005 -d 30000

That would cause samples to be collected for the above metrics every 30 seconds.

Then when you wanted to snapshot them, you could run:

dcgmi dmon -e 1001,1002,1003,1005 -c 1 --nowatch -i 0

That would snapshot GPU 0.

On the API side, you can do the same by calling dcgmWatchFields() from one client and dcgmGetLatestValuesForEntities() from another client.
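
For illustration, here is a rough C sketch of the watcher side of that pattern, mirroring the dcgmi example above. The field IDs, the localhost address, and the 30-second update interval are assumptions for the example rather than required values, and the exact signatures should be checked against the dcgm_agent.h header that ships with your DCGM version.

/* Watcher client (sketch): set up the profiling watches once and keep them
   alive. Verify all signatures against your installed dcgm_agent.h. */
#include <unistd.h>
#include "dcgm_agent.h"
#include "dcgm_structs.h"

int main(void)
{
    dcgmHandle_t handle;
    dcgmGpuGrp_t gpuGroup;
    dcgmFieldGrp_t fieldGroup;
    /* Same profiling field IDs as the dcgmi example above. */
    unsigned short fieldIds[] = { 1001, 1002, 1003, 1005 };
    char hostAddr[] = "127.0.0.1";   /* assumed standalone nv-hostengine address */
    char gpuGroupName[] = "allGpus";
    char fieldGroupName[] = "profFields";

    if (dcgmInit() != DCGM_ST_OK)
        return 1;
    if (dcgmConnect(hostAddr, &handle) != DCGM_ST_OK)
        return 1;

    /* The default group covers all GPUs on the node. */
    if (dcgmGroupCreate(handle, DCGM_GROUP_DEFAULT, gpuGroupName, &gpuGroup) != DCGM_ST_OK)
        return 1;
    if (dcgmFieldGroupCreate(handle, 4, fieldIds, fieldGroupName, &fieldGroup) != DCGM_ST_OK)
        return 1;

    /* Sample every 30 s (updateFreq is in microseconds), keep one hour of data. */
    if (dcgmWatchFields(handle, gpuGroup, fieldGroup, 30000000LL, 3600.0, 0) != DCGM_ST_OK)
        return 1;

    /* Keep this process, and therefore the watches, alive. */
    for (;;)
        sleep(60);
}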

Is there an obvious bug with my current “try-wait-retry” approach, or could it be something the DCGM implementation did not consider?

I would recommend keeping the profiling watches active at all times rather than only when you take a sample. Otherwise, your sample only represents the few ms between when the watch was set up and when we could get a first sample. The setup time for profiling watches is significant; you only want to pay it once at startup.
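
Along those lines, a minimal sketch of a reader client that only snapshots the values the long-running watcher above is already collecting could look like the following. It assumes the per-GPU dcgmGetLatestValuesForFields() call from dcgm_agent.h; the field IDs and GPU ID are placeholders, and the struct and field names should be double-checked against your headers.

/* Reader client (sketch): snapshot the latest cached values for one GPU
   without creating its own profiling watch. Check names against dcgm_agent.h. */
#include <stdio.h>
#include "dcgm_agent.h"
#include "dcgm_structs.h"
#include "dcgm_fields.h"

int main(void)
{
    dcgmHandle_t handle;
    char hostAddr[] = "127.0.0.1";                      /* assumed address */
    unsigned short fieldIds[] = { 1001, 1002, 1003, 1005 };
    dcgmFieldValue_v1 values[4];
    int gpuId = 0;                    /* the equivalent of "-i 0" in dcgmi */

    if (dcgmInit() != DCGM_ST_OK)
        return 1;
    if (dcgmConnect(hostAddr, &handle) != DCGM_ST_OK)
        return 1;

    /* Read whatever the watcher client last collected for these fields. */
    if (dcgmGetLatestValuesForFields(handle, gpuId, fieldIds, 4, values) != DCGM_ST_OK)
        return 1;

    for (int i = 0; i < 4; i++) {
        if (values[i].status == DCGM_ST_OK && values[i].fieldType == DCGM_FT_DOUBLE)
            printf("field %u = %f\n", (unsigned)values[i].fieldId, values[i].value.dbl);
    }
    return 0;
}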

I hope that helps.

Hi Brent,

Thank you so much for the response. I will try the recommendations and look forward to the newer releases!