I would like multiple processes to query the DCGM host engine concurrently for the newly added profiling fields (e.g. sm_active, fp16_active). nv-hostengine currently does not allow this because these fields must be watched exclusively by a single client. A second client that tries to watch them gets this error:
“Unable to watch profiling metrics for clientId 2588. Already watched by clientId 2246 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:892] [DcgmModuleProfiling::ProcessWatchFields]”
This happens even though each query targets a different GPU on a multi-GPU system. In my case added latency is acceptable, so even serialized handling of the queries would be fine.
My current workaround is a “try-wait-retry” scheme: each process watches the fields, reads the metrics, and unwatches them again, so that other processes get a chance to connect to the host engine and query. When several processes try to connect and query at the same time, only one succeeds and the others retry after a small delay. This works for a while, but eventually it brings nv-hostengine into a deadlocked state that cannot be debugged.
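To make the scheme concrete, here is a minimal sketch of its shape. It is not the actual client code and it does not call DCGM: `watch`, `read`, `unwatch`, and `WatchBusyError` are hypothetical placeholders for whatever the client bindings provide. The jittered exponential backoff is an assumption added for illustration; the description above only specifies "a small delay".

```python
import random
import time


class WatchBusyError(Exception):
    """Hypothetical error: another client currently holds the profiling watch."""


def query_with_retry(watch, read, unwatch, max_attempts=10,
                     base_delay=0.05, max_delay=2.0, sleep=time.sleep):
    """Watch, read once, and always unwatch; back off while the watch is busy.

    watch, read, and unwatch are caller-supplied callables (placeholders,
    not DCGM functions). watch() must raise WatchBusyError when another
    client already holds the exclusive watch.
    """
    for attempt in range(max_attempts):
        try:
            watch()
        except WatchBusyError:
            # Exponential backoff with full jitter, so that competing
            # processes do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
            continue
        try:
            return read()
        finally:
            # Release the exclusive watch even if read() raises.
            unwatch()
    raise TimeoutError(f"watch still busy after {max_attempts} attempts")
```

The `finally` clause is the part that matters most for this discussion: a client that fails between watch and unwatch would otherwise leave the fields watched and block every other process.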
I was wondering if someone can help with the following questions:
- Will concurrent queries to the host engine for the profiling fields be supported anytime soon?
- What is the proper/recommended workaround, given that the concurrent queries run at very low frequency and can be serialized when contention happens?
- Is there an obvious bug in my current “try-wait-retry” approach, or could this be a case the DCGM implementation did not consider?