Low-overhead switch performance counter reads?

I am trying to understand why Adaptive Routing (AR) is not helping on some of our workloads, and I am starting with performance counter measurements at the finest granularity possible. Reading the extended performance counters on the local switch (QM8790) using “perfquery -x -a -l $LID” takes about 27 milliseconds. Importing the “pma_query_via()” calls used by perfquery into my own code shows that each call takes over 170 microseconds, so the 80 calls required to read the counters for all 80 ports take about 14 milliseconds. This is an eternity: each HDR100 link can move more than 150 MiB in each direction during this interval, and I have almost 500 switch chips in the full system that I may need to query.
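For reference, here is the back-of-the-envelope arithmetic behind those figures. It is only a rough sketch: it assumes the nominal 100 Gb/s HDR100 rate with no encoding or protocol overhead, rounds "almost 500" up to 500 switches, and assumes the switches would be queried one after another at the same per-call cost as the local switch, which is probably optimistic for switches several hops away.

```python
# Rough estimate of counter-sweep cost; all inputs are approximations.
PER_CALL_S = 170e-6       # measured lower bound per pma_query_via() call
PORTS = 80                # one call per port on the switch
LINK_BITS_PER_S = 100e9   # assumption: nominal HDR100 rate, no overhead
SWITCHES = 500            # assumption: "almost 500" rounded up

# Time to read every port on one switch.
sweep_s = PER_CALL_S * PORTS

# Data one link can carry in one direction during a single sweep.
mib_per_sweep = LINK_BITS_PER_S / 8 * sweep_s / 2**20

# Time to sweep every switch serially (assumed, see above).
serial_all_s = sweep_s * SWITCHES

print(f"one-switch sweep:       {sweep_s * 1e3:.1f} ms")
print(f"per-link data in sweep: {mib_per_sweep:.0f} MiB each direction")
print(f"all switches, serial:   {serial_all_s:.1f} s")
```

Under these assumptions one sweep of a single switch is about 13.6 ms, each link can carry roughly 162 MiB per direction in that time, and a serial sweep of the whole system would take close to 7 seconds.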

Is there a faster way to get the extended performance counters from multiple ports, on the unmanaged local switch and/or on the CS8500 central switches?