dcgm - collectd - value too old messages killing syslog - duplicate UDP packets sent to influxdb

Hi

I am running the latest version of the Data Center GPU Manager (DCGM) on CentOS 7.7 with the following packages:
datacenter-gpu-manager-1.7.2-1.x86_64, collectd-5.8.1-1.el7.x86_64, and GPU driver 440.33.01-1.el7.x86_64.

I keep getting these messages in syslog every time collectd runs the DCGM plugin:

Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/mem_copy_utilization-7; value time = 1580743003.000; last cache update = 1580743003.000;
Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/sm_clock-7; value time = 1580743003.000; last cache update = 1580743003.000;
Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/memory_clock-7; value time = 1580743003.000; last cache update = 1580743003.000;
Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/power_violation-7; value time = 1580743004.000; last cache update = 1580743004.000;
Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/thermal_violation-7; value time = 1580743004.000; last cache update = 1580743004.000;
Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/fb_total-7; value time = 1580743003.000; last cache update = 1580743003.000;
Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/fb_free-7; value time = 1580743003.000; last cache update = 1580743003.000;
Feb  3 10:16:53 node001 collectd: uc_update: Value too old: name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/fb_used-7; value time = 1580743003.000; last cache update = 1580743003.000;
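
In every message above, the value time is equal to the last cache update rather than earlier than it. As far as I understand collectd's value cache, it rejects any value whose timestamp is not newer than the one already stored for the same identifier, so equal timestamps would be consistent with the same identifier being submitted twice within one second. Here is a minimal sketch for checking that across a larger log; the function names are my own (hypothetical), and the regular expression assumes the message format shown above:

```python
import re

# Matches the fields of a collectd "uc_update: Value too old" syslog line.
# Assumption: the message format is exactly the one shown in the log excerpt.
PATTERN = re.compile(
    r"uc_update: Value too old: name = (?P<name>\S+); "
    r"value time = (?P<value_time>\d+\.\d+); "
    r"last cache update = (?P<last_update>\d+\.\d+);"
)


def parse_value_too_old(line):
    """Return (name, value_time, last_update), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group("name"), float(m.group("value_time")), float(m.group("last_update"))


def classify(value_time, last_update):
    """Label a rejected value as a same-timestamp repeat or a genuinely older one."""
    return "same timestamp" if value_time == last_update else "older timestamp"


sample = (
    "Feb  3 10:16:53 node001 collectd: uc_update: Value too old: "
    "name = node001.edison/dcgm_collectd-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3/sm_clock-7; "
    "value time = 1580743003.000; last cache update = 1580743003.000;"
)

name, value_time, last_update = parse_value_too_old(sample)
print(name, classify(value_time, last_update))
```

Running this over the full syslog (one call to parse_value_too_old per line) would show whether all rejected values are same-timestamp repeats or whether some are genuinely older.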

No plugin is loaded more than once:

[root@node001 ~]#  grep -i LoadPlugin /etc/collectd.conf | egrep -v '^[[:space:]]*#' | sort | uniq -c 
      1 LoadPlugin cpu
      1 LoadPlugin disk
      1 LoadPlugin interface
      1 LoadPlugin memory
      1 LoadPlugin network
      1 LoadPlugin processes
      1 LoadPlugin uptime
      1 LoadPlugin users
      1 LoadPlugin vmem

System time looks good:

chronyc> tracking
Reference ID    : 0CA79701 (12.167.151.1)
Stratum         : 4
Ref time (UTC)  : Mon Feb 03 15:15:21 2020
System time     : 0.000030027 seconds fast of NTP time
Last offset     : +0.000094171 seconds
RMS offset      : 0.000208562 seconds
Frequency       : 41.638 ppm slow
Residual freq   : +0.013 ppm
Skew            : 0.703 ppm
Root delay      : 0.065234661 seconds
Root dispersion : 0.028133124 seconds
Update interval : 128.4 seconds
Leap status     : Normal

Each server sending metrics to InfluxDB is configured with a unique hostname, templated in Ansible, and collectd.conf contains:

Hostname    "node001.edison"
FQDNLookup   true

Based on the collectd documentation, I am wondering whether there is an issue with the Python plugin.

When I run tcpdump, this plugin appears to send 4 UDP packets per interval, whereas the other plugins appear to send one. Is that expected?

10:42:01.245360 IP (tos 0x0, ttl 64, id 6341, offset 0, flags [DF], proto UDP (17), length 1344)
    10.0.9.1.54442 > 10.0.0.3.25826: [bad udp cksum 0x2241 -> 0x8128!] UDP, length 1316
E..@..@.@...
.       .
.....d..,"A....node001.edison..............     ......@.......dcgm_collectd....-GPU-3e739ca4-3d63-29f0-8a1b-79c2ef3a4ed7.....sm_clock.....0.............H.@....memory_clock.............x.@....power_violation....................thermal_violation....................fb_total.............X.@....fb_free...............@....fb_used..............`@...-GPU-611d6773-fb07-9598-b425-6c1e6f3bb85a.....retired_pages_sbe.....1....................retired_pages_dbe....................retired_pages_pending....................gpu_temp..............E@....power_usage........j.t...X@....ecc_sbe_volatile_total....................ecc_dbe_volatile_total....................ecc_sbe_aggregate_total...............?....ecc_dbe_aggregate_total....................pcie_replay_counter....................gpu_utilization..............G@....mem_copy_utilization..............*@....sm_clock.............H.@....memory_clock.............x.@....power_violation....................thermal_violation....................fb_total.............X.@....fb_free...............@....fb_used..............`@...-GPU-caaf20a1-6b53-c4bb-2401-f9f2387b4b45.....retired_pages_sbe.....2...............?....retired_pages_dbe....................retired_pages_pending....................gpu_temp..............D@....power_usage........`.."..Y@....ecc_sbe_volatile_total................
10:42:01.245655 IP (tos 0x0, ttl 64, id 6342, offset 0, flags [DF], proto UDP (17), length 1319)
    10.0.9.1.54442 > 10.0.0.3.25826: [bad udp cksum 0x2228 -> 0xd8b8!] UDP, length 1291
E..'..@.@...
.       .
.....d..."(....node001.edison..............     ......@.......dcgm_collectd....-GPU-caaf20a1-6b53-c4bb-2401-f9f2387b4b45.....ecc_dbe_volatile_total.....2....................ecc_sbe_aggregate_total...............@....ecc_dbe_aggregate_total....................pcie_replay_counter....................gpu_utilization..............F@....mem_copy_utilization..............&@....sm_clock.............H.@....memory_clock.............x.@....power_violation....................thermal_violation....................fb_total.............X.@....fb_free...............@....fb_used..............`@...-GPU-83d50526-b00e-8be2-ad6b-53507a72ac28.....retired_pages_sbe.....3....................retired_pages_dbe....................retired_pages_pending....................gpu_temp..............F@....power_usage............x.\@....ecc_sbe_volatile_total....................ecc_dbe_volatile_total....................ecc_sbe_aggregate_total....................ecc_dbe_aggregate_total....................pcie_replay_counter....................gpu_utilization..............G@....mem_copy_utilization..............(@....sm_clock.............H.@....memory_clock.............x.@....power_violation....................thermal_violation....................fb_total.............X.@....fb_free...............@....fb_used..............`@
10:42:01.245956 IP (tos 0x0, ttl 64, id 6343, offset 0, flags [DF], proto UDP (17), length 1349)
    10.0.9.1.54442 > 10.0.0.3.25826: [bad udp cksum 0x2246 -> 0x854a!] UDP, length 1321
E..E..@.@...
.       .
.....d..1"F....node001.edison..............     ......@.......dcgm_collectd....-GPU-4a685ad2-e0d5-c025-466e-697210fcf92c.....retired_pages_sbe.....4....................retired_pages_dbe....................retired_pages_pending....................gpu_temp..............E@....power_usage.........j.t..Y@....ecc_sbe_volatile_total....................ecc_dbe_volatile_total....................ecc_sbe_aggregate_total...............?....ecc_dbe_aggregate_total....................pcie_replay_counter...............@....gpu_utilization..............D@....mem_copy_utilization..............&@....sm_clock.............H.@....memory_clock.............x.@....power_violation....................thermal_violation....................fb_total.............X.@....fb_free...............@....fb_used..............`@...-GPU-e9c60f61-e1fa-6e74-bf50-346fa0f5e9ed.....retired_pages_sbe.....5...............?....retired_pages_dbe....................retired_pages_pending....................gpu_temp..............D@....power_usage......... .rhYZ@....ecc_sbe_volatile_total....................ecc_dbe_volatile_total....................ecc_sbe_aggregate_total...............@....ecc_dbe_aggregate_total....................pcie_replay_counter....................gpu_utilization..............C@....mem_copy_utilization..............$@....sm_clock.............H.@
10:42:01.246255 IP (tos 0x0, ttl 64, id 6344, offset 0, flags [DF], proto UDP (17), length 1358)
    10.0.9.1.54442 > 10.0.0.3.25826: [bad udp cksum 0x224f -> 0x8024!] UDP, length 1330
E..N..@.@...
.       .
.....d..:"O....node001.edison..............     ......@.......dcgm_collectd....-GPU-e9c60f61-e1fa-6e74-bf50-346fa0f5e9ed.....memory_clock.....5.............x.@....power_violation....................thermal_violation....................fb_total.............X.@....fb_free...............@....fb_used..............`@...-GPU-4cf41e88-8ec8-5c6b-a306-a3494e864583.....retired_pages_sbe.....6...............@....retired_pages_dbe....................retired_pages_pending....................gpu_temp..............G@....power_usage..........Zd;.[@....ecc_sbe_volatile_total....................ecc_dbe_volatile_total....................ecc_sbe_aggregate_total..............W@....ecc_dbe_aggregate_total....................pcie_replay_counter....................gpu_utilization..............D@....mem_copy_utilization..............&@....sm_clock.............H.@....memory_clock.............x.@....power_violation....................thermal_violation....................fb_total.............X.@....fb_free...............@....fb_used..............a@...-GPU-3458d51a-eba6-0c1c-00aa-3671fc06a8f3.....retired_pages_sbe.....7...............@....retired_pages_dbe....................retired_pages_pending....................gpu_temp..............E@....power_usage............xIZ@....ecc_sbe_volatile_total....................ecc_dbe_volatile_total................
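
As a rough sanity check on the packet count: the four datagrams above carry 1316, 1291, 1321, and 1330 bytes of UDP payload. Assuming the network plugin's MaxPacketSize is at the 1452-byte default given in the collectd.conf man page (an assumption; the network block of the config is not shown here), the arithmetic below estimates the minimum number of datagrams needed for that much data. It ignores the header parts repeated in each datagram, so it is only a lower bound:

```python
import math

# UDP payload lengths reported by tcpdump for the four datagrams above.
payload_lengths = [1316, 1291, 1321, 1330]

# Assumption: MaxPacketSize is at the default of 1452 bytes.
ASSUMED_MAX_PACKET_SIZE = 1452


def min_datagrams(total_bytes, max_packet_size):
    """Lower bound on the number of datagrams needed to carry total_bytes."""
    return math.ceil(total_bytes / max_packet_size)


total = sum(payload_lengths)
print(total, min_datagrams(total, ASSUMED_MAX_PACKET_SIZE))
```

If that assumption holds, several datagrams per interval could simply reflect the volume of per-GPU metrics rather than duplication, so the packet count alone may not explain the "Value too old" messages.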