DPA cache latency/bandwidth measurements

When I measure the latency of reading a single 64-bit local variable inside a DPA kernel (dpa_global) function, I see upwards of 70 ns per read. This is pretty strange; it is as if the DPA core caches were off. The experiment looks roughly like this:

dpa_global kernel_thread(uint64_t arg) {
    uint64_t data = 0;
    timestart();
    for (int i = 0; i < loops; i++)
        load from &data;
    timestop();
}
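
For comparison, the same measurement pattern can be written in plain C for a host CPU. This is only a sketch and not the DPA code: `now_ns` and `ns_per_load` are hypothetical names, POSIX `clock_gettime` stands in for whatever timer the DPA side uses, and the `volatile` qualifier is an assumption added here so the compiler cannot keep `data` in a register or hoist the load out of the loop.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds (POSIX clock_gettime). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time `loops` back-to-back loads of one 64-bit local variable and
 * return the average cost in nanoseconds per load. */
static double ns_per_load(uint64_t loops)
{
    assert(loops > 0);           /* avoid dividing by zero below */

    volatile uint64_t data = 0;  /* volatile: every iteration performs a real load */
    uint64_t sink = 0;           /* accumulates the loaded values */

    uint64_t start = now_ns();
    for (uint64_t i = 0; i < loops; i++)
        sink += data;
    uint64_t stop = now_ns();

    (void)sink;                  /* silence unused-value warnings */
    return (double)(stop - start) / (double)loops;
}
```

Calling `ns_per_load` with a large iteration count amortizes the cost of the two timer calls over many loads, so the result is an average per-load figure rather than the latency of one isolated access.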

Is there something off about this experiment?