Looking for an explanation / documentation on OpenACC present table dumps

I am debugging a C/C++ application that contains OpenACC accelerated code in shared objects. There were some errors with things “partially present” and there is a freeze when evaluating a present(variable). Thus I’m trying to understand the present table dumps, but I couldn’t find any documentation on the different fields so I’m not sure what it’s telling me exactly. Could I please get some hints to relevant documentation or an explanation?

The host, device, line and name fields are pretty self-explanatory, but:

  • why is presentcount always 0+n? What do those numbers mean?
  • what is with the “allocated block” lines, do those always map to an entry above? The addresses and sizes seem to indicate as much.
  • I’ve seen a lot of “deleted block” entries sometimes, what is that? They even show up with an “…empty…” present table sometimes
  • I’ve seen threadid=1 and threadid=2 in those dumps, is that important?

Sample present table:

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.6, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x15fcd70 device:0x7faa5f2fa200 size:3704 presentcount:0+5 line:1962 name:p[:1]
host:0x15fdbf0 device:0x7faa5f2fb200 size:20412 presentcount:0+19 line:1962 name:p->config[:1]
host:0x164ed60 device:0x7faa2f200000 size:6009344 presentcount:0+1 line:1964 name:p->buf_input[:p->sz]
host:0x1c09f70 device:0x7faa28c00000 size:12018688 presentcount:0+1 line:1980 name:p->Demosaic_Input[:p->sz]
host:0x2780380 device:0x7faa39200000 size:6009344 presentcount:0+1 line:1980 name:p->Demosaic_Output[:][:p->sz]
host:0x2d3b590 device:0x7faa39800000 size:6009344 presentcount:0+1 line:1980 name:p->Demosaic_Output[:][:p->sz]
host:0x32f67a0 device:0x7faa73800000 size:6009344 presentcount:0+1 line:1980 name:p->Demosaic_Output[:][:p->sz]
host:0x38b19b0 device:0x7faa2ec00000 size:6009344 presentcount:0+1 line:1980 name:p->Demosaic_Output[:][:p->sz]
host:0x3e6cbc0 device:0x7faa2ac00000 size:12018688 presentcount:0+1 line:1980 name:p->bufToneMap[:p->sz]
host:0x49e2fd0 device:0x7faa26c00000 size:12018688 presentcount:0+1 line:1980 name:p->bufGradientH0[:p->sz]
host:0x55593e0 device:0x7faa34000000 size:12018688 presentcount:0+1 line:1980 name:p->bufGradientV0[:p->sz]
host:0x60cf7f0 device:0x7faa34c00000 size:12018688 presentcount:0+1 line:1980 name:p->bufGradientH1[:p->sz]
host:0x6c45c00 device:0x7faa32000000 size:12018688 presentcount:0+1 line:1980 name:p->bufGradientV1[:p->sz]
host:0x77bc010 device:0x7faa32c00000 size:12018688 presentcount:0+1 line:1980 name:p->bufIntensity0[:p->sz]
host:0x8332420 device:0x7faa36000000 size:12018688 presentcount:0+1 line:1980 name:p->bufIntensity1[:p->sz]
host:0x8ea8830 device:0x7faa26000000 size:12018688 presentcount:0+1 line:1980 name:p->bufThresholdAdaptive0[:p->sz]
host:0x9a1ec40 device:0x7faa38000000 size:12018688 presentcount:0+1 line:1980 name:p->bufThresholdAdaptive1[:p->sz]
host:0xa595050 device:0x7faa2f800000 size:6009344 presentcount:0+1 line:1980 name:p->bufFir[:][:p->sz]
host:0xbc81860 device:0x7faa31800000 size:6009344 presentcount:0+1 line:1980 name:p->bufFir[:][:p->sz]
host:0xd36e070 device:0x7faa5f600000 size:6009344 presentcount:0+1 line:1980 name:p->bufFir[:][:p->sz]
host:0xea5a880 device:0x7faa38c00000 size:6009344 presentcount:0+1 line:1980 name:p->bufFir[:][:p->sz]
host:0x10147090 device:0x7faa30c00000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMIn[:][:p->sz]
host:0x118338a0 device:0x7faa36c00000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMIn[:][:p->sz]
host:0x12f200b0 device:0x7faa30000000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMIn[:][:p->sz]
host:0x1460c8c0 device:0x7faa2cc00000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMIn[:][:p->sz]
host:0x15cf90d0 device:0x7faa2e000000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMOut[:][:p->sz]
host:0x173e58e0 device:0x7faa2c000000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMOut[:][:p->sz]
host:0x18ad20f0 device:0x7faa28000000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMOut[:][:p->sz]
host:0x1a1be900 device:0x7faa2a000000 size:12018688 presentcount:0+1 line:1980 name:p->bufCCMOut[:][:p->sz]
allocated block device:0x7faa26000000 size:12018688 thread:1
allocated block device:0x7faa26c00000 size:12018688 thread:1
allocated block device:0x7faa28000000 size:12018688 thread:1
allocated block device:0x7faa28c00000 size:12018688 thread:1
allocated block device:0x7faa2a000000 size:12018688 thread:1
allocated block device:0x7faa2ac00000 size:12018688 thread:1
allocated block device:0x7faa2c000000 size:12018688 thread:1
allocated block device:0x7faa2cc00000 size:12018688 thread:1
allocated block device:0x7faa2e000000 size:12018688 thread:1
allocated block device:0x7faa2ec00000 size:6009344 thread:1
allocated block device:0x7faa2f200000 size:6009344 thread:1
allocated block device:0x7faa2f800000 size:6009344 thread:1
allocated block device:0x7faa30000000 size:12018688 thread:1
allocated block device:0x7faa30c00000 size:12018688 thread:1
allocated block device:0x7faa31800000 size:6009344 thread:1
allocated block device:0x7faa32000000 size:12018688 thread:1
allocated block device:0x7faa32c00000 size:12018688 thread:1
allocated block device:0x7faa34000000 size:12018688 thread:1
allocated block device:0x7faa34c00000 size:12018688 thread:1
allocated block device:0x7faa36000000 size:12018688 thread:1
allocated block device:0x7faa36c00000 size:12018688 thread:1
allocated block device:0x7faa38000000 size:12018688 thread:1
allocated block device:0x7faa38c00000 size:6009344 thread:1
allocated block device:0x7faa39200000 size:6009344 thread:1
allocated block device:0x7faa39800000 size:6009344 thread:1
allocated block device:0x7faa5f2fa200 size:4096 thread:1
allocated block device:0x7faa5f2fb200 size:20480 thread:1
allocated block device:0x7faa5f600000 size:6009344 thread:1
allocated block device:0x7faa73800000 size:6009344 thread:1
deleted block   device:0x7faa5f2fa000 size:512 threadid=1

If I remember correctly, presentcount is the reference counter. Data regions can be nested, so each time a variable enters a new region, its reference counter is incremented. On exit from the data region, the counter is decremented. When the count reaches zero, the device copy of the variable is deleted.
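To make the counting concrete, here is a minimal sketch of two nested structured data regions naming the same array. The function names scale and process are made up for illustration, and the counts in the comments follow the description above rather than an actual dump.

```c
#include <assert.h>

/* Hypothetical helper: doubles (or otherwise scales) a[0..n) in place. */
void scale(float *a, int n, float factor)
{
    assert(a != 0 && n >= 0);
    /* When called from process(), a is already present, so this region
       only increments the reference count: no new allocation, no copy. */
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] *= factor;
    } /* count decremented; the device copy stays while the count is > 0 */
}

/* Hypothetical driver: scales a by 2 and returns the sum of the result. */
float process(float *a, int n)
{
    float sum = 0.0f;
    #pragma acc data copy(a[0:n]) /* count 0 -> 1: allocate and copy in */
    {
        scale(a, n, 2.0f);        /* count 1 -> 2 -> 1 inside the call */
        #pragma acc parallel loop present(a[0:n]) reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i];
    } /* count 1 -> 0: copy out and release the device copy */
    return sum;
}
```

Calling process(a, 4) on {1, 2, 3, 4} leaves a as {2, 4, 6, 8} and returns 20. Built without OpenACC support, the pragmas are ignored and the code runs on the host with the same result.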

  • what is with the “allocated block” lines, do those always map to an entry above? The addresses and sizes seem to indicate as much.
  • I’ve seen a lot of “deleted block” entries sometimes, what is that? They even show up with an “…empty…” present table sometimes

These both have to do with our pool allocator. Instead of actually freeing and reallocating device memory, the runtime reuses data blocks to save the overhead of reallocating each time. The allocated blocks are the memory currently in use, while the deleted blocks are the memory available for reuse.

You can disable this by setting the environment variable “NV_ACC_MEM_MANAGE=0”.
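For checking whether each present-table entry is backed by an allocated block, a small parser can help. This is only a sketch based on the line layout in the sample dump above; the format is undocumented and may change, and the names dev_range, parse_entry, parse_block and block_backs_entry are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* A device address range taken from one dump line. */
typedef struct {
    unsigned long long device; /* device start address */
    unsigned long long size;   /* size in bytes */
} dev_range;

/* Parses "host:0x... device:0x... size:N ..."; returns 1 on success. */
int parse_entry(const char *line, dev_range *out)
{
    unsigned long long host; /* parsed but not needed here */
    assert(line != 0 && out != 0);
    return sscanf(line, "host:%llx device:%llx size:%llu",
                  &host, &out->device, &out->size) == 3;
}

/* Parses "allocated block device:0x... size:N ..."; returns 1 on success. */
int parse_block(const char *line, dev_range *out)
{
    assert(line != 0 && out != 0);
    return sscanf(line, "allocated block device:%llx size:%llu",
                  &out->device, &out->size) == 2;
}

/* Returns 1 if the entry's device range lies entirely inside the block. */
int block_backs_entry(const dev_range *block, const dev_range *entry)
{
    return entry->device >= block->device &&
           entry->device + entry->size <= block->device + block->size;
}
```

In the sample dump, the p[:1] entry (device 0x7faa5f2fa200, size 3704) falls inside the allocated block at the same address with size 4096, so the block can be larger than the entry it holds.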

  • I’ve seen threadid=1 and threadid=2 in those dumps, is that important?

I believe that indicates the thread that created the data in a multi-threaded program.

That would be host threads then I assume?

If there’s no documentation for this, may I suggest to write some? This sounds like a nice tool for debugging memory related issues when moving to unstructured data statements…

Correct.

If there’s no documentation for this, may I suggest to write some?

I sent a note to the GPU compiler team’s manager to see if she wants to document it. It’s really intended for our team’s internal use, though I do ask users to use it when I’m helping them debug their code.

The runtime prints out the present table whenever there’s a wrong present statement or similar I think? That’s not really an internal thing. A lot of users will likely see those at some point when programming OpenACC code with your compilers?

But I get your point. Thank you for asking!

Do you know if multi-threaded programs might have issues with present table entries?

That shouldn’t be a problem, given that the host threads share the same memory space on both the host and the device. While I don’t often use it, I’ve put OpenACC directives within OpenMP host parallel regions without issue.
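A minimal sketch of that pattern, assuming one shared array and OpenMP host threads that each launch an OpenACC loop over a disjoint chunk. The function name add_one_chunked and the constant NCHUNKS are hypothetical, and this is not taken from the application discussed here.

```c
#include <assert.h>

#define NCHUNKS 4 /* hypothetical number of chunks, one per loop iteration */

/* Adds 1 to every element of a[0..n), splitting the work into chunks
   that OpenMP host threads hand to the device independently. */
void add_one_chunked(float *a, int n)
{
    assert(a != 0 && n >= 0);
    /* One data region owns the device copy for all host threads. */
    #pragma acc data copy(a[0:n])
    {
        #pragma omp parallel for
        for (int c = 0; c < NCHUNKS; ++c) {
            int lo = c * n / NCHUNKS;       /* first index of this chunk */
            int hi = (c + 1) * n / NCHUNKS; /* one past the last index */
            /* Each thread finds a in the shared present table. */
            #pragma acc parallel loop present(a[0:n])
            for (int i = lo; i < hi; ++i)
                a[i] += 1.0f;
        }
    }
}
```

With the NVIDIA compilers this would be built with both -acc and -mp. Without either flag the pragmas are ignored and the function still produces the same result serially on the host.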

I am actually encountering present table dumps where presentcount is not 0+n but also n+0 – what’s the significance of that?

There are two present reference counters. The first is the static count and the second is the dynamic count. I believe the static counter corresponds to entry into structured data regions and “declare”, while the dynamic counter is for unstructured data regions (“enter data” / “exit data”).
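As an illustration of the two kinds of regions, here is a minimal sketch that creates a buffer with an unstructured “enter data” and later uses it inside a structured data region. The function names are hypothetical, and the comments about which counter changes follow the explanation above; they have not been checked against an actual dump.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates a buffer holding 0, 1, 2, ... and puts it on the device. */
float *make_buffer(int n)
{
    float *buf = malloc((size_t)n * sizeof *buf);
    if (buf == NULL)
        return NULL;
    for (int i = 0; i < n; ++i)
        buf[i] = (float)i;
    /* Unstructured region: expected to raise the dynamic counter. */
    #pragma acc enter data copyin(buf[0:n])
    return buf;
}

/* Adds 1 to every element on the device. */
void increment(float *buf, int n)
{
    assert(buf != NULL);
    /* Structured region: expected to raise the static counter on entry
       and lower it again at the closing brace. */
    #pragma acc data present(buf[0:n])
    {
        #pragma acc parallel loop present(buf[0:n])
        for (int i = 0; i < n; ++i)
            buf[i] += 1.0f;
    }
}

/* Copies the data back and drops the dynamic reference from make_buffer. */
void release_buffer(float *buf, int n)
{
    assert(buf != NULL);
    #pragma acc exit data copyout(buf[0:n])
}
```

A typical sequence is make_buffer(4), increment(buf, 4), release_buffer(buf, 4), after which the host buffer holds {1, 2, 3, 4} and can be passed to free.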