UVM sample shows 100% GPU utilization on a Tesla P4 card

Kernel version: 4.4.32
NVIDIA driver version: 375.39

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 0000:14:00.0     Off |                  Off |
| N/A   46C    P0    25W /  75W |    113MiB /  8116MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

lspci info:

14:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
        Subsystem: NVIDIA Corporation Device 11d8
        Flags: bus master, fast devsel, latency 0, IRQ 194
        Memory at c4000000 (32-bit, non-prefetchable) 
        Memory at 383fe0000000 (64-bit, prefetchable) 
        Memory at 383ff0000000 (64-bit, prefetchable) 
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: nvidia

Sample code:

#include <string.h>
#include <stdio.h>

struct DataElement
{
  int value;
};

__global__
void Kernel(DataElement *elem) {
  //printf("On device: value=%d\n", elem->value);

  elem->value = 20;
}

void launch(DataElement *elem) {
  Kernel<<< 1, 1 >>>(elem);
  cudaDeviceSynchronize();
}

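int main(void)
{
  DataElement *e;
  // Allocate managed (unified) memory, accessible from both host and device.
  cudaMallocManaged((void**)&e, sizeof(DataElement));

  // Write from the host before the kernel runs.
  e->value = 10;

  // Launch the kernel and wait for it to finish.
  launch(e);

  // Read back on the host the value written by the kernel.
  printf("On host: value=%d\n", e->value);

  cudaFree(e);

  cudaDeviceReset();
  return 0;
}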
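
The sample above ignores the return values of the CUDA runtime calls, so a silently failing call cannot be ruled out. A variant with error checking might look like the following. This is only a minimal sketch: CUDA_CHECK is a hypothetical helper macro introduced here, not part of the CUDA API, and the rest of the logic is unchanged from the sample.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): print the error and exit
// if a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                         \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
              cudaGetErrorString(err_));                         \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

struct DataElement
{
  int value;
};

__global__
void Kernel(DataElement *elem) {
  elem->value = 20;
}

int main(void)
{
  DataElement *e = NULL;
  CUDA_CHECK(cudaMallocManaged((void**)&e, sizeof(DataElement)));

  e->value = 10;

  Kernel<<< 1, 1 >>>(e);
  CUDA_CHECK(cudaGetLastError());       // reports kernel launch errors
  CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised during execution

  printf("On host: value=%d\n", e->value);

  CUDA_CHECK(cudaFree(e));
  CUDA_CHECK(cudaDeviceReset());
  return 0;
}
```

If every call succeeds and the host still reads back the value written by the kernel, the 100% GPU-Util reading is unlikely to be caused by a failed allocation or launch in the sample itself.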