Cuda-memcheck seems to ignore memory leaks

According to the documentation,

Memory leaks are device-side allocations that have not been freed by the time the context is destroyed. The memcheck tool tracks device memory allocations created using the CUDA driver or runtime APIs. Starting in CUDA 5, allocations that are created dynamically on the device heap by calling malloc() inside a kernel are also tracked.

I am using CUDA 10.1, but cuda-memcheck seems to ignore both of these cases. Here is my code and memcheck output:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
$ cuda-memcheck --version
CUDA-MEMCHECK version 10.1.105 ID:(46)
$ cat tmp.cu
#include <cassert>

__global__ void kernel()
{
    auto a = malloc(sizeof(int) * 1000);
}

int main()
{
    void *p;
    auto err = cudaMalloc(&p, sizeof(int) * 1000);
    assert(err == cudaSuccess);

    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}
$ nvcc tmp.cu -O0 -o a.out
$ ./a.out
$ echo $?
0
$ cuda-memcheck --leak-check full ./a.out
========= CUDA-MEMCHECK
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors
$

I was expecting it to report two leaks: one for the `cudaMalloc` in `main` and one for the `malloc` inside the kernel. Am I missing something?

Memory leaks are only reported at context destruction time, and your program exits without ever destroying its CUDA context. See the leak-checking section of the documentation: https://docs.nvidia.com/cuda/cuda-memcheck/#leak-checking

For an accurate leak checking summary to be generated, the application’s CUDA context must be destroyed at the end. This can be done explicitly by calling cuCtxDestroy() in applications using the CUDA driver API, or by calling cudaDeviceReset() in applications programmed against the CUDA run time API.
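Since your program uses the runtime API, the fix is to call `cudaDeviceReset()` at the end of `main`. A sketch of the modified program (same allocations as yours, with the reset added):

```cuda
#include <cassert>

__global__ void kernel()
{
    // Device-heap allocation that is never freed.
    auto a = malloc(sizeof(int) * 1000);
    (void)a;  // silence the unused-variable warning
}

int main()
{
    // Host-side device allocation that is never freed.
    void *p;
    auto err = cudaMalloc(&p, sizeof(int) * 1000);
    assert(err == cudaSuccess);

    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Destroy the context so cuda-memcheck can produce an
    // accurate leak summary. Without this call, the program
    // exits with the context still alive and no leaks are
    // reported.
    cudaDeviceReset();
}
```

With this change, `cuda-memcheck --leak-check full ./a.out` should report both allocations in its leak summary (the exact output format varies by version).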