Profiling CUDA memory consumption

I’m trying to use nvprof to obtain detailed memory-consumption data for my application. I have two questions:

  1. Using nvprof or CUPTI, is there a way to obtain the total allocated device memory at a given point in time? nvidia-smi is able to print the total amount of allocated device memory, so there must be some API access to that figure. I know how to profile memory allocations and frees with nvprof, but that doesn’t account for all memory (e.g. it doesn’t account for memory reserved for CUDA heap allocations).

  2. How can I find out how much local memory a kernel uses, in total and per thread? Both figures are captured in nvprof’s sqlite output, in the CUPTI_ACTIVITY_KIND_KERNEL table (as localMemoryPerThread and localMemoryTotal). However, this gives me values that I cannot explain:

    • For the code


        __global__ void saxpy_with_local_memory(int n, float a, float *x, float *y)
        {
            const int local_memory_size = 1000;
            unsigned char p[local_memory_size];
            int i = blockIdx.x*blockDim.x + threadIdx.x;
            // Do this to prevent compiler optimizations from removing the array.
            p[i%local_memory_size] = i;
            if (i < 0) y[i] += p[i%local_memory_size];
            if (i < n) y[i] = a*x[i] + y[i];
        }

        int main(void)
        {
            // Perform saxpy_with_local_memory on 1M elements
            const int N = 1 << 20;
            float *d_x;
            cudaMalloc(&d_x, N * sizeof(float));
            int n_threads_per_block = 1024;
            saxpy_with_local_memory<<<(N+n_threads_per_block-1)/n_threads_per_block, n_threads_per_block>>>(N, 2.0f, d_x, d_x);
            cudaDeviceSynchronize();
            cudaFree(d_x);
            return 0;
        }



I get

    localMemoryPerThread: 0
    localMemoryTotal: 264241152

even though I am allocating 1000 bytes of local memory per thread.

If I change local_memory_size to 100000, the profiler gives what looks like a buggy result (a negative total, possibly a 32-bit integer overflow):

    localMemoryPerThread: 0
    localMemoryTotal: -1267466240

How can these results be explained?
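As a cross-check outside the profiler, I believe the runtime API exposes the compiler’s per-thread local memory figure through cudaFuncGetAttributes (its localSizeBytes field). A minimal sketch, reusing the kernel from above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as above, with 1000 bytes of per-thread local memory.
__global__ void saxpy_with_local_memory(int n, float a, float *x, float *y)
{
    const int local_memory_size = 1000;
    unsigned char p[local_memory_size];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    p[i%local_memory_size] = i;
    if (i < 0) y[i] += p[i%local_memory_size];
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
    cudaFuncAttributes attr;
    // localSizeBytes reports the local memory allocated per thread of this kernel.
    cudaFuncGetAttributes(&attr, (const void *)saxpy_with_local_memory);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

As far as I know, compiling with `nvcc -Xptxas -v` also prints the per-thread local memory ("lmem") usage at build time, which should agree with this figure.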

My system details:
GPU: Tesla P100-PCIE-16GB
I’m doing all my work in an nvidia/cuda:11.0-devel-ubuntu18.04 Docker container.
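Regarding question 1: the closest thing I have found in the runtime API itself is cudaMemGetInfo, which returns the free and total device memory as the driver sees it, so total − free should roughly match the figure nvidia-smi reports (it is device-wide, though, not broken down per process or per allocation). A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_bytes = 0, total_bytes = 0;
    // Free and total device memory as seen by the driver; the difference
    // includes context/driver overhead, not just explicit cudaMalloc calls.
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("used: %zu MiB  free: %zu MiB  total: %zu MiB\n",
           (total_bytes - free_bytes) >> 20, free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

If a per-process breakdown is needed, I believe NVML (nvmlDeviceGetComputeRunningProcesses) is what nvidia-smi itself uses, but I have not tried wiring that into my profiling runs yet.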