I’m trying to use nvprof to obtain detailed memory consumption data of my application. I have two questions:
- Using nvprof or CUPTI, is there a way to obtain the total allocated device memory at a given point in time? nvidia-smi is able to print the total amount of allocated device memory, so there must be some API access to that figure. I know how to profile memory allocations and frees with nvprof, but that doesn’t account for all memory (e.g. it doesn’t account for memory reserved for CUDA heap allocations).
- How can I find out how much local memory a kernel uses, in total and per thread? Both of these figures are captured by nvprof’s sqlite output in the CUPTI_ACTIVITY_KIND_KERNEL table. However, this gives me data that I cannot explain. For the code
//...
__global__
void saxpy_with_local_memory(int n, float a, float *x, float *y)
{
    const int local_memory_size = 1000;
    unsigned char p[local_memory_size];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    // Do this to prevent compiler optimizations from removing the array.
    p[i%local_memory_size] = i;
    if (i < 0) y[i] += p[i%local_memory_size];
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
    //...
    // Perform saxpy_with_local_memory on 1M elements
    int n_threads_per_block = 1024;
    saxpy_with_local_memory<<<(N+n_threads_per_block-1)/n_threads_per_block, n_threads_per_block>>>(N, 2.0f, d_x, d_x);
    //...
}
I get
localMemoryPerThread: 0
localMemoryTotal: 264241152
even though I am allocating 1000 bytes per thread.
If I change local_memory_size to 100000, the profiler seems to give a buggy result:
localMemoryPerThread: 0
localMemoryTotal: -1267466240
How can these results be explained?
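As a rough sanity check on the two totals, here is a minimal sketch. It assumes (this is not confirmed by the nvprof or CUPTI documentation) that localMemoryTotal is a per-thread local memory reservation multiplied by the maximum number of resident threads on the device, and that the sqlite column keeps only the low 32 bits as a signed integer. The P100 figures of 56 SMs and 2048 resident threads per SM are hardware specifications. The per-thread values 2304 and 101296 are back-solved from the reported totals, not reported by nvprof, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Tesla P100: 56 SMs, up to 2048 resident threads per SM.
const std::int64_t kMaxResidentThreads = 56LL * 2048LL;  // 114688

// Hypothetical helper. Assumption: the total is the per-thread reservation
// multiplied by every thread that can be resident on the device at once.
std::int64_t local_memory_total(std::int64_t bytes_per_thread) {
    return bytes_per_thread * kMaxResidentThreads;
}

// Hypothetical helper. Assumption: the stored value keeps only the low
// 32 bits of the true total and interprets them as a signed integer.
std::int64_t as_signed_32(std::int64_t value) {
    const std::int64_t low = value & 0xFFFFFFFFLL;
    return low >= 0x80000000LL ? low - 0x100000000LL : low;
}

// Compares the model against the two totals reported in this post.
// The per-thread sizes (2304 and 101296) are back-solved, not measured.
void check_against_reported_totals() {
    // local_memory_size = 1000: the total fits in 32 bits.
    assert(as_signed_32(local_memory_total(2304)) == 264241152LL);
    // local_memory_size = 100000: the true total would be 11617435648
    // bytes, which does not fit in 32 bits and wraps to a negative value.
    assert(local_memory_total(101296) == 11617435648LL);
    assert(as_signed_32(local_memory_total(101296)) == -1267466240LL);
}
```

Under these assumptions both reported totals are reproduced exactly, which suggests that the negative value is a 32-bit wraparound of roughly 11.6 GB rather than a real measurement. The back-solved per-thread sizes equal the array size rounded up to a multiple of 8 plus a constant 1296 bytes, but that breakdown is an inference from only two data points.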
My system details:
GPU: Tesla P100-PCIE-16GB
I’m doing all my work in an nvidia/cuda:11.0-devel-ubuntu18.04 Docker container.