Memory consumption of cudaDeviceSynchronize from kernel

When cudaDeviceSynchronize() is called from a kernel, ~2.2GB of device memory is consumed.

I hope someone is able to explain to me what is happening under the hood, and whether this is intentional behavior. I find it odd that simply compiling the kernel, without ever calling it, takes up such a large portion of memory.

The following little example program reports only ~14GB free out of the total 16GB available on the Tesla V100.

I am using CUDA 11.3 (but saw the same behavior with 10.2).

I have been looking through the Dynamic Parallelism documentation for any hints as to why this happens, but without any luck.

#include <cuda.h>
#include <iostream>

__global__ void kernel()
{
    cudaDeviceSynchronize(); //Synchronizes on work launched from thread's own block only
}

int main(int argc, char *argv[])
{
    size_t gpu_free;
    size_t gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << "available: " << gpu_free << " - total: " << gpu_total << std::endl;
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}

I compile it with:

nvcc -MD -MT src/gputests/test.cu.o -MF test.cu.o.d -x cu -dc ./test.cu -o test.cu.o

nvcc -Xcompiler=-fPIC -shared -dlink ./test.cu.o -o ./cmake_device_link.o -lcudadevrt -lcudart_static -lrt -lpthread -ldl

/usr/bin/g++ ./test.cu.o ./cmake_device_link.o -o ./test -lcudadevrt -lcudart_static -lrt -lpthread -ldl  -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib" 

I’ve tested it on a Tesla V100 and GTX 1050Ti.
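To separate the effect of the device-side synchronize call from everything else, a variant along these lines (the USE_CDP_SYNC macro is just my own naming for the experiment, not part of the original program) could be built once with the call and once without, and the two free-memory readings compared:

#include <cuda.h>
#include <iostream>

// Build once as:  nvcc -rdc=true -DUSE_CDP_SYNC -o with_sync test.cu -lcudadevrt
// and once as:    nvcc -o without_sync test.cu
// then compare the usage reported by the two binaries.
__global__ void kernel()
{
#ifdef USE_CDP_SYNC
    cudaDeviceSynchronize(); // device-side sync; requires the device runtime (cudadevrt)
#endif
}

int main()
{
    size_t gpu_free = 0;
    size_t gpu_total = 0;
    cudaError_t err = cudaMemGetInfo(&gpu_free, &gpu_total); // first runtime call also creates the context
    if (err != cudaSuccess) {
        std::cerr << "cudaMemGetInfo failed: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }
    std::cout << "available: " << gpu_free << " - total: " << gpu_total << std::endl;
    std::cout << "used: " << (gpu_total - gpu_free) << std::endl;
    return 0;
}

The kernel is never launched in either build, which matches the scenario above.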

When I compile and run your code as follows on CUDA 11.3, on a GTX 960 with 2GB, I get the following output:

$ cat t129.cu
#include <cuda.h>
#include <iostream>

__global__ void kernel()
{
    cudaDeviceSynchronize(); //Synchronizes on work launched from thread's own block only
}

int main(int argc, char *argv[])
{
    size_t gpu_free;
    size_t gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << "available: " << gpu_free << " - total: " << gpu_total << std::endl;
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}
$ nvcc -rdc=true -o t129 t129.cu -lcudadevrt
$ ./t129
available: 1894187008 - total: 2099052544
204865536
$

That indicates a usage of ~200MB, which in my opinion is completely normal. There is no difference in behavior or reported size if I compile with your sequence:

$ nvcc -MD -MT src/gputests/test.cu.o -MF test.cu.o.d -x cu -dc ./t129.cu -o test.cu.o
$ nvcc -Xcompiler=-fPIC -shared -dlink ./test.cu.o -o ./cmake_device_link.o -lcudadevrt -lcudart_static -lrt -lpthread -ldl
$ /usr/bin/g++ ./test.cu.o ./cmake_device_link.o -o ./test -lcudadevrt -lcudart_static -lrt -lpthread -ldl  -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda-11.3/lib64"
$ ./test
available: 1894187008 - total: 2099052544
204865536
$

Perhaps you are misreading the output. That's simply a guess, since you haven't actually shown a trial run of the program on your system.
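For what it's worth, printing the same numbers scaled to GiB makes a misread less likely; this is just a tweak to the two output lines of the program above, nothing else changed:

    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::cout << "available: " << gpu_free / gib << " GiB - total: " << gpu_total / gib << " GiB" << std::endl;
    std::cout << "used: " << (gpu_total - gpu_free) / gib << " GiB" << std::endl;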

Thank you for getting back to me.

Interesting that you don’t experience the same memory consumption as I do.

On my GTX 1050Ti I get:

available: 2312568832 - total: 4233035776
1920466944

On the Tesla V100 I get:

available: 14730919936 - total: 16945512448
2214592512

So roughly 2.2 GB on the V100 and a bit less (~1.9 GB) on the GTX 1050Ti.

I also get the same output when simply compiling with:

nvcc -rdc=true -o test test.cu -lcudadevrt

I observe what you are reporting on a Tesla V100-32GB.

I’m not able to explain it, although it may be that the device runtime memory usage varies based on GPU architecture.
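One thing that might be worth probing (this is speculation on my part, not a confirmed explanation) is whether the reservation tracks the device runtime limits associated with dynamic parallelism, cudaLimitDevRuntimeSyncDepth and cudaLimitDevRuntimePendingLaunchCount. A rough sketch, reusing the same never-launched kernel:

#include <cuda.h>
#include <iostream>

// Build with: nvcc -rdc=true -o probe probe.cu -lcudadevrt  (file name is arbitrary)
__global__ void kernel()
{
    cudaDeviceSynchronize(); // keeps the device runtime linked in, as before
}

static void report(const char *tag)
{
    size_t gpu_free = 0, gpu_total = 0;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << tag << ": used " << (gpu_total - gpu_free) << " bytes" << std::endl;
}

int main()
{
    size_t sync_depth = 0, pending = 0;
    cudaDeviceGetLimit(&sync_depth, cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    std::cout << "sync depth: " << sync_depth
              << ", pending launch count: " << pending << std::endl;
    report("defaults");

    // Try shrinking the limits and see whether the reported reservation follows.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 1);
    if (err != cudaSuccess)
        std::cerr << "set sync depth: " << cudaGetErrorString(err) << std::endl;
    err = cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 128);
    if (err != cudaSuccess)
        std::cerr << "set pending launch count: " << cudaGetErrorString(err) << std::endl;
    report("after lowering limits");
    return 0;
}

If the numbers move with those limits, that would at least narrow down where the reservation is coming from.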

You’re welcome to file a bug if you wish.