When a kernel that calls cudaDeviceSynchronize() is compiled into my program, ~2.2GB of device memory is consumed.
I hope someone is able to explain to me what is happening under the hood, and whether this is intentional behavior. I find it odd that simply compiling the kernel, without ever calling it, takes up such a big portion of memory.
The following little example program reports only ~14GB free out of the 16GB total on a Tesla V100.
I am using CUDA 11.3 (but saw the same behavior with 10.2).
I have been looking through the Dynamic Parallelism documentation for any hints as to why this happens, but without any luck.
#include <cuda.h>
#include <iostream>
__global__ void kernel()
{
    cudaDeviceSynchronize(); // Synchronizes on work launched from thread's own block only
}

int main(int argc, char *argv[])
{
    size_t gpu_free;
    size_t gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << "available: " << gpu_free << " - total: " << gpu_total << std::endl;
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}
I compile it with:
nvcc -MD -MT src/gputests/test.cu.o -MF test.cu.o.d -x cu -dc ./test.cu -o test.cu.o
nvcc -Xcompiler=-fPIC -shared -dlink ./test.cu.o -o ./cmake_device_link.o -lcudadevrt -lcudart_static -lrt -lpthread -ldl
/usr/bin/g++ ./test.cu.o ./cmake_device_link.o -o ./test -lcudadevrt -lcudart_static -lrt -lpthread -ldl -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib"
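For context, a device-side cudaDeviceSynchronize() is only meaningful under dynamic parallelism, i.e. a parent kernel waiting on child grids it launched, which is why the build has to device-link against cudadevrt. A minimal sketch of that pattern (the child kernel is purely illustrative, not part of my test program):

__global__ void child() { }

__global__ void parent()
{
    child<<<1, 32>>>();      // device-side launch (dynamic parallelism)
    cudaDeviceSynchronize(); // wait for child grids launched from this block
}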
I’ve tested it on a Tesla V100 and a GTX 1050Ti.
When I compile and run your code as follows on CUDA 11.3 with a GTX 960 (2GB), I get the following output:
$ cat t129.cu
#include <cuda.h>
#include <iostream>
__global__ void kernel()
{
    cudaDeviceSynchronize(); // Synchronizes on work launched from thread's own block only
}

int main(int argc, char *argv[])
{
    size_t gpu_free;
    size_t gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << "available: " << gpu_free << " - total: " << gpu_total << std::endl;
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}
$ nvcc -rdc=true -o t129 t129.cu -lcudadevrt
$ ./t129
available: 1894187008 - total: 2099052544
204865536
$
That indicates a usage of ~200MB, which in my opinion is completely normal. There is no difference in behavior or reported size if I compile with your sequence:
$ nvcc -MD -MT src/gputests/test.cu.o -MF test.cu.o.d -x cu -dc ./t129.cu -o test.cu.o
$ nvcc -Xcompiler=-fPIC -shared -dlink ./test.cu.o -o ./cmake_device_link.o -lcudadevrt -lcudart_static -lrt -lpthread -ldl
$ /usr/bin/g++ ./test.cu.o ./cmake_device_link.o -o ./test -lcudadevrt -lcudart_static -lrt -lpthread -ldl -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda-11.3/lib64"
$ ./test
available: 1894187008 - total: 2099052544
204865536
$
Perhaps you are misreading the output. That’s simply a guess since you haven’t actually shown a trial run of the program on your system.
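If you want to separate plain context-creation overhead from anything the device runtime might reserve, one experiment is to build a variant with no kernels at all (and without -rdc / -lcudadevrt) and compare what cudaMemGetInfo reports. A minimal sketch (file name is arbitrary):

#include <cuda_runtime.h>
#include <iostream>

// No kernels here: whatever usage cudaMemGetInfo reports is pure
// context-creation overhead for this GPU/driver combination.
int main()
{
    size_t gpu_free, gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total); // implicitly initializes the context
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}

Compiled with just nvcc -o baseline baseline.cu, no device linking.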
Thank you for getting back to me.
Interesting that you don’t experience the same memory consumption as I do.
On my GTX 1050Ti I get:
available: 2312568832 - total: 4233035776
1920466944
On the Tesla V100 I get:
available: 14730919936 - total: 16945512448
2214592512
So roughly 2.2 GB on the V100 and a bit less on the GTX 1050Ti.
I also get the same output when simply compiling with:
nvcc -rdc=true -o test test.cu -lcudadevrt
I observe what you are reporting on a Tesla V100-32GB.
I’m not able to explain it, although it may be that the device runtime memory usage varies based on GPU architecture.
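One thing you could experiment with is the device runtime limits. The Memory Footprint section of the Dynamic Parallelism documentation says the device runtime reserves backing-store memory (it cites a conservative figure of up to ~150MB per level of synchronization depth, plus a pool for pending launches), and those reservations are controlled through cudaDeviceSetLimit. Here is a sketch; it is an assumption on my part that the ~2.2GB you see is tied to these limits, and I have not verified that lowering them shrinks it:

#include <cuda_runtime.h>
#include <iostream>

__global__ void kernel()
{
    cudaDeviceSynchronize(); // device-side sync, pulls in the device runtime
}

int main()
{
    size_t v;
    // Defaults per the Dynamic Parallelism docs: sync depth 2,
    // pending launch count 2048.
    cudaDeviceGetLimit(&v, cudaLimitDevRuntimeSyncDepth);
    std::cout << "sync depth: " << v << std::endl;
    cudaDeviceGetLimit(&v, cudaLimitDevRuntimePendingLaunchCount);
    std::cout << "pending launches: " << v << std::endl;

    // Try shrinking the limits (before any kernel launch) and see
    // whether the reported reservation drops.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 1);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 128);

    size_t gpu_free, gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}

Compile it the same way (-rdc=true -lcudadevrt) so the device runtime is actually linked in.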
You’re welcome to file a bug if you wish.