When a kernel that calls cudaDeviceSynchronize() is compiled into my program, ~2.2GB of device memory is consumed.
I hope someone is able to explain to me what is happening under the hood, and whether this is intentional behavior. I find it odd that simply compiling the kernel, without ever calling it, takes up such a big portion of memory.
The following little example program reports only ~14GB free out of the 16GB total on a Tesla V100.
I am using CUDA 11.3 (but saw the same behavior with 10.2).
I have been looking through the Dynamic Parallelism documentation for any hints as to why this happens, but without any luck.
#include <cuda.h>
#include <iostream>
__global__ void kernel()
{
    cudaDeviceSynchronize(); // Synchronizes on work launched from thread's own block only
}

int main(int argc, char *argv[])
{
    size_t gpu_free;
    size_t gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << "available: " << gpu_free << " - total: " << gpu_total << std::endl;
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}
I compile it with:
nvcc -MD -MT src/gputests/test.cu.o -MF test.cu.o.d -x cu -dc ./test.cu -o test.cu.o
nvcc -Xcompiler=-fPIC -shared -dlink ./test.cu.o -o ./cmake_device_link.o -lcudadevrt -lcudart_static -lrt -lpthread -ldl
/usr/bin/g++ ./test.cu.o ./cmake_device_link.o -o ./test -lcudadevrt -lcudart_static -lrt -lpthread -ldl -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib"
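For context, a device-side cudaDeviceSynchronize() is only meaningful under dynamic parallelism, i.e. a parent kernel waiting on child grids it launched, which is why the build has to device-link against cudadevrt. A minimal sketch of that pattern (the child kernel is purely illustrative, not part of my test program):

__global__ void child() { }

__global__ void parent()
{
    child<<<1, 32>>>();      // device-side launch (dynamic parallelism)
    cudaDeviceSynchronize(); // wait for child grids launched from this block
}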
I’ve tested it on a Tesla V100 and a GTX 1050Ti.
When I compile and run your code as follows on CUDA 11.3 with a GTX 960 (2GB), I get the following output:
$ cat t129.cu
#include <cuda.h>
#include <iostream>
__global__ void kernel()
{
    cudaDeviceSynchronize(); // Synchronizes on work launched from thread's own block only
}

int main(int argc, char *argv[])
{
    size_t gpu_free;
    size_t gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << "available: " << gpu_free << " - total: " << gpu_total << std::endl;
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}
$ nvcc -rdc=true -o t129 t129.cu -lcudadevrt
$ ./t129
available: 1894187008 - total: 2099052544
204865536
$
That indicates a usage of ~200MB, which in my opinion is completely normal. There is no difference in behavior or reported size if I compile with your sequence:
$ nvcc -MD -MT src/gputests/test.cu.o -MF test.cu.o.d -x cu -dc ./t129.cu -o test.cu.o
$ nvcc -Xcompiler=-fPIC -shared -dlink ./test.cu.o -o ./cmake_device_link.o -lcudadevrt -lcudart_static -lrt -lpthread -ldl
$ /usr/bin/g++ ./test.cu.o ./cmake_device_link.o -o ./test -lcudadevrt -lcudart_static -lrt -lpthread -ldl -L"/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda-11.3/lib64"
$ ./test
available: 1894187008 - total: 2099052544
204865536
$
Perhaps you are misreading the output. That’s simply a guess since you haven’t actually shown a trial run of the program on your system.
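If you want to separate plain context-creation overhead from anything the device runtime might reserve, one experiment is to build a variant with no kernels at all (and without -rdc / -lcudadevrt) and compare what cudaMemGetInfo reports. A minimal sketch (file name is arbitrary):

#include <cuda_runtime.h>
#include <iostream>

// No kernels here: whatever usage cudaMemGetInfo reports is pure
// context-creation overhead for this GPU/driver combination.
int main()
{
    size_t gpu_free, gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total); // implicitly initializes the context
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}

Compiled with just nvcc -o baseline baseline.cu, no device linking.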
Thank you for getting back to me.
Interesting that you don’t experience the same memory consumption as I do.
On my GTX 1050Ti I get:
available: 2312568832 - total: 4233035776
1920466944
On the Tesla V100 I get:
available: 14730919936 - total: 16945512448
2214592512
So roughly 2.2 GB on the V100 and a bit less on the GTX 1050Ti.
I also get the same output when simply compiling with:
nvcc -rdc=true -o test test.cu -lcudadevrt
I observe what you are reporting on a Tesla V100-32GB.
I’m not able to explain it, although it may be that the device runtime memory usage varies based on GPU architecture.
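One thing you could experiment with is the device runtime limits. The Memory Footprint section of the Dynamic Parallelism documentation says the device runtime reserves backing-store memory (it cites a conservative figure of up to ~150MB per level of synchronization depth, plus a pool for pending launches), and those reservations are controlled through cudaDeviceSetLimit. Here is a sketch; it is an assumption on my part that the ~2.2GB you see is tied to these limits, and I have not verified that lowering them shrinks it:

#include <cuda_runtime.h>
#include <iostream>

__global__ void kernel()
{
    cudaDeviceSynchronize(); // device-side sync, pulls in the device runtime
}

int main()
{
    size_t v;
    // Defaults per the Dynamic Parallelism docs: sync depth 2,
    // pending launch count 2048.
    cudaDeviceGetLimit(&v, cudaLimitDevRuntimeSyncDepth);
    std::cout << "sync depth: " << v << std::endl;
    cudaDeviceGetLimit(&v, cudaLimitDevRuntimePendingLaunchCount);
    std::cout << "pending launches: " << v << std::endl;

    // Try shrinking the limits (before any kernel launch) and see
    // whether the reported reservation drops.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 1);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 128);

    size_t gpu_free, gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);
    std::cout << gpu_total - gpu_free << std::endl;
    return 0;
}

Compile it the same way (-rdc=true -lcudadevrt) so the device runtime is actually linked in.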
You’re welcome to file a bug if you wish.