I’m rather new to CUDA and I was wondering if I could get some pointers on GPU memory allocation.
I have this very simple testing program here:
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

using std::vector;

struct heh
{
    unsigned char *a;
    unsigned int b;
    heh(unsigned int _b);
    ~heh();
};

heh::heh(unsigned int _b)
{
    b = _b;
    cudaError_t err = cudaMallocManaged(&a, b);
    if (err != cudaSuccess)
        printf("Error: %s\n", cudaGetErrorString(err));
}

heh::~heh()
{
    cudaFree(a);
}

int main()
{
    srand(5);
    vector<heh*> neato;
    unsigned int amount = 10000;
    unsigned int size = 1;
    for (unsigned int a = 0; a < amount; a++)
    {
        heh *mem = new heh(size);
        for (unsigned int b = 0; b < size; b++)
            mem->a[b] = rand() % 256;
        neato.push_back(mem);
    }
    unsigned long tot = 0;
    for (unsigned int q = 0; q < 10000; q++)
    {
        tot = q;
        for (unsigned int a = 0; a < amount; a++)
            for (unsigned int b = 0; b < size; b++)
                tot += neato[a]->a[b];
    }
    printf("Wow %lu\n", tot);
    for (unsigned int a = 0; a < amount; a++)
        delete neato[a];
    cudaDeviceReset();
    return 0;
}
And I run it via nvprof, in two ways: first with size set to 10000 and amount to 1, then with the numbers reversed. It’s to be expected that there’s more overhead with 10,000 one-byte objects than with one ~10 KB object. What I don’t understand, however, is that when I run it with 10,000 one-byte objects, Task Manager / Visual Studio says my GPU is using ~800 MB of memory, while nvprof says the program used about 40 MB:
C:\Users\Syerjchep\source\repos\MyCuda\x64\Debug>nvprof ./MyCuda.exe
==15468== NVPROF is profiling process 15468, command: ./MyCuda.exe
Wow 1282703
==15468== Profiling application: ./MyCuda.exe
==15468== Warning: Found 49 invalid records in the result.
==15468== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==15468== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 57.46% 765.90ms 10000 76.589us 16.950us 167.90ms cudaMallocManaged
39.69% 529.00ms 10000 52.899us 28.347us 1.4556ms cudaFree
2.79% 37.125ms 1 37.125ms 37.125ms 37.125ms cudaDeviceReset
0.05% 654.03us 45 14.533us 292ns 318.25us cuDeviceGetAttribute
0.01% 163.65us 1 163.65us 163.65us 163.65us cuDeviceGetName
0.00% 8.7670us 1 8.7670us 8.7670us 8.7670us cuDeviceTotalMem
0.00% 2.6300us 3 876ns 292ns 2.0460us cuDeviceGetCount
0.00% 1.4610us 2 730ns 292ns 1.1690us cuDeviceGet
==15468== Unified Memory profiling result:
Device "GeForce GTX 980 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
10000 4.0000KB 4.0000KB 4.0000KB 39.06250MB 11.00649ms Device To Host
Not only is the discrepancy between nvprof and my other diagnostics odd, but it means each of those objects is using somewhere between 4 KB and 80 KB of memory to store one byte of data. Is this amount of overhead normal?
(It should be noted that RAM usage is minimal and that if I set amount higher the program tends to just run out of GPU memory and crash.)