Allocating Class members with cudaMallocManaged

I have implemented two types of data structures as follows:
class Set {
public:
    CLAUSE *clauses;
    int numClauses;
    void allocator() {
        cudaMallocManaged((void **)&clauses, numClauses * sizeof(CLAUSE));
    }
    void set_nClauses(int size) { numClauses = size; }
    CLAUSE *clause(int idx) { return &clauses[idx]; }
};


class CLAUSE {
public:
    int *clause;
    int size;
    int status;
    void allocator() {
        cudaMallocManaged((void **)&clause, size * sizeof(int));
    }
    void set_size(int cl_sz) { size = cl_sz; }
};


To use the Set data structure, I allocated memory space like this:

Set *my_set;
cudaMallocManaged((void **)&my_set, sizeof(Set));
my_set->set_nClauses(30000);
my_set->allocator();
double set_bytes = sizeof(Set) + sizeof(CLAUSE) * 30000;
for (size_t i = 0; i < 30000; i++) {
    my_set->clause(i)->set_size(3);
    my_set->clause(i)->allocator();
    set_bytes += sizeof(int) * 3;
}

The above code worked fine and successfully filled the Set structure. I also tracked the total size of all allocations with the set_bytes counter to make sure it doesn't exceed the device's dedicated memory; it turned out to be ~1 MB. However, when I monitored the GPU's dedicated memory in the Windows 10 Task Manager, about 2 GB had been consumed, which is insanely large compared to the actual data size. Can someone explain what's really happening here? I'm running out of GPU memory and I'm not even close to using large data sets. I suspected the Task Manager at first, but when I increased the Set size to around 100,000 clauses, the program crashed due to insufficient memory!

Let’s see.

1MB for 30,000 elements, so each element uses ~32 bytes.

If you do 30,000 cudaMallocManaged operations, and each one uses the minimum page-size allocation, you're going to need at least 4 kbytes * 30,000 = 120 MB. I'm not sure the page size is 4 kbytes, but if you want to know, it should be easy to write a test case to figure it out.
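A test case along those lines might look like the following sketch. It queries free device memory with cudaMemGetInfo before and after a batch of tiny managed allocations and divides the difference by the allocation count (the allocation count and request size here are arbitrary choices for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 1000;   // number of tiny allocations to measure
    void *ptrs[N];
    size_t freeBefore, freeAfter, total;

    cudaMemGetInfo(&freeBefore, &total);
    for (int i = 0; i < N; i++) {
        // each request is far smaller than one page
        cudaMallocManaged(&ptrs[i], 3 * sizeof(int));
    }
    cudaMemGetInfo(&freeAfter, &total);

    // if each allocation is rounded up to a page, this will report
    // the page granularity rather than 12 bytes
    printf("approx bytes consumed per allocation: %zu\n",
           (freeBefore - freeAfter) / N);

    for (int i = 0; i < N; i++) cudaFree(ptrs[i]);
    return 0;
}
```

One caveat: managed memory may not become resident on the device until first touch, so on some systems cudaMemGetInfo can under-report the eventual footprint; treat the number as an estimate.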

I can’t explain 2GB, but I can explain a number a lot larger than 1MB.

You may want to rethink things. A huge number of really tiny allocations is a bad strategy with GPUs. In the managed allocation case, it means that when you launch a kernel on windows, you’re going to trigger 30,000 individual cudaMemcpy operations (effectively, under the hood). That could take a while and give you dismal transfer performance.

Aha, I see. The page size might explain the memory explosion, since I'm using a unified memory space. Then, what is a suitable solution for this situation? Should I allocate the memory inside a kernel with malloc/new?


allocate all 30000 elements in a single call to cudaMallocManaged
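For illustration, a pooled version of the allocation could look like this sketch (not the original poster's exact code; struct layout and counts are taken from the thread, the rest is an assumption). Two managed allocations replace the 30,001 small ones: one for all the CLAUSE headers, and one pool of ints that each clause points into:

```cuda
#include <cuda_runtime.h>

struct CLAUSE {
    int *clause;
    int size;
    int status;
};

int main() {
    const int numClauses = 30000;
    const int litsPerClause = 3;

    // one managed allocation for all CLAUSE headers...
    CLAUSE *clauses;
    cudaMallocManaged((void **)&clauses, numClauses * sizeof(CLAUSE));

    // ...and one for all the literals, carved up below
    int *pool;
    cudaMallocManaged((void **)&pool,
                      numClauses * litsPerClause * sizeof(int));

    for (int i = 0; i < numClauses; i++) {
        clauses[i].size = litsPerClause;
        clauses[i].clause = pool + i * litsPerClause;  // pointer into the pool
    }

    // ... launch kernels that use clauses[] ...

    cudaFree(pool);
    cudaFree(clauses);
    return 0;
}
```

Besides avoiding per-allocation page overhead, this layout keeps the data contiguous, so migration to the device happens in large chunks rather than 30,000 tiny ones.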

Thanks txbob, it worked. A single call to allocate all elements at once consumed only about 130 MB, which is close to your estimate, by the way.

Thanks again.