Allocating Class members with cudaMallocManaged

I have implemented two types of data structures as follows:
class CLAUSE {
    int *clause;
    int size;
    int status;
public:
    void allocator() {
        cudaMallocManaged((void **)&clause, size * sizeof(int));
    }
    void set_size(int cl_sz) { size = cl_sz; }
};

class Set {
    CLAUSE *clauses;
    int numClauses;
public:
    void allocator() {
        cudaMallocManaged((void **)&clauses, numClauses * sizeof(CLAUSE));
    }
    void set_nClauses(int size) { numClauses = size; }
    CLAUSE *clause(int idx) { return &clauses[idx]; }
};

To use the Set data structure, I allocated memory like this:

Set *my_set;
cudaMallocManaged((void **)&my_set, sizeof(Set));
my_set->set_nClauses(30000);
my_set->allocator();
double set_bytes = sizeof(Set) + sizeof(CLAUSE) * 30000;
for (size_t i = 0; i < 30000; i++) {
    my_set->clause(i)->set_size(3);
    my_set->clause(i)->allocator();
    set_bytes += sizeof(int) * 3;
}

The above code worked and successfully filled the Set structure. I also tracked the total size of all allocations in the set_bytes counter to make sure it wouldn't exceed the device's dedicated memory; it came to ~1 MB. However, when I monitored the GPU's dedicated memory in the Windows 10 Task Manager, about 2 GB had been consumed, which is enormous compared to the actual data size. Can someone explain what's really happening here? I'm running out of GPU memory and I'm nowhere close to using large data sets. At first I suspected Task Manager, but when I increased the Set size to around 100,000 clauses, the program crashed due to insufficient memory!

Let’s see.

1 MB for 30,000 elements, so each element uses ~32 bytes.

If you do 30,000 cudaMallocManaged operations, and each one is rounded up to the minimum page-size allocation, you're going to need at least 4 KB * 30,000 = 120 MB. I'm not sure the page size is 4 KB, but if you want to know, it should be easy to write a test case to figure it out.
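A rough sketch of such a probe (hypothetical; how cudaMemGetInfo reflects managed allocations varies by OS and driver, and on Windows/WDDM pages may not become device-resident until first touched from the GPU, so treat the numbers as an estimate):

```cpp
// probe.cu -- estimate the per-allocation overhead of many tiny
// cudaMallocManaged calls by watching free device memory shrink.
// Build with: nvcc probe.cu -o probe
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 10000;       // number of tiny allocations
    const size_t request = 4;  // bytes requested per allocation
    static void *ptrs[N];

    size_t freeBefore, freeAfter, total;
    cudaMemGetInfo(&freeBefore, &total);

    for (int i = 0; i < N; i++)
        cudaMallocManaged(&ptrs[i], request);

    cudaMemGetInfo(&freeAfter, &total);
    printf("requested %zu bytes, free memory dropped by %zu bytes "
           "(~%zu bytes per allocation)\n",
           (size_t)N * request, freeBefore - freeAfter,
           (freeBefore - freeAfter) / N);

    for (int i = 0; i < N; i++)
        cudaFree(ptrs[i]);
    return 0;
}
```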

I can’t explain 2GB, but I can explain a number a lot larger than 1MB.

You may want to rethink things. A huge number of really tiny allocations is a bad strategy with GPUs. In the managed allocation case, it means that when you launch a kernel on Windows, you're going to trigger 30,000 individual cudaMemcpy operations (effectively, under the hood). That could take a while and give you dismal transfer performance.

Aha, I see. The page size might explain the memory explosion, since I'm using a unified memory space. What would be a suitable solution in this situation? Should I allocate the memory inside a kernel with malloc/new?

no

allocate all 30000 elements in a single call to cudaMallocManaged

Thanks txbob, it worked. A single call to allocate all elements at once consumed only about 130 MB, which is close to your estimate, by the way.

Thanks again.