What is the minimum allocation size in cudaMalloc?

I am teaching myself CUDA… it’s really hard for me…
I used the cudaMalloc API.
For example,
cudaMalloc((void **)&d_test, 10 * sizeof(int));
I thought this line would make d_test use 40 bytes.
But in practice, about 2 MB (2,097,152 bytes) is allocated…

I attach the device memory info below.

free memory before: 2601792307 bytes
free memory after: 2599695155 bytes

What is the minimum allocation size in cudaMalloc?

My real data set is only about 12 MB.
But when I call cudaMalloc, it takes more memory than the real size…
My graphics card has 3 GB of memory and the test data set is 12 MB,
but I am running out of memory… I am panicking now…
Please help me…

If you perform another small cudaMalloc() (two successive calls in a row), does it reuse the same 2 MB chunk of device memory? Because that’s what I would expect.

Also, the first call to cudaMalloc() might have created the CUDA context. That’s a lot of overhead, which also consumes memory on both the host and the device.


Dear cbuchner1,

Yes, that’s right. I just checked that.

I allocated 400,000 bytes (100,000 * sizeof(int)) six times.
Here is the result:

current free memory: 2601792307 bytes
test free memory: 2599695155 bytes
test free memory: 2599695155 bytes
test free memory: 2599695155 bytes
test free memory: 2599695155 bytes
test free memory: 2599695155 bytes
test free memory: 2597598003 bytes

My guess is that once the total size exceeds 2 MB, the system allocates a new 2 MB chunk… but I can’t be sure,
because I don’t know exactly how the CUDA memory system works…

I have a question: does the C language’s malloc also allocate like this…?

The C language passes malloc() and free() (as well as the new/delete operators) through to runtime libraries. These runtime libraries are vendor-specific in their implementation details. The runtime usually performs some kind of memory pooling based on allocation-size heuristics.

There are even special libraries, such as Google Performance Tools, that install their own high-performance heap implementation when linked against (besides adding useful debugging and profiling features).

At the lowest level the OS hands out allocations in the process memory space with a granularity that is a multiple of the system’s page size (say 4 KB or 16 KB). The heap implementation is then built on top and squeezes multiple small allocations into the same page to maximize memory utilization. I would think that CUDA does a similar thing, but maybe with larger pages that are natively supported by the hardware.

So you mean the C language can resolve this allocation-size issue using special libraries such as Google Performance Tools.
What about CUDA? Can it also be resolved using libraries…?

You can always implement your own memory heap management or pooling schemes if you’re unhappy with what you’re getting from the supported API calls.

I am not aware of any ready-to-use implementations for CUDA.

Google Performance Tools was developed mostly to speed up Linux-based memory heap operations a great deal for everyday workloads.

Interesting. I hadn’t heard of Google Performance Tools for heap management acceleration.

As far as I am aware, the most common reason to install third-party heap managers for CPUs in the past was that many of the ones built into operating systems suffered from a “giant global lock” issue, which hurt performance as average thread count continued to increase while only one thread could be active in the heap at any given time. This was certainly done routinely to boost SPEC CPU benchmark numbers; how many people deployed such third-party managers in production I do not know. One such product supported Windows, Linux, Solaris, HP UX, and IBM AIX.

As far as I am aware, cbuchner1’s description of a multi-tiered allocation mechanism applies to CUDA as well, and you are observing evidence of that. In such multi-tiered schemes, the cost of allocation goes up (sometimes dramatically) the deeper one gets into the layers of allocators.

Two design principles that apply to dynamic memory allocation on both CPUs and GPUs are that high-performance codes (1) should avoid frequent allocation/deallocation operations and (2) are often better off using a user-provided custom allocator tailored to the specific needs of a particular use case.