Dynamic Heap initialization

I have a big chunk of memory that I want to copy from host to device.
In rare conditions, the device code needs to reallocate this chunk to grow it (a single thread will do this and global synchronization is done afterwards).
Therefore, I decided to use dynamic allocation of global memory using malloc/free.

Unfortunately, I cannot free memory in a device function that was allocated using cudaMalloc, right?
So I need to allocate the memory directly on device.
My question now is: how do I initialize this device memory? I tried copying the pointer (created with malloc on the device) back to the host and using cudaMemcpy, but compute-sanitizer gives me:
Program hit cudaErrorInvalidValue (error 1) due to “invalid argument” on CUDA API call to cudaMemcpy

#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>

#define gpuErrchk(ans)                                                         \
  { gpuAssert((ans), __FILE__, __LINE__); }

__inline__ void gpuAssert(cudaError_t code, const char *file, int line) {
  if (code != cudaSuccess) {
    printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    std::cout << std::endl;
  }
}

__global__ void dynamicAllocKernel(float** devicePtr, size_t numElements) {
  if (threadIdx.x == 0) { // Let only one thread allocate
    *devicePtr = (float*)malloc(numElements * sizeof(float));
  }
  __syncthreads(); // Ensure allocation is done before proceeding
}

int main() {
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024); // Set heap size to 128 MB

  // Pointer to hold the device-side pointer
  float** devicePtr;
  cudaMalloc(&devicePtr, sizeof(float*));

  size_t numElements = 1000; // Number of elements to allocate
  // Launch kernel to allocate memory on the device heap
  dynamicAllocKernel<<<1, 1>>>(devicePtr, numElements);

  // Copy the device-side pointer to the host
  float* deviceAllocatedPtrHostSide;
  cudaMemcpy(&deviceAllocatedPtrHostSide, devicePtr, sizeof(float*), cudaMemcpyDeviceToHost);

  // Copy data from host to device
  float* hostArray = new float[numElements];
  for (size_t i = 0; i < numElements; i++) {
    hostArray[i] = static_cast<float>(i);
  }
  cudaMemcpy(deviceAllocatedPtrHostSide, hostArray, numElements * sizeof(float), cudaMemcpyHostToDevice);

  cudaFree(deviceAllocatedPtrHostSide); // Free device memory
  cudaFree(devicePtr); // Free device pointer
  delete[] hostArray;
}

Optional question: Is there any performance difference when using (not allocating) this memory, or does it behave exactly like global memory from cudaMalloc? I’m talking about 4 GB - 6 GB of memory allocated for the heap.

Instead of answering your direct question, I present some approaches for your needs:

Zeroth Approach

Why is it not possible to allocate more memory from the host in the first place?
What prevents you from making the buffer larger than typically needed, but large enough for all cases, so that no reallocation is necessary?

This would be the simplest case.

However, could this additional memory possibly be needed for something else, where both needs would not occur at the same time?

First Approach

In that case, could you store that something else at the end of the overly large buffer (in case you want to use the memory for the something else instead of enlarging the buffer)?

Basically you would do some memory management yourself: allocate memory for more than one array from the host in one allocation, but decide in the kernels which arrays are stored where.

That would work well with two or a few buffers (something like a union), but becomes more involved for lots of buffers, where any of them may have to be enlarged and you want to keep them contiguous.
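
Here is a minimal sketch of this packing idea, assuming just two arrays in one cudaMalloc allocation; all names and sizes (useBothArrays, capacityA, ...) are made up for illustration:

// Sketch: pack two arrays into a single cudaMalloc allocation and let the
// kernel pick them apart via host-chosen offsets.
#include <cuda_runtime.h>

__global__ void useBothArrays(float* pool, size_t offsetA, size_t offsetB, size_t nA, size_t nB) {
    float* arrayA = pool + offsetA; // first sub-array inside the pool
    float* arrayB = pool + offsetB; // second sub-array, placed after A's reserved capacity
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nA) arrayA[i] = 1.0f;
    if (i < nB) arrayB[i] = 2.0f;
}

int main() {
    const size_t nA = 1 << 20, nB = 1 << 18;  // current element counts (example values)
    const size_t capacityA = 2 * nA;          // reserve extra room so A can "grow" without reallocation
    float* pool = nullptr;
    cudaMalloc(&pool, (capacityA + nB) * sizeof(float)); // one allocation for both arrays
    const size_t offsetA = 0;                 // A starts at the beginning
    const size_t offsetB = capacityA;         // B starts after A's reserved capacity
    const unsigned blocks = static_cast<unsigned>((capacityA + nB + 255) / 256);
    useBothArrays<<<blocks, 256>>>(pool, offsetA, offsetB, nA, nB);
    cudaDeviceSynchronize();
    cudaFree(pool);
}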

Second Approach

Another possibility: Use Managed Memory with enough hints so that the memory stays on the device. Hopefully in this case it is as fast as unmanaged memory, but managed memory allows oversubscribing the GPU memory.

The advantage compared to the previous approach is that you work with virtual addresses, so memory blocks do not have to be rearranged, as could become necessary in some scenarios of the previous approach (rearranging is difficult in a parallel CUDA program during kernel runs, because the other threads still hold the old pointers, unless you switch to a new kernel at that point).

With virtual memory, free memory can be used to extend any of several arrays. They have to be declared large enough in the first place, but physical memory is only used once you actually need the additional space. So there are enough contiguous memory addresses for each of the enlarged arrays.
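
Here is a minimal sketch of this approach, assuming the data should stay resident on the GPU; the sizes and the specific advise/prefetch calls are illustrative, not a tuned recipe:

// Sketch: oversized managed allocation with hints to keep it resident on the GPU.
// Physical pages are only backed when touched, so the reserve costs little up front.
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    const size_t reservedBytes = size_t(4) << 30;   // reserve 4 GiB of virtual space (example)
    float* data = nullptr;
    cudaMallocManaged(&data, reservedBytes);

    // Hint that the data should live on the GPU.
    cudaMemAdvise(data, reservedBytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemAdvise(data, reservedBytes, cudaMemAdviseSetAccessedBy, device);

    // Optionally migrate the initially used portion to the GPU up front.
    const size_t initiallyUsedBytes = size_t(512) << 20; // 512 MiB actually populated now
    cudaMemPrefetchAsync(data, initiallyUsedBytes, device);
    cudaDeviceSynchronize();

    // ... launch kernels; when an array "grows", threads simply start using
    // addresses further into the reserved range, no reallocation needed ...

    cudaFree(data);
}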

Third Approach

A third way would be to do some virtualization by yourself. Take the first approach, but do not demand that all the arrays have to be continuous. That way you do not need to copy/move blocks in some of the scenarios.

Instead of

int idx;
specific_gmem[idx];

use

int idx;
int highindex = idx / 1048576;
int lowindex = idx % 1048576;
general_gmem[vtable[highindex] | lowindex];

specific_gmem is a single array, whereas general_gmem contains lots of arrays, any of which may potentially have to use additional memory.

This just describes the general approach. In practice you could simplify some things and hardcode others so that you do not need a vtable access for every memory access.

E.g. the vtable[highindex] is determined outside of loops or there are only two or a few possibilities, which are decided with booleans.

In a simplified case of your original problem, you would have three arrays: normal_array1, normal_array2 and grow_array and would switch between them depending on the index. grow_array could be used for either.

int idx;
if (idx < normalsize1)
    normal_array1[idx];
else
    if (grown1 && !grown2)
        grow_array[idx - normalsize1];

int idx2;
if (idx2 < normalsize2)
    normal_array2[idx2];
else
    if (grown2 && !grown1)
        grow_array[idx2 - normalsize2];

The difference between that and the general_gmem approach is that it uses separately allocated arrays instead of one large array, and that it uses booleans with a few decisions instead of a vtable. But you can mix those approaches. One large array allows more flexibility, and you can switch between the sub-arrays just by changing the index (or calculating it differently).
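
Here is a minimal sketch of the vtable indexing idea, assuming one large pool allocated up front and fixed-size blocks of 2^20 elements; the names (at, touch) and the block size are made up for illustration:

// Sketch: map a logical index to a slot in one big pool through a small block
// table ("vtable"). Growing a logical array then just means assigning it
// another block entry instead of moving data around.
#include <cuda_runtime.h>

constexpr unsigned kBlockShift = 20;                     // 2^20 = 1048576 elements per block
constexpr unsigned kBlockMask  = (1u << kBlockShift) - 1;

// vtable[highindex] holds the base offset (in elements) of that logical block in the pool.
__device__ __forceinline__ float& at(float* general_gmem, const unsigned* vtable, unsigned idx) {
    unsigned highindex = idx >> kBlockShift;             // which logical block
    unsigned lowindex  = idx &  kBlockMask;              // offset inside the block
    return general_gmem[vtable[highindex] + lowindex];   // '+' instead of '|' also works for unaligned bases
}

__global__ void touch(float* general_gmem, const unsigned* vtable, unsigned n) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) at(general_gmem, vtable, i) += 1.0f;      // every access goes through the table
}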


Thank you very much for all these possibilities. I’m not sure I fully comprehend all of them, but I think they all boil down to some kind of my own memory management.

The description of my problem here is of course simplified. In general I have one rather big array and 100k’s of smaller arrays. All these arrays could potentially grow, usually not often and not by too much, but some of them might grow more than others. If I simply allocate much more memory evenly for all of them, I’m wasting a lot of memory and will run out of memory very quickly once one of the arrays is full.
I could preallocate the complete RAM myself and “simply” write my own memory allocator, so I can realloc my arrays as needed. But this is very complicated (keeping track of holes, reusing old memory, etc.), and it is what new/delete usually do for you, so I don’t want to implement a worse version myself.
The “heap” seems to be exactly this: a “preallocated/reserved” part of the memory which is automatically managed by new/delete.

So:
Zero: Unfortunately no, as I have many arrays and they all might grow differently.
First: Basically my own memory management, but way too complex.
Second: I’m not too experienced with managed memory, but is it possible to realloc it on the device, i.e. inside a kernel call? I try to avoid leaving my kernel for the reallocation, as this would require more synchronization and also produce quite some overhead. Also, all shared memory would be lost and would need to be saved and reinstantiated somehow.
Third: Yes, this would be another kind of memory manager, coming with some overhead. It might not be as complicated as the First approach, but it still does something that new/delete should do for you.

Thanks a lot for your answers, I really appreciate it. I would still like to try the built-in methods (new/delete) to avoid writing a lot of probably buggy and slow code myself.

I understand that you want to use the built-in methods and hope that you get a good reply for it. I have never used dynamic memory management in kernels and am also awaiting an answer.

So probably having one big array, some medium-sized arrays and 100ks of smaller arrays would also not work? And instead of reallocating you assign one of the medium-sized ones?

The Second Approach would not have to leave the kernel. All arrays are allocated with a large size, but actual memory is only used if the arrays are accessed, and this is true even for parts of the arrays (with page granularity?). I would just hope that the performance is good enough. Normally managed memory is used to automatically move and copy memory between host and device, and for that it is quite slow. But in recent versions special commands and hints were introduced, so perhaps for your use case it could work in a fast way (especially if it should not automatically copy or move to and from the host, and you only want to use the feature that it widens the arrays on demand, without leaving the kernel).

CUDA typically uses 2 MiB page sizes (see the Programming Guide), so handling 100k’s of smaller arrays is perhaps a bit much, as the granularity of virtual memory is the page size. Can you reduce the number to 1k, e.g. by combining some arrays?

As you have so many arrays, I would fear that the CUDA-provided memory allocation functions are slow. Ideally cudaMalloc is only run at the beginning of the program.

I would expect that you end up at a mixture of provided facilities and some manual problem-specific solution.


To answer your initial questions: It is not possible to use the device heap in the way you intended.
The device heap has limitations with regard to interoperability with the CUDA runtime API. You cannot free cudaMalloc’d memory from within a kernel, and you cannot use in-kernel allocations in runtime API calls. See the CUDA C Programming Guide:

Memory allocated via device malloc() or __nv_aligned_device_malloc() cannot be freed using the runtime (i.e., by calling any of the free memory functions from Device Memory).

Similarly, memory allocated via the runtime (i.e., by calling any of the memory allocation functions from Device Memory) cannot be freed via free().

In addition, memory allocated by a call to malloc() or __nv_aligned_device_malloc() in device code cannot be used in any runtime or driver API calls (i.e. cudaMemcpy, cudaMemset, etc).

I don’t know if you already have a working implementation which you are trying to optimize, or if you need to get a working implementation in the first place.
I would suggest putting multiple arrays in a single cudaMalloc buffer and keeping a separate list of array sizes and array offsets within the buffer. Then have a kernel which computes the new array sizes. If the size of any array in a buffer needs to be adjusted, compute the new total buffer size and the new offset array, allocate a new buffer, and use cub::DeviceMemcpy::Batched to copy the existing array elements from the old buffer to the new buffer. Last, free the old buffer and the old offsets/sizes.
Of course, you need to figure out a suitable number of buffers for your use-case but to get things started, you could just use a single buffer.
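
Here is a rough host-side sketch of that reallocation step, assuming float arrays and device-side arrays of source/destination pointers and byte counts for cub::DeviceMemcpy::Batched; all of the surrounding bookkeeping (reallocPackedArrays, offsets, newCaps, ...) is made up for illustration:

// Sketch: grow arrays packed into one cudaMalloc buffer by allocating a new
// buffer and batch-copying the old contents. The bookkeeping layout is illustrative.
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

void reallocPackedArrays(float*& d_buffer,
                         std::vector<size_t>& offsets,        // element offset of each array
                         const std::vector<size_t>& oldSizes, // current element counts
                         const std::vector<size_t>& newCaps)  // desired new capacities
{
    const uint32_t numArrays = static_cast<uint32_t>(oldSizes.size());

    // Compute new offsets / total size on the host (could also be done in a kernel).
    std::vector<size_t> newOffsets(numArrays);
    size_t total = 0;
    for (uint32_t i = 0; i < numArrays; ++i) { newOffsets[i] = total; total += newCaps[i]; }

    float* d_newBuffer = nullptr;
    cudaMalloc(&d_newBuffer, total * sizeof(float));

    // Build per-array source/destination pointers and byte counts, move them to the device.
    std::vector<void*> srcs(numArrays), dsts(numArrays);
    std::vector<size_t> bytes(numArrays);
    for (uint32_t i = 0; i < numArrays; ++i) {
        srcs[i]  = d_buffer + offsets[i];
        dsts[i]  = d_newBuffer + newOffsets[i];
        bytes[i] = oldSizes[i] * sizeof(float);
    }
    void** d_srcs; void** d_dsts; size_t* d_bytes;
    cudaMalloc(&d_srcs, numArrays * sizeof(void*));
    cudaMalloc(&d_dsts, numArrays * sizeof(void*));
    cudaMalloc(&d_bytes, numArrays * sizeof(size_t));
    cudaMemcpy(d_srcs, srcs.data(), numArrays * sizeof(void*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_dsts, dsts.data(), numArrays * sizeof(void*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_bytes, bytes.data(), numArrays * sizeof(size_t), cudaMemcpyHostToDevice);

    // Two-phase CUB call: query temporary storage, then run the batched copy.
    void* d_temp = nullptr; size_t tempBytes = 0;
    cub::DeviceMemcpy::Batched(d_temp, tempBytes, d_srcs, d_dsts, d_bytes, numArrays);
    cudaMalloc(&d_temp, tempBytes);
    cub::DeviceMemcpy::Batched(d_temp, tempBytes, d_srcs, d_dsts, d_bytes, numArrays);
    cudaDeviceSynchronize();

    // Free the old buffer and temporaries, publish the new layout.
    cudaFree(d_temp); cudaFree(d_srcs); cudaFree(d_dsts); cudaFree(d_bytes);
    cudaFree(d_buffer);
    d_buffer = d_newBuffer;
    offsets = newOffsets;
}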


If you give us a few hints about how the buffers are used, we can help more.

The big chunk of memory is copied from host to device. Is this read-only by the device? Or do you edit that memory in-place?

Sometimes buffers have to grow. Are the buffers localized (block-wide) or grid-wide?
How do you synchronize and share the new pointer with the other threads or blocks?

How much memory does your GPU have and how much are you currently using?

Thank you all again for the workarounds.

@striker159 Yes, you are right, apparently I can’t use cudaMemcpy for heap-allocated memory.
So if I want to have a huge heap (say 90% of the RAM), I could use the remaining 10% to copy data from host to global memory, then run a kernel that copies those values into my heap memory arrays, and rinse and repeat up to 9 times until all my data is in the heap. It just does not sound like a sensible idea :D
It will also always leave 10% of the RAM unavailable for my application. So I guess my real new question is:

Is there any hardware or other reason why cudaMalloc/cudaMemcpy/free/malloc are not interoperable? Or in other words:
What is the reason for the heap existing in the first place? Why don’t malloc/free in device code work on “normal” global memory?

@Curefab To answer a few of your other questions:
I do not have a (completely) running application yet; I’m trying to port an application to the GPU in the hope that it speeds up computation. This is a learning application (but not a neural network) that learns data over time, so it must be able to store more and more information over time. Of course, what kind of information this is depends on heuristics/input/etc. and is not known beforehand, so I do not know which parts of the data grow at which speed.

  1. The input to my program is static and read-only. It consists of a big vector A (it can be gigabytes) and a lot of small vectors B (they have the same size as vector A but are scattered into a lot of separate lists). If these buffers didn’t need to grow, there wouldn’t be any problem.
    Now it can happen that new information is learnt. Buffer 1 simply increases a little bit (8 bytes to ~400 bytes) and at the same time 2 to 100 of the small vectors grow by 4 bytes each. So vector B is a different memory layout of vector A, which is needed to efficiently index into the data. Mathematically, there is no way to predict how the data increases and which parts are affected.

The current implementation restricts itself to one big block with as many threads as possible (1024-1584) so synchronization issues get a bit easier as I have to deal with global memory.

If I grow one of the vectors, the new pointer would need to be shared inside the block/all threads.

I’m developing this on a laptop GPU (4 GB RAM). The amount of memory the program uses depends on the input: it can be quite small (<500 MB), and up to 8 GB is normal for the CPU version of this program. The final program should be able to run on any kind of modern GPU. I’m fine with requiring quite modern compute capabilities.

@Curefab The managed memory approach seems interesting. It feels like the wrong tool for my application, but I guess it would achieve something I need without a lot of effort. The 2 MB page size is quite high, though. I guess this means that this is the amount by which every vector will grow automatically, right?

Hi chaosangel,

you do not need the 10% buffer: you can use the zero-copy approach, which directly streams data from CPU RAM to the GPU threads over PCIe. It is performant for one-time loading, but less so if you need the same data several times without the threads storing it (with storing it is fine), as PCIe is much slower than global device memory.
The advantage is that you can use the large CPU memory, but the data has to be physically pinned, so no CPU RAM swapping is allowed for those data buffers.

A heap working across all SMs and for CUDA commands on the host would need much more synchronization than a heap working more locally. That is probably the main reason for the reduced interoperability.

You probably need a data structure which handles all the buffers, similar to a linked list or a hash table. Both are not very nice in CUDA, because memory accesses are not coalesced.

You could consider how to at least load blocks of 32 bytes at a time (the transaction size), e.g. by having 8 threads of a warp cooperate or by using wider data types for the memory accesses.

Do neighbouring threads access neighbouring buffers, or do they access them randomly? In the first case, I would try to coalesce those accesses (loading several buffers at once); in the second case, I would have 2 to 8 threads cooperate to load each buffer. The computations can then be done by separate single threads again.

In problems like yours, the memory access pattern is typically what decides the performance.

Hi @Curefab
I’m still trying to find information about the zero-copy approach, but are you saying that I can copy data from the host to the “heap” with this?

Generally my memory access patterns are very random (this is problem-specific and I can’t do much about it), but I work in warps which access their specific memory coalesced. So my memory access is good; I just sacrificed some amount of actual parallel computation, which is fine for me and also not my first concern. I first want to be able to run this application with growing vectors before thinking about optimizing it.

Yes, with zero-copy you pass a pointer as a parameter to the kernel, but instead of loading from global device memory, the pointer loads from pinned CPU RAM.

The kernels can then do what they like with the data, including copying it into the heap.
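
For illustration, a minimal sketch of such a kernel, assuming the mapped (zero-copy) host pointer is passed in as hostMapped and the destination heapArray was allocated earlier with in-kernel malloc (both names are hypothetical):

// Sketch: fill a device-heap allocation from a zero-copy (mapped host) pointer.
// heapArray is assumed to have been allocated earlier with in-kernel malloc.
__global__ void fillHeapFromHost(const float* __restrict__ hostMapped, // device pointer to pinned host memory
                                 float* heapArray,                     // in-kernel malloc'd destination
                                 size_t numElements) {
    // Grid-stride loop: each element is read once over PCIe and stored in the heap.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElements;
         i += size_t(gridDim.x) * blockDim.x) {
        heapArray[i] = hostMapped[i];
    }
}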

With modern architectures, 32-byte coalescing is especially important, so not the full 32-thread warp has to stay coalesced. And it can be further optimized by using memory accesses of 64 or 128 bits per thread, or by letting threads cooperate only for the memory accesses.

Cooperation of 2 threads, threadIdx.x == 0 and threadIdx.x == 1:
Cooperatively read two memory buffers. First both threads read two ints for the mybuf of thread 0, then two ints for the mybuf of thread 1.

This simple example could also be done more simply by directly reading an int2 with each thread.
But the example can be extended to wider data formats, i.e. each of the two threads reads 128 bits (int4 instead of int), so you get 32-byte coalesced reads (with cooperation of two threads) and computation separately for each thread.

int mybuf = ...; // which memory address each thread wants to load
int otherbuf = __shfl_sync(3, mybuf, 1 - threadIdx.x, 2); // exchange memory address

int data0 = memory[(threadIdx.x ? otherbuf : mybuf) + threadIdx.x]; // both threads read data intended for thread 0 in a coalesced way
int data1 = memory[(threadIdx.x ? mybuf : otherbuf) + 1 - threadIdx.x]; // both threads read data intended for thread 1 in a coalesced way (order not important in new architectures)
int datax = __shfl_sync(3, threadIdx.x ? data0 : data1, 1 - threadIdx.x, 2); // exchange read data for other thread

int2 data = { threadIdx.x ? data1 : data0, datax }; // each thread assembles its intended data

This code with shuffle instructions uses the shared memory infrastructure (shuffle instructions do), but less of it than copying the data to shared memory and back would.

Thank you, this is a nice example. But first things first :D
How can I access this host pointer inside my CUDA kernel? I don’t think I can access host memory, can I? Do you have a 3-line sample code or a link for this behaviour?

Hi chaosangel,

Here the zero-copy section shows a short example (11 lines, but simple and repeated for 3 buffers, so it should be acceptable on average; you can skip the other sections before the zero-copy one; the kernel is the same for all of them).

You have to pin already allocated host memory or allocate “pre-pinned” (not an official term) host memory, then get the device pointer for it, and then you can directly access the host memory. I think this has been available since Kepler (2012).
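
Here is a minimal sketch of that host-side setup, assuming cudaHostAlloc with cudaHostAllocMapped (cudaHostRegister with cudaHostRegisterMapped would be the variant for already allocated memory); the kernel and sizes are just for illustration:

// Sketch: allocate mapped ("zero-copy") pinned host memory, obtain the matching
// device pointer, and read it directly from a kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumMapped(const float* hostMapped, size_t n, float* result) {
    // Single-thread sketch: the reads stream over PCIe from pinned host memory.
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += hostMapped[i];
    *result = s;
}

int main() {
    const size_t n = 1 << 20;

    float* h_data = nullptr;
    // Pinned + mapped host allocation; cudaHostRegister(cudaHostRegisterMapped)
    // would be the variant for memory that was already allocated with new/malloc.
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Device-visible alias of the pinned host buffer.
    float* d_mapped = nullptr;
    cudaHostGetDevicePointer(&d_mapped, h_data, 0);

    float* d_result = nullptr;
    cudaMalloc(&d_result, sizeof(float));
    sumMapped<<<1, 1>>>(d_mapped, n, d_result);

    float result = 0.0f;
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", result);

    cudaFree(d_result);
    cudaFreeHost(h_data);
}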

The administrative API calls for pinning/registering memory can be quite slow and should not be included in benchmarks. Those calls should be done at the beginning of the program. The CPU can write into the pinned memory after it has been pinned/registered. So even if you want to change the data before later kernel calls, reuse the same already pinned buffer.

The pinned memory is removed from the physical memory that the operating system can use for swapping and other tasks. So make sure that you have enough RAM left (e.g. keep a few GB or more of your overall CPU RAM unpinned).