I have a big chunk of memory that I want to copy from host to device.
In rare conditions, the device code needs to reallocate this chunk to grow it (a single thread will do this and global synchronization is done afterwards).
Therefore, I decided to use dynamic allocation of global memory using malloc/free.
Unfortunately, memory that was allocated with cudaMalloc cannot be freed in a device function, right?
So I need to allocate the memory directly on device.
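For context, the grow step I have in mind looks roughly like this (an illustrative sketch only; the grid-wide synchronization and error handling are omitted):

```cuda
// Sketch of the device-side grow step: one thread reallocates the buffer.
// In the real code a global synchronization follows before anyone uses *buf.
__global__ void growBuffer(float** buf, size_t oldCount, size_t newCount) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float* grown = (float*)malloc(newCount * sizeof(float));
        if (grown != nullptr) {
            memcpy(grown, *buf, oldCount * sizeof(float)); // device-side memcpy
            free(*buf); // free() is valid here because *buf came from device-side malloc
            *buf = grown;
        }
    }
}
```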
My question now is: how do I initialize this device memory? I tried copying the pointer (created with malloc on device) back to the host and using cudaMemcpy, but compute-sanitizer gives me:
Program hit cudaErrorInvalidValue (error 1) due to “invalid argument” on CUDA API call to cudaMemcpy
#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>

#define gpuErrchk(ans) \
    { gpuAssert((ans), __FILE__, __LINE__); }

__inline__ void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        std::cout << std::endl;
    }
}

__global__ void dynamicAllocKernel(float** devicePtr, size_t numElements) {
    if (threadIdx.x == 0) { // Let only one thread allocate
        *devicePtr = (float*)malloc(numElements * sizeof(float));
    }
    __syncthreads(); // Ensure allocation is done before other threads proceed
}

int main()
{
    gpuErrchk(cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024)); // Set heap size to 128 MB

    // Pointer to hold the device-side pointer
    float** devicePtr;
    gpuErrchk(cudaMalloc(&devicePtr, sizeof(float*)));

    size_t numElements = 1000; // Number of elements to allocate

    // Launch kernel to allocate memory on the device
    dynamicAllocKernel<<<1, 1>>>(devicePtr, numElements);
    gpuErrchk(cudaDeviceSynchronize());

    // Copy the device-side pointer back to the host
    float* deviceAllocatedPtrHostSide;
    gpuErrchk(cudaMemcpy(&deviceAllocatedPtrHostSide, devicePtr, sizeof(float*), cudaMemcpyDeviceToHost));

    // Copy data from host to device
    float* hostArray = new float[numElements];
    for (size_t i = 0; i < numElements; i++) {
        hostArray[i] = i;
    }
    // This is the call that fails with cudaErrorInvalidValue:
    cudaMemcpy(deviceAllocatedPtrHostSide, hostArray, numElements * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(deviceAllocatedPtrHostSide); // Note: this buffer came from device-side malloc, not cudaMalloc
    cudaFree(devicePtr); // Free the pointer holder allocated with cudaMalloc
    delete[] hostArray;
}
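The only workaround I can think of so far is to stage the data in a cudaMalloc buffer (which cudaMemcpy can write to) and let a kernel copy it into the heap allocation. A rough, untested sketch (`copyIntoHeapBuffer` and `staging` are my own names):

```cuda
// Possible workaround sketch: cudaMemcpy host data into a cudaMalloc
// staging buffer, then copy it into the device-heap allocation in a kernel.
__global__ void copyIntoHeapBuffer(float** heapPtr, const float* staging,
                                   size_t numElements) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) {
        (*heapPtr)[i] = staging[i];
    }
}

// Host side:
//   float* staging;
//   cudaMalloc(&staging, numElements * sizeof(float));
//   cudaMemcpy(staging, hostArray, numElements * sizeof(float), cudaMemcpyHostToDevice);
//   copyIntoHeapBuffer<<<(numElements + 255) / 256, 256>>>(devicePtr, staging, numElements);
//   cudaFree(staging);
```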
Optional question: Is there any performance difference when using (not allocating) this memory, or does it behave exactly like global memory from cudaMalloc? I'm talking about 4 GB-6 GB of memory allocated for the heap.