Unable to allocate more than 2MB using malloc *in CUDA kernel*

Platform is Jetson Nano. I have a single call to malloc in my CUDA kernel which is only called once. This call cannot allocate more than 2MB of memory. It fails at 4MB.

HOWEVER, when I use cudaMalloc from the host code, it can allocate much more than 2MB.

Why does malloc in device code not work for large quantities of memory?

Please see the Heap Memory Allocation section of the CUDA Programming Guide. The default size of the device-side malloc heap is 8MB. The limit may have been reduced on mobile platforms.

If the limit is 8MB and the code still fails, please post a minimal reproducible example.
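For reference, here is a rough sketch of how the device heap limit can be queried and raised from host code. It assumes the standard runtime calls cudaDeviceGetLimit and cudaDeviceSetLimit with cudaLimitMallocHeapSize. The kernel name singleAlloc, the 64MB heap size, and the 4MB request are made up for illustration, so adjust them for your case. Treat it as a starting point rather than a verified fix for the Nano.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread makes a single device-side allocation.
__global__ void singleAlloc(size_t bytes)
{
	void* p = malloc(bytes);
	printf("malloc(%llu) %s\n", (unsigned long long)bytes,
	       p ? "succeeded" : "failed");
	free(p);	// device-side free ignores a NULL pointer
}

int main()
{
	size_t heap = 0;

	// Report the current device malloc heap size.
	cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
	printf("heap limit before: %zu bytes\n", heap);

	// The heap size has to be set before launching any kernel that
	// calls device-side malloc. 64MB is an arbitrary example value.
	cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
	                                     (size_t)64 * 1024 * 1024);
	if (err != cudaSuccess) {
		printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
		return 1;
	}

	cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
	printf("heap limit after: %zu bytes\n", heap);

	// Try one 4MB allocation from a single thread.
	singleAlloc<<<1, 1>>>((size_t)4 * 1024 * 1024);
	cudaDeviceSynchronize();	// wait so the kernel's printf output is flushed
	return 0;
}
```

Keep in mind that the heap is shared by every thread that calls malloc, and blocks that are never freed keep counting against the limit.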

Here you go:

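#include <stdio.h>

__global__ void mallocTest(void)
{
	int mem = 4096;
	void* test;

	// Double the request size until malloc fails.
	do {
		test = malloc(mem);
		if(test == NULL)
			printf("malloc failed at %d bytes\n", mem);
		mem *= 2;
	} while(test != NULL);
}

int main()
{
	mallocTest<<<1, 1>>>();
	cudaDeviceSynchronize();	// wait for the kernel so its printf output is flushed
	return 0;
}

  1. Save the code as malloc.cu
  2. Run "nvcc malloc.cu"
  3. Run "./a.out"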

This is on Jetson Nano. Thanks for the link BTW. It’s a big manual and I hadn’t read that section on heap memory yet as I assumed cudaMalloc would abide by the same rules as device-side malloc.

If I query cudaLimitMallocHeapSize with cudaDeviceGetLimit on the Jetson Nano, it reports 8388608 bytes (8MB), so I believe I have found a bug.