Nsight Systems showing more memory than reality

I am profiling with this command:
nsys profile --trace=cuda,nvtx,cublas-verbose,cusparse-verbose,wddm --cuda-memory-usage=true --gpu-metrics-devices=0 --output=…

My code uses a memory pool created like this:

int deviceIndex = 0;
CHECK_CUDA(cudaGetDevice(&deviceIndex));

cudaMemPoolProps props = {};
props.allocType = cudaMemAllocationTypePinned;
props.handleTypes = cudaMemHandleTypeNone;
props.location.type = cudaMemLocationTypeDevice;
props.location.id = deviceIndex;

CHECK_CUDA(cudaMemPoolCreate(&memPool, &props));

uint64_t threshold = UINT64_MAX;
CHECK_CUDA(cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &threshold));
size_t prewarmBytes = 10ULL * 1024 * 1024 * 1024;
void* pWarmup = nullptr;
CHECK_CUDA(cudaMallocFromPoolAsync(&pWarmup, prewarmBytes, memPool, nullptr));
CHECK_CUDA(cudaFree(pWarmup));
CHECK_CUDA(cudaDeviceSynchronize());

However, the result in Nsight Systems looks like this:

It shows 18GB because the program consumes 8GB plus the 10GB of the memory pool. However, the committed VRAM is only 10GB:

What exactly is the memory usage chart measuring? Is this a bug or a feature? I am using Nsight Systems 2025.

@liuyis can you respond to this?

@jose.perez.cano Is it possible to share the report file?

MBI_002.nsys-rep You can find the report there.

Hi @jose.perez.cano, the memory usage and mempool charts in Nsys are drawn by analyzing the memory-related CUDA API calls, such as cudaMallocFromPoolAsync.

Could you clarify a bit more how this app works? After allocating 10GB from the mempool, does it free them before allocating another 8GB?

After checking the report more closely and revisiting the code example you shared in the original post, I think I understand: the 10GB allocation was freed immediately, before subsequent allocations were requested from the pool.

I think it’s a bug in how we track memory allocation and deallocation: the cudaFree does not seem to be accounted for, so the pool’s utilized size is never reduced. I am able to reproduce the issue locally. I also found that if I switch to cudaFreeAsync rather than cudaFree, the deallocation is accounted for correctly, so the issue seems to be that cudaFree isn’t tracked properly.
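To make the arithmetic concrete, here is a toy model of the accounting described above. It is only a sketch, not Nsight Systems code: the event types, `peakUtilizedBytes`, and `threadTrace` are hypothetical names invented for illustration. It replays the sequence from this thread (10 GiB warm-up, a synchronous free, then 8 GiB of work) with and without counting the synchronous free.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical event record for illustration; not an Nsight Systems structure.
enum class EventKind { Alloc, FreeAsync, FreeSync };

struct MemEvent {
    EventKind kind;
    std::uint64_t bytes;
};

constexpr std::uint64_t GiB = 1ULL << 30;

// Replays a trace and returns the peak "utilized" byte count.
// When trackSyncFree is false, FreeSync events (standing in for cudaFree)
// are ignored, which mimics the behavior reported in this thread.
std::uint64_t peakUtilizedBytes(const std::vector<MemEvent>& trace,
                                bool trackSyncFree) {
    std::uint64_t current = 0;
    std::uint64_t peak = 0;
    for (const MemEvent& e : trace) {
        const bool isFree =
            e.kind == EventKind::FreeAsync ||
            (e.kind == EventKind::FreeSync && trackSyncFree);
        if (e.kind == EventKind::Alloc) {
            current += e.bytes;
        } else if (isFree) {
            current -= std::min(e.bytes, current);  // guard against underflow
        }
        peak = std::max(peak, current);
    }
    return peak;
}

// The sequence from this thread: warm-up, synchronous free, then real work.
std::vector<MemEvent> threadTrace() {
    return {
        {EventKind::Alloc, 10 * GiB},
        {EventKind::FreeSync, 10 * GiB},
        {EventKind::Alloc, 8 * GiB},
    };
}
```

With `trackSyncFree` set to false the model peaks at 18 GiB, matching the chart; with it set to true the peak is 10 GiB, matching the committed VRAM.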

I think for now, you can switch to use cudaFreeAsync to deallocate the initial 10GB warm-up memory, to make the tracking more accurate. I will open a bug internally to track the issue for cudaFree.
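A minimal sketch of that workaround, assuming the pool has already been created with a high release threshold as in the original post. The helper name `prewarmPool` and its error-returning style are illustrative choices, not part of the original code, which uses a `CHECK_CUDA` macro instead.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: allocates prewarmBytes from memPool and frees it
// again with the stream-ordered cudaFreeAsync instead of cudaFree.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t prewarmPool(cudaMemPool_t memPool, std::size_t prewarmBytes,
                        cudaStream_t stream) {
    void* pWarmup = nullptr;
    cudaError_t err =
        cudaMallocFromPoolAsync(&pWarmup, prewarmBytes, memPool, stream);
    if (err != cudaSuccess) {
        return err;
    }

    // Stream-ordered free; the memory goes back to the pool, and the pool
    // keeps it reserved as long as the release threshold is not exceeded.
    err = cudaFreeAsync(pWarmup, stream);
    if (err != cudaSuccess) {
        return err;
    }

    // Wait until both the allocation and the free have completed.
    return cudaStreamSynchronize(stream);
}
```

Passing `nullptr` as the stream keeps the default-stream behavior of the original snippet.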

FYI @skottapalli

We created the memory pool like this:

cudaMemPool_t CreateCudaMemoryPool(size_t prewarmBytes = 0x0, uint64_t threshold = UINT64_MAX)
{
	std::cout << "Creating memory pool...";
	auto start = std::chrono::high_resolution_clock::now();

	cudaMemPool_t memPool;
	int deviceIndex = 0;
	CHECK_CUDA(cudaGetDevice(&deviceIndex));

	cudaMemPoolProps props = {};
	props.allocType = cudaMemAllocationTypePinned;
	props.handleTypes = cudaMemHandleTypeNone;
	props.location.type = cudaMemLocationTypeDevice;
	props.location.id = deviceIndex;

	CHECK_CUDA(cudaMemPoolCreate(&memPool, &props));
	CHECK_CUDA(cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &threshold));

	void* pWarmup = nullptr;
	CHECK_CUDA(cudaMallocFromPoolAsync(&pWarmup, prewarmBytes, memPool, nullptr));
	CHECK_CUDA(cudaFree(pWarmup));
	CHECK_CUDA(cudaDeviceSynchronize());

	auto elapsed = std::chrono::duration<double, std::milli>(std::chrono::high_resolution_clock::now() - start).count();
	std::cout << "done in " << elapsed << " ms" << std::endl;

	return memPool;
}

The reason for allocating a big chunk and immediately freeing it was to have a big contiguous chunk of memory already available. We were afraid that letting the memory pool grow organically would lead to fragmentation, since we don’t know what type of allocator is used under the hood. Since we know beforehand how much VRAM is needed, we simply allocated more than that to make sure that subsequent allocations all reside in that contiguous space. Maybe that does not make sense in the first place if the memory pool is using virtual address spaces and paging, I don’t know.
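One way to check what the pool holds after the warm-up, independently of the profiler chart, is to ask the runtime directly. The sketch below reads the pool's reserved and used byte counts through `cudaMemPoolGetAttribute`. The `PoolStats` struct and `queryPoolStats` function are hypothetical names for illustration.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical container for the two counters queried below.
struct PoolStats {
    std::uint64_t reservedBytes;  // backing memory currently held by the pool
    std::uint64_t usedBytes;      // memory currently handed out to the app
};

// Reads the current reserved and used sizes of memPool.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t queryPoolStats(cudaMemPool_t memPool, PoolStats* out) {
    std::uint64_t reserved = 0;
    std::uint64_t used = 0;

    cudaError_t err = cudaMemPoolGetAttribute(
        memPool, cudaMemPoolAttrReservedMemCurrent, &reserved);
    if (err != cudaSuccess) {
        return err;
    }

    err = cudaMemPoolGetAttribute(
        memPool, cudaMemPoolAttrUsedMemCurrent, &used);
    if (err != cudaSuccess) {
        return err;
    }

    out->reservedBytes = reserved;
    out->usedBytes = used;
    return cudaSuccess;
}
```

Calling this right after the warm-up and again during the main workload should show whether the reserved size stays near the pre-warmed amount while the used size follows the application's live allocations.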

We are also finding that on some machines (laptop RTX 5060Ti) allocation sometimes fails with an out-of-memory error even though the task manager shows the GPU as completely free. Restarting the laptop makes it go away, so it is very difficult to reproduce. We suspect it has to do with not destroying the memory pool: the operating system may not be reclaiming that memory properly if the process is killed before a graceful shutdown. If you have any ideas on what could be happening, they would be greatly appreciated. I am aware that this other issue is almost impossible for you to reproduce, but I mention it just in case you know something.
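Two small diagnostics might help narrow this down. This is a sketch only, and whether it addresses the out-of-memory failures is unknown. `ScopedMemPool` and `logDeviceMemory` are hypothetical names. The first destroys the pool on every normal exit path, and the second logs what the driver reports as free memory, which can be compared with the task manager at startup.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Hypothetical RAII owner: destroys the pool when it goes out of scope.
// This cannot run if the process is killed, only on normal unwinding.
class ScopedMemPool {
public:
    explicit ScopedMemPool(cudaMemPool_t pool) : pool_(pool) {}
    ~ScopedMemPool() {
        if (pool_ != nullptr) {
            cudaDeviceSynchronize();     // let pending work finish first
            cudaMemPoolDestroy(pool_);   // release the pool's memory
        }
    }
    ScopedMemPool(const ScopedMemPool&) = delete;
    ScopedMemPool& operator=(const ScopedMemPool&) = delete;

    cudaMemPool_t get() const { return pool_; }

private:
    cudaMemPool_t pool_ = nullptr;
};

// Prints the free and total device memory as seen by the CUDA runtime.
cudaError_t logDeviceMemory(const char* label) {
    std::size_t freeBytes = 0;
    std::size_t totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err == cudaSuccess) {
        std::printf("%s: %zu MiB free of %zu MiB\n", label,
                    freeBytes >> 20, totalBytes >> 20);
    }
    return err;
}
```

Logging the free memory before creating the pool on an affected machine would show whether the driver already sees less free VRAM than the task manager does.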

I think the CUDA Programming and Performance subforum on the NVIDIA Developer Forums might have more expertise on this. Here we are more limited to Nsight Systems specific issues.