Since the file I plan to read is very large (20 GB), I use multiple streams and asynchronous memory copies to process the data in batches.
I suspect the issue is related to the CUDA memory pool, which would explain why calling cudaFreeAsync
for each batch does not release memory back to the OS.
From the Nsight Systems timeline, I can see that memory usage never decreases.
How can I disable the memory pool, or is there a smarter way to reclaim memory?
Thanks!
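For reference, a minimal sketch of the pattern I'm describing (stream count, batch size, and loop bounds are illustrative, not my real values):

```cuda
#include <cuda_runtime.h>

int main() {
    const int    kStreams    = 4;
    const size_t kBatchBytes = 256u << 20;  // 256 MiB per batch (illustrative)
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&streams[i]);

    // Each batch gets a fresh stream-ordered allocation from the pool.
    for (int batch = 0; batch < 80; ++batch) {
        int   s     = batch % kStreams;
        void* d_buf = nullptr;
        cudaMallocAsync(&d_buf, kBatchBytes, streams[s]);
        // ... cudaMemcpyAsync H2D, kernel launches, cudaMemcpyAsync D2H ...
        cudaFreeAsync(d_buf, streams[s]);  // returns memory to the pool, not the OS
    }

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```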
If you process the data in batches, then just reuse the memory instead of allocating new memory for each batch.
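For example (sketch; stream count and buffer size are placeholders), allocate one buffer per stream up front, reuse it for every batch submitted to that stream, and free only at the end:

```cuda
#include <cuda_runtime.h>

void processAllBatches(cudaStream_t* streams, int kStreams,
                       void** h_src, int numBatches, size_t kBatchBytes) {
    // Allocate once per stream, reuse across batches, free at the end.
    void* d_buf[8];  // assumes kStreams <= 8 in this sketch
    for (int i = 0; i < kStreams; ++i)
        cudaMalloc(&d_buf[i], kBatchBytes);

    for (int batch = 0; batch < numBatches; ++batch) {
        int s = batch % kStreams;
        // Stream ordering guarantees the previous batch on stream s
        // has finished with d_buf[s] before this copy starts.
        cudaMemcpyAsync(d_buf[s], h_src[batch], kBatchBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        // ... launch kernel on streams[s] ...
    }

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaFree(d_buf[i]);
    }
}
```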
If I use multiple streams, I guess I still need to allocate memory for each stream.
I also tried the cudaMemPoolTrimTo API, but it made no difference.
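The trim call I tried looks roughly like this (sketch). One thing worth noting: cudaFreeAsync is stream-ordered, so the memory is only returned to the pool once the free actually executes; trimming before synchronizing finds nothing to release.

```cuda
#include <cuda_runtime.h>

void trimDefaultPool(int device) {
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, device);

    // Ensure all pending cudaFreeAsync calls have executed; until then
    // the freed memory has not been returned to the pool.
    cudaDeviceSynchronize();

    // Release all currently unused pool memory back to the OS.
    cudaMemPoolTrimTo(pool, 0);
}
```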
Not sure why you would not want to allocate memory for each stream. Streams typically run in parallel, and each needs its own memory.
So normally you would free the memory only at the end of the overall data processing.
You could probably also allocate one large block once and index into different regions by stream, but I do not see why that would be better.
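Something like this (sketch; kStreams and kBatchBytes are placeholders as before):

```cuda
#include <cuda_runtime.h>

void partitionByStream(int kStreams, size_t kBatchBytes) {
    // One large allocation, carved into per-stream regions.
    void* d_base = nullptr;
    cudaMalloc(&d_base, kStreams * kBatchBytes);

    for (int i = 0; i < kStreams; ++i) {
        void* region = static_cast<char*>(d_base) + i * kBatchBytes;
        // use 'region' only for work submitted to stream i
        (void)region;
    }

    cudaFree(d_base);  // one free at the end of processing
}
```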