Streams & Malloc/Free

Hi, I’ve run into a problem with running concurrent kernels.
We’re running long calculation kernels on a series of data sets, and we do so concurrently.
Our limiting factor is memory - we fill all devices up to capacity, and launch the kernels.
However, there are still plenty of calculations to run. Our kernels are data-dependent in the sense that some are short and some take longer to complete. Ideally, we would like to get the results of the kernels that have finished, free their allocated memory, and use it for other kernels.

With this back-story, let’s get to the issue: cudaMalloc & cudaFree imply a device synchronization.
Is there a way to utilize the no-longer-needed memory?

I can think of two ways to get around this. One is to keep doing what we do - just wait for all kernels to finish and load the next batch. The other is to make sure every kernel allocates enough space to accommodate any data size. We cannot do this - our largest data sets are in the 500MB range while the average is around 150MB, so reserving 500MB per kernel would leave room for very few concurrent kernels.

Am I missing something?
Is there a way to transfer varying length data to and from a busy device?

You could allocate most of the memory as a single allocation, then do the “memory management” yourself - effectively buffer re-use. This has several advantages: the overhead to “allocate” should be much lower when you handle it yourself, and you can intelligently avoid fragmentation.
And of course, you can do cudaMemcpyAsync to transfer data to and from a busy device, without interrupting anything.
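To illustrate, here is a minimal sketch of that approach (not code from txbob - the pool size, the offset bookkeeping, and myKernel are placeholders I made up): one up-front cudaMalloc, sub-ranges of it handed out per job, and cudaMemcpyAsync plus an event so you know when a sub-range can be reused, all without another cudaMalloc/cudaFree in the hot path.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Placeholder kernel standing in for the real long-running calculation.
__global__ void myKernel(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t POOL_BYTES = 1ull << 30;        // one big pool, e.g. 1 GiB
    char* pool = nullptr;
    CHECK(cudaMalloc(reinterpret_cast<void**>(&pool), POOL_BYTES));  // the only cudaMalloc

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // "Allocate" for one job by picking an offset inside the pool
    // (your own bookkeeping decides which offset is free).
    size_t jobBytes = 150ull << 20;              // 150 MB example job
    size_t offset   = 0;
    float* jobBuf   = reinterpret_cast<float*>(pool + offset);
    size_t n        = jobBytes / sizeof(float);

    float* hostBuf = nullptr;
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&hostBuf), jobBytes));  // pinned => truly async copies

    // H2D copy, kernel, D2H copy all queued in one stream; other streams
    // keep running because no cudaMalloc/cudaFree gets in the way.
    CHECK(cudaMemcpyAsync(jobBuf, hostBuf, jobBytes, cudaMemcpyHostToDevice, stream));
    myKernel<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(jobBuf, n);
    CHECK(cudaMemcpyAsync(hostBuf, jobBuf, jobBytes, cudaMemcpyDeviceToHost, stream));

    // Once this event has fired, the range [offset, offset + jobBytes)
    // can be handed to the next job without any device-wide sync.
    cudaEvent_t done;
    CHECK(cudaEventCreate(&done));
    CHECK(cudaEventRecord(done, stream));
    // ... later: if (cudaEventQuery(done) == cudaSuccess) reuse the range

    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaEventDestroy(done));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(hostBuf));
    CHECK(cudaFree(pool));
    return 0;
}
```

The key point is that once a job’s event has fired, its sub-range simply goes back on your own free list; no device-wide synchronization is involved.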

To perhaps extend on txbob’s post, which hardly needs extension:

You could also demarcate and assign data blocks within your total/global allocation according to the average block size and multiples thereof.
Thus, if the average block size is 150MB, a kernel requiring 140MB would consume 1 block, a kernel needing 240MB would take 2 blocks, and a kernel needing 500MB would take 4 blocks.
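A rough sketch of what that block accounting could look like (the 150MB block size, the BlockPool struct and its free flags are assumptions for illustration, not anything CUDA provides):

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Assumed granularity: one "block" equals the 150 MB average data size.
const size_t BLOCK_BYTES = 150ull << 20;

struct BlockPool {
    char* base = nullptr;              // single cudaMalloc'd region
    std::vector<bool> inUse;           // one flag per 150 MB block

    void init(size_t numBlocks) {      // error checking omitted for brevity
        cudaMalloc(reinterpret_cast<void**>(&base), numBlocks * BLOCK_BYTES);
        inUse.assign(numBlocks, false);
    }

    // Reserve ceil(bytes / BLOCK_BYTES) contiguous blocks; returns a device
    // pointer, or nullptr if nothing fits right now (caller waits for a
    // running kernel to finish and release its blocks).
    void* acquire(size_t bytes) {
        size_t need = (bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
        for (size_t i = 0; i + need <= inUse.size(); ++i) {
            bool allFree = true;
            for (size_t j = 0; j < need; ++j) allFree = allFree && !inUse[i + j];
            if (allFree) {
                for (size_t j = 0; j < need; ++j) inUse[i + j] = true;
                return base + i * BLOCK_BYTES;
            }
        }
        return nullptr;
    }

    // Mark blocks free again once the owning stream's work has completed.
    void release(void* ptr, size_t bytes) {
        size_t first = (static_cast<char*>(ptr) - base) / BLOCK_BYTES;
        size_t need  = (bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
        for (size_t j = 0; j < need; ++j) inUse[first + j] = false;
    }
};

int main() {
    BlockPool pool;
    pool.init(20);                          // 20 blocks = ~3 GB carved up front
    void* a = pool.acquire(140ull << 20);   // 140 MB -> 1 block
    void* b = pool.acquire(240ull << 20);   // 240 MB -> 2 blocks
    void* c = pool.acquire(500ull << 20);   // 500 MB -> 4 blocks
    printf("%p %p %p\n", a, b, c);
    pool.release(b, 240ull << 20);          // those 2 blocks are reusable again
    cudaFree(pool.base);
    return 0;
}
```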

Also, if you know the data requirements of the kernels beforehand, you could perhaps interleave block-heavy and block-light kernels to attain a better mix/blend, and thus better memory occupancy overall.
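A small host-side sketch of that interleaving idea (the Job struct and its block counts are invented for illustration): sort the pending jobs by size and queue them heaviest/lightest alternately.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Job { int id; size_t blocks; };      // hypothetical per-kernel block need

// Order jobs heaviest, lightest, heaviest, lightest, ... so the set of
// kernels in flight at any moment mixes big and small allocations.
std::vector<Job> interleave(std::vector<Job> jobs) {
    std::sort(jobs.begin(), jobs.end(),
              [](const Job& a, const Job& b) { return a.blocks > b.blocks; });
    std::vector<Job> order;
    size_t lo = 0, hi = jobs.size();
    while (lo < hi) {
        order.push_back(jobs[lo++]);              // heaviest remaining
        if (lo < hi) order.push_back(jobs[--hi]); // lightest remaining
    }
    return order;
}

int main() {
    for (const Job& j : interleave({{0, 4}, {1, 1}, {2, 2}, {3, 1}, {4, 3}}))
        printf("job %d needs %zu block(s)\n", j.id, j.blocks);
    return 0;
}
```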