Hi, I’ve run into a problem with running concurrent kernels.
We’re running long calculation kernels on a series of data sets, and we do so concurrently.
Our limiting factor is memory - we fill all devices up to capacity, and launch the kernels.
However, there are still plenty of calculations left to run. Our kernels are data-dependent in the sense that their run times vary: some finish quickly, others take much longer. Ideally, we would like to collect the results of the kernels that have finished, free their memory, and reuse it for the remaining work.
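To make the setup concrete, here is roughly what we do today, boiled down to a minimal sketch (calcKernel, NUM_SETS, and the buffer size are placeholders, not our real code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the real, data-dependent calculation.
__global__ void calcKernel(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int NUM_SETS = 4;        // placeholder count
    const size_t n = 1 << 20;      // placeholder size per data set
    cudaStream_t streams[NUM_SETS];
    float* bufs[NUM_SETS];

    // Fill the device and launch everything concurrently, one stream per set.
    for (int i = 0; i < NUM_SETS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&bufs[i], n * sizeof(float));
        calcKernel<<<(unsigned)((n + 255) / 256), 256, 0, streams[i]>>>(bufs[i], n);
    }

    // Some streams finish long before others. This is where we'd like to
    // free bufs[i] and reuse the memory, but that stalls the running kernels.
    for (int i = 0; i < NUM_SETS; ++i)
        if (cudaStreamQuery(streams[i]) == cudaSuccess)
            printf("data set %d is done\n", i);

    cudaDeviceSynchronize();
    for (int i = 0; i < NUM_SETS; ++i) {
        cudaFree(bufs[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```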
With that backstory, let’s get to the issue: cudaMalloc & cudaFree imply a device synchronization.
Is there a way to utilize the no-longer-needed memory?
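For reference, here is a minimal repro of the stall as we understand it: while a long kernel runs on one stream, freeing an unrelated buffer from the host still blocks until the kernel completes (spin and the cycle count are placeholders):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Busy-loop kernel that keeps the device occupied for ~cycles clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    float *a, *b;
    cudaMalloc(&a, 1 << 20);
    cudaMalloc(&b, 1 << 20);

    spin<<<1, 1, 0, s>>>(2000000000LL);   // runs for a couple of seconds

    auto t0 = std::chrono::steady_clock::now();
    cudaFree(b);                          // frees a buffer no kernel touches...
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    printf("cudaFree returned after %.2f s\n", secs);  // ...yet waits for spin

    cudaFree(a);
    cudaStreamDestroy(s);
    return 0;
}
```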
I can think of two ways around this. One is to keep doing what we do: wait for all kernels to finish, then load the next batch. The other is to have every kernel allocate enough space to accommodate any data size. We cannot do the latter: our largest data sets are around 500 MB, more than three times our 150 MB average, so worst-case allocations would cut the number of kernels we can run concurrently by over a factor of three.
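The first approach looks roughly like the loop below. fillDeviceWithBatch, collectResults, and haveMoreDataSets are hypothetical stand-ins for our queue logic; the point is that cudaDeviceSynchronize makes every batch as slow as its slowest kernel:

```cpp
// Hypothetical helpers standing in for our real queue logic:
//   haveMoreDataSets    -- true while unprocessed data sets remain
//   fillDeviceWithBatch -- cudaMallocs buffers and launches kernels, returns count
//   collectResults      -- copies one finished set's results back to the host
while (haveMoreDataSets()) {
    int n = fillDeviceWithBatch(bufs, streams);
    cudaDeviceSynchronize();    // wait for the slowest kernel in the batch
    for (int i = 0; i < n; ++i) {
        collectResults(i);
        cudaFree(bufs[i]);      // safe now: nothing is running anymore
    }
}
```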
Am I missing something?
Is there a way to transfer varying-length data to and from a busy device?
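To make that last question concrete, this is the call pattern we have in mind: a dedicated stream and pinned host memory, in the hope that the copy can overlap running kernels (devBuf and copySize are placeholders for one finished set’s buffer and length):

```cpp
// Placeholders: devBuf holds a finished kernel's results, copySize its length.
float* hostResult;
cudaMallocHost(&hostResult, copySize);   // pinned memory, needed for async copies

cudaStream_t copyStream;
cudaStreamCreate(&copyStream);

// Would this copy proceed while kernels on other streams are still busy?
cudaMemcpyAsync(hostResult, devBuf, copySize,
                cudaMemcpyDeviceToHost, copyStream);
cudaStreamSynchronize(copyStream);       // waits on the copy only
```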