To get some experience with CUDA, I'm porting some audio-processing functions; for each file to be processed, the dataset the device uploads and processes is only about 40 MB. Since even mobile cards have at least 128 MB of RAM, and desktop cards 512 MB or 1 GB+, several datasets could sit in device memory at once, so I'd like to try speeding things up with some multithreading.
Would it be possible (and more efficient) to upload the next dataset (or the next several) to device memory from a separate host thread, and have the main thread run in a loop, checking for new data on the device and launching the kernel whenever a buffer is ready? A rough sketch of what I mean is below.
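Here's the two-thread, double-buffered loop I'm imagining; just a sketch, not tested code. `processAudio` and `loadFile` are placeholders for my real kernel and file reader, the 40 MB figure is from above, and I'm assuming pinned host memory plus non-default streams are what's needed for the copy to actually overlap the kernel (and CUDA 4.0+ so both host threads can share one context):

```cpp
#include <cuda_runtime.h>

#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

constexpr size_t CHUNK_BYTES = 40u << 20;                 // ~40 MB per file
constexpr size_t N_SAMPLES  = CHUNK_BYTES / sizeof(float);

// Stand-in for my real kernel: halves every sample so the sketch compiles.
__global__ void processAudio(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

// Placeholder for my real file reader.
void loadFile(const std::string &path, float *dst) { /* fread into dst */ }

std::mutex m;
std::condition_variable cv;
int filled = -1;  // index of the device buffer holding fresh data; -1 = none

// Loader thread: read the next file and copy it into the idle device buffer
// while the main thread runs the kernel on the other one.
void loader(const std::vector<std::string> &files, float *const d_buf[2]) {
    float *h = nullptr;
    cudaMallocHost(&h, CHUNK_BYTES);   // pinned, so the async copy can overlap
    cudaStream_t copy;
    cudaStreamCreate(&copy);
    int next = 0;
    for (const std::string &f : files) {
        loadFile(f, h);
        cudaMemcpyAsync(d_buf[next], h, CHUNK_BYTES,
                        cudaMemcpyHostToDevice, copy);
        cudaStreamSynchronize(copy);
        std::unique_lock<std::mutex> lk(m);
        filled = next;                                // hand buffer to main
        cv.notify_one();
        cv.wait(lk, [] { return filled == -1; });     // wait until consumed
        next ^= 1;
    }
    cudaStreamDestroy(copy);
    cudaFreeHost(h);
}

int main() {
    std::vector<std::string> files;  // populated from the input directory
    float *d_buf[2];
    cudaMalloc(&d_buf[0], CHUNK_BYTES);
    cudaMalloc(&d_buf[1], CHUNK_BYTES);
    cudaStream_t exec;
    cudaStreamCreate(&exec);

    std::thread t(loader, std::cref(files), d_buf);
    for (size_t i = 0; i < files.size(); ++i) {
        int idx;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return filled != -1; });
            idx = filled;
            filled = -1;                              // mark buffer consumed
            cv.notify_one();
        }
        processAudio<<<(N_SAMPLES + 255) / 256, 256, 0, exec>>>(d_buf[idx],
                                                                N_SAMPLES);
        cudaStreamSynchronize(exec);
        // cudaMemcpyAsync D2H + saving the results would go here
    }
    t.join();
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
    cudaStreamDestroy(exec);
    return 0;
}
```

The condition-variable hand-off keeps at most one buffer in flight in each direction, which seems like the simplest scheme that could still overlap file I/O and transfers with kernel execution.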
Or, would it be possible to write a normal CUDA program (read data from file, copy to device, run kernel, copy results back to host, save data), and simply launch multiple instances of it? Since I know the size of each dataset (and can query the device's RAM at runtime), I can work out how many instances can run concurrently without exhausting device memory; a sketch of that calculation follows.
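This is roughly the sizing logic I had in mind. `cudaMemGetInfo` is the runtime query; the 40 MB-per-job figure is from above, and the headroom for per-process context overhead is a guess on my part, not a measured value:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);  // current free/total device RAM

    const size_t perJob = 40u << 20;          // ~40 MB dataset per instance
    size_t instances = freeBytes / perJob;

    // Each process also creates its own CUDA context, which costs device
    // memory by itself, so leave some headroom (rough guess).
    if (instances > 1) instances--;

    printf("%zu MB free of %zu MB -> up to %zu concurrent instances\n",
           freeBytes >> 20, totalBytes >> 20, instances);
    return 0;
}
```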
What is the best way to go about this? Or is it even an issue? Since I will have many files to process (hundreds or thousands), I'd like a solution that maximizes the throughput of the whole system, not just of the individual kernels.