Multithreading with CUDA

In order to get some experience with CUDA, I’m working on porting some audio processing functions; for each file to be processed, the data uploaded to and processed on the device is only about 40 MB. Given that most cards have at least 128 MB of RAM (even the mobile versions), if not 512 MB or 1 GB+, I’d like to try to speed things up with some multithreading.

Would it be possible (and more efficient) to upload the next (or next several) datasets to device memory from a separate host thread, and just have the main thread run in a loop, checking for the availability of new data on the device and launching the kernel when necessary?

Or, would it be possible to write a normal CUDA program (read data from file, copy to device, run kernel, copy results back to host, save data), and simply launch multiple instances of the program? Since I know the size of the datasets (and can query the amount of RAM on the device at runtime), I can work out how many concurrent instances can run without exhausting device memory.
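Since the per-dataset size is known, a runtime check with cudaMemGetInfo can bound the number of concurrent instances. A minimal sketch (the 40 MB footprint and the 10% headroom are illustrative assumptions, not numbers from this thread):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query how much device memory is currently free.
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Assumed per-dataset footprint: ~40 MB of input plus room for results.
    const size_t bytesPerDataset = 40u * 1024 * 1024;

    // Leave some headroom for the context and intermediate buffers.
    size_t usable = static_cast<size_t>(freeBytes * 0.9);
    size_t maxConcurrent = usable / bytesPerDataset;

    printf("Free: %zu MB of %zu MB, fits roughly %zu datasets\n",
           freeBytes >> 20, totalBytes >> 20, maxConcurrent);
    return 0;
}
```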

What is the best way to go about this? Or is it even an issue? Since I will have many (i.e. hundreds or thousands) of files to process, I would like to find a solution that maximizes the efficiency of the whole system (not just the individual kernels).

Can’t do that; you’d be sharing resources between different device contexts, which is a no-no. With the runtime API, each host thread gets its own context, so memory allocated in one thread isn’t usable from another.

My suggestion would be to either use the driver API and then do thread migration to share the context (although I’d test the latency to see if it’s worth it) or just go nuts with a stream and async operations. If you’re storing 40 megs per sample, you could queue a lot up and then get significant overlap between memcpy and kernel execution.
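To make the stream/async suggestion concrete, here is a minimal sketch of overlapping host-to-device copies with kernel execution using pinned buffers and two streams. processKernel, the double-buffering scheme, and the buffer handling are placeholders for illustration, not anything from this thread:

```cpp
#include <cstring>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real audio-processing step.
__global__ void processKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 0.5f;   // dummy work
}

void processAll(float** hostSets, int numSets, int samplesPerSet)
{
    const size_t bytes = samplesPerSet * sizeof(float);
    const int numStreams = 2;            // double buffering

    cudaStream_t streams[numStreams];
    float *dIn[numStreams], *dOut[numStreams], *hPinned[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void**)&dIn[s],  bytes);
        cudaMalloc((void**)&dOut[s], bytes);
        // Async copies require page-locked host memory.
        cudaMallocHost((void**)&hPinned[s], bytes);
    }

    for (int i = 0; i < numSets; ++i) {
        int s = i % numStreams;

        // Stage the next dataset in the pinned buffer for this stream.
        // (A real version would consume the previous results before reuse.)
        cudaStreamSynchronize(streams[s]);
        memcpy(hPinned[s], hostSets[i], bytes);

        // Copy in, process, and copy out, all queued on the same stream;
        // work queued on different streams can overlap.
        cudaMemcpyAsync(dIn[s], hPinned[s], bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        int threads = 256;
        int blocks  = (samplesPerSet + threads - 1) / threads;
        processKernel<<<blocks, threads, 0, streams[s]>>>(dIn[s], dOut[s],
                                                          samplesPerSet);
        cudaMemcpyAsync(hPinned[s], dOut[s], bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < numStreams; ++s) {
        cudaFree(dIn[s]);
        cudaFree(dOut[s]);
        cudaFreeHost(hPinned[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

Whether copy and kernel execution actually overlap depends on the device; it’s worth checking the deviceOverlap property from cudaGetDeviceProperties at runtime.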

I think I’m going to do something like this:

Main Thread: Monitor status of the device and maintain an array of handles to arrays on the device. As datasets are completed, new files will be read from disk and copied to arrays on the device, and their handles added to the handle array.

Worker Thread: Run in a loop (and/or triggered by an event when new data is added), taking the first handle from the global handle array. Run the calculations on that dataset, copy the results back to the host, release the device memory where the dataset was stored, and remove the handle from the global list. Then loop if any other handles are present; otherwise stop.
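A rough sketch of that producer/consumer arrangement (names like enqueueDataset and runKernelAndCopyBack are hypothetical, and it assumes both threads share one device context, e.g. via the driver-API context migration mentioned above, so device pointers allocated in one thread are valid in the other):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>
#include <cuda_runtime.h>

// Handle to a dataset that is already resident on the device.
struct Dataset {
    float* dPtr;      // device array
    size_t samples;
};

static std::queue<Dataset>     pending;       // the "global array handle array"
static std::mutex              pendingMutex;
static std::condition_variable pendingCv;
static bool                    doneLoading = false;

// Main thread: read a file, copy it to the device, publish the handle.
void enqueueDataset(const std::vector<float>& hostData)
{
    float* dPtr = nullptr;
    size_t bytes = hostData.size() * sizeof(float);
    cudaMalloc((void**)&dPtr, bytes);
    cudaMemcpy(dPtr, hostData.data(), bytes, cudaMemcpyHostToDevice);

    {
        std::lock_guard<std::mutex> lk(pendingMutex);
        pending.push({dPtr, hostData.size()});
    }
    pendingCv.notify_one();
}

// Called by the main thread once every file has been queued.
void finishLoading()
{
    {
        std::lock_guard<std::mutex> lk(pendingMutex);
        doneLoading = true;
    }
    pendingCv.notify_one();
}

// Worker thread: take the first handle, process it, copy results back, free.
void workerLoop()
{
    for (;;) {
        Dataset ds;
        {
            std::unique_lock<std::mutex> lk(pendingMutex);
            pendingCv.wait(lk, [] { return !pending.empty() || doneLoading; });
            if (pending.empty())
                break;                  // loader finished and queue drained
            ds = pending.front();
            pending.pop();
        }
        // runKernelAndCopyBack(ds);    // placeholder for the real processing
        cudaFree(ds.dPtr);              // release device memory for this dataset
    }
}
```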

I think this solution should speed things up a bit without getting terribly complicated. Does anyone know approximately how long it takes to copy a “semi-large” (say, 64 MB) chunk of data between host and device memory? How much faster is the transfer when copying from page-locked host memory?

64 MB transfer results on a Tesla C1060 over PCIe 2.0:

pageable, DtoH = 2383.3 MB/s
pageable, HtoD = 1944.5 MB/s
pinned, DtoH = 2678.4 MB/s
pinned, HtoD = 3197.5 MB/s

Run the bandwidthTest sample from the SDK. On my machine the transfer is about 800 MB/s pageable and 2.5 GB/s pinned/page-locked.
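For anyone who wants to measure this without the SDK sample, a minimal sketch that times a 64 MB host-to-device copy with CUDA events, once from a pageable buffer and once from a pinned one (copyBandwidthMBps is just an illustrative helper):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time a single host-to-device copy of `bytes` from `hostPtr` and report MB/s.
static float copyBandwidthMBps(void* devPtr, const void* hostPtr, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(devPtr, hostPtr, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 64u * 1024 * 1024;   // the 64 MB case discussed above

    void* dBuf = nullptr;
    cudaMalloc(&dBuf, bytes);

    void* pageable = malloc(bytes);           // ordinary (pageable) host memory
    void* pinned   = nullptr;
    cudaMallocHost(&pinned, bytes);           // page-locked host memory

    printf("pageable HtoD: %.1f MB/s\n", copyBandwidthMBps(dBuf, pageable, bytes));
    printf("pinned   HtoD: %.1f MB/s\n", copyBandwidthMBps(dBuf, pinned,   bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dBuf);
    return 0;
}
```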

Thanks for the replies, all. I’m going to get my main kernel finished ASAP, and then see what I can do to optimize the data loading. If I get any worthwhile speedup out of it, I’ll post back with the results.

Also, in case anyone wants a reference, here are my bandwidth test results:

My machine is a 2.4 GHz Core 2 Duo, 4 GB of PC6400 DDR2 RAM, Windows XP x64, and an EVGA 8800 GT 512 MB (PCIe 1.0).