I am trying to use NI-IMAQ and CUDA to do real-time image acquisition (IMAQ) and processing (CUDA). My first attempt looks like this:
IMAQfunction(&buffer); // grab an image into the buffer with an NI-IMAQ function (CPU)
CUDAfunction(buffer); // process the image with CUDA (GPU)
Now the problem is, the above works fine but it is sequential.
The CUDA processing time + host/device I/O + display() together take only about 2/3 of the IMAQ acquisition time,
so it would be great to make the CUDA and IMAQ stages overlap.
I am thinking about a "buffer ring" model like below:
buffer[N]; // create a ring of N buffers
Parallel() // the following two calls should run in parallel
IMAQfunction(&buffer[i % N]); // grab an image into buffer[i % N] with NI-IMAQ (CPU)
// here i % N gives the index into the ring
CUDAfunction(buffer[(i - 1) % N]); // process the previous image, buffer[(i - 1) % N], with CUDA (GPU)
I have read something about the async functions, but they don’t seem to fully cover this simple requirement. Please give me some advice on how to realize the parallel part.
I am also reading about OpenMP, but I don’t know whether it really helps here.
I don’t know the details of your problem, but you should be able to do what you’ve sketched.
The sequence should be:
1. Read the first data.
2. Open two threads.
3. Thread A runs the CUDA code on the previous data.
4. Thread B grabs the next data.
5. Block until both threads have completed their work.
6. Move the next data to be the input for the CUDA thread.
7. Go back to step 3 until there is no more data to read.
That way you overlap reading the data and its CUDA processing. The time you have to wait in step 5 will be the larger of the step 3 time (computing on the GPU) and the step 4 time (acquiring the next data).
Yes, I guess OpenMP is the easiest; you can also google pthreads (for Linux) or CreateThread (for Windows).
I do think you should first get things working with 2-3 threads in OpenMP (or equivalent) before running thousands of threads on the GPU :)
I’ve been trying something similar with Visual Studio, and wouldn’t like to comment about threads in Linux as I haven’t done it for so long. I found it easiest to start two threads with CreateThread(…): one for capture and the other for analysis.

In the capture thread, use imgGrab(sid, buffer, 1), then ReleaseSemaphore(…), within a loop. In the analysis thread, again in a loop, wait for the semaphore using WaitForSingleObject, then do the CUDA stuff. As I recall, the important issue is the 1 as the last parameter to imgGrab, so that the call returns when the next available frame is transferred, simplifying synchronisation.

My application does not copy the data back to the host each frame, but I have no trouble analysing the 500 fps from a PCIe-1429 using a GTX 285. You may also keep track of the buffer use when you ReleaseSemaphore, and if the count gets close to the maximum available, suspending drawing to the screen helps!
If this is how you have structured the program then I think this is where your problem is. I have loops in the threads, so:
DWORD WINAPI IMAQ_Proc(LPVOID lpParameter) {
    while (running) {                                 // loop inside the thread
        grab(&IMAQ_buffer);                           // get data from the IMAQ functions
        ReleaseSemaphore(hBufferAvailable, 1, NULL);  // hBufferAvailable is a HANDLE and needs global scope
    }
    return 0;
}
Indeed, you don’t want to be creating new threads for every frame. In CUDA there’s an overhead just to start up a new CUDA context, and that overhead is something like 100 ms if I remember (don’t quote me!).
It’s a lot more efficient if you create one CUDA thread and leave it persistent. Have it wait at a semaphore until the capture card data is ready then launch your kernel, then go back and wait again (for both kernel completion and for the next capture data).
If you tend to have a slow kernel, you can overlap the memory transfer with the CUDA compute by using asynchronous memory copies with streams. In fact, it may work well to have two streams that ping-pong: you get your capture frame data ready, then fire off the copy and kernel execution on stream 0; you wait for the next capture data, then fire off the copy and kernel execution on stream 1. This alternation smooths out the kernel-launch overhead, since the device now has a queue of work to keep loading from… it’s not waiting on a CPU loop.
There are of course issues if your kernel execution time starts being too long and you can’t do your compute as fast as you get data. That’s a separate issue and is more about your computational needs and not about your scheduling.