I’m going to design a fairly simple ‘server’-type application to help maximize the calculation performance of my main applications.
I plan for this ‘server’ to run as a separate OS process, accepting calls through the operating system’s inter-process communication facilities.
The server basically consists of a single infinite loop that waits for requests. After a request is received, it schedules a ‘compproc’ (computation procedure) to be performed. Compprocs are identified by a type identifier; each may accept one or more arrays of data (e.g. 100 int elements) and produce one or more arrays of data as well. Besides that, each compproc is first initialized into a known state, and this state can be adjusted between compproc executions via a special ‘state switching’ function. The compproc itself also updates this state during its execution, so a compproc’s state is persistent across executions. This makes streamed processing possible.
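To make this concrete, here is roughly the shape I have in mind for a compproc. All names here are just illustrative, not a final design:

    typedef struct {
        int    type_id;     /* identifies which computation to run       */
        float *state;       /* persistent device-side state between runs */
        int    state_len;
        float *input;       /* one input array shown; could be several   */
        int    input_len;
        float *output;      /* one output array shown; could be several  */
        int    output_len;
    } CompProc;

    /* One kernel per compproc type. State persists across launches
       simply because it lives in device memory. */
    __global__ void compproc_type0(float *state, const float *in,
                                   float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i]   = in[i] + state[i];   /* e.g. streamed accumulation */
            state[i] = out[i];             /* update persistent state    */
        }
    }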
What is the best way to organize such calculations? I’m thinking of creating an additional ‘computation’ OS thread on the server which will execute the scheduled CUDA kernels sequentially, with each kernel processing a single compproc (so efficiency will depend on how the compproc is implemented, but that is not the question right now). A sketch of that loop follows below.
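Roughly, I picture the computation thread like this. This is only a sketch; pop_request(), reply(), lookup_compproc() and the Request type are hypothetical queue plumbing:

    void computation_thread(void)
    {
        for (;;) {
            Request  *req = pop_request();               /* blocks until a request arrives */
            CompProc *cp  = lookup_compproc(req->type_id);

            /* Copy this request's input into device memory. */
            cudaMemcpy(cp->input, req->host_input,
                       cp->input_len * sizeof(float), cudaMemcpyHostToDevice);

            /* Launch the kernel for this compproc type. */
            int threads = 128;
            int blocks  = (cp->input_len + threads - 1) / threads;
            compproc_type0<<<blocks, threads>>>(cp->state, cp->input,
                                                cp->output, cp->input_len);

            /* Copy the result back; this also waits for the kernel. */
            cudaMemcpy(req->host_output, cp->output,
                       cp->output_len * sizeof(float), cudaMemcpyDeviceToHost);
            reply(req);
        }
    }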
From the CUDA examples I have concluded that each <<< >>> invocation executes a kernel and waits for its execution to finish. Is there a way to know when execution finishes, i.e. by performing some polling, or even by getting an OS event? Also, knowing the GPU architecture from the CUDA manual, I expect at least 12 kernels (compprocs) to run simultaneously on an 8800 GTS.
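By polling I mean something along these lines, assuming the runtime’s event API (cudaEventQuery) can be used this way; do_other_host_work() is hypothetical:

    cudaEvent_t done;
    cudaEventCreate(&done);

    compproc_type0<<<blocks, threads>>>(state, in, out, n);
    cudaEventRecord(done, 0);                 /* mark the end of this launch */

    /* Poll for completion instead of blocking the thread. */
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        do_other_host_work();                 /* e.g. accept more requests */
    }
    cudaEventDestroy(done);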
Is there a way to parallelize the processing differently? For example, taking advantage of the fact that each compproc is basically unique and does not interfere with any other compproc’s data or state.
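The only alternative I can think of so far: since compprocs are independent, batch several of them into a single launch, one thread block per compproc, so that all multiprocessors have work. A sketch, assuming the batched compprocs run the same code:

    /* Block b processes compproc b; CompProc is the descriptor sketched
       above, copied to device memory as an array. */
    __global__ void compproc_batch(CompProc *procs)
    {
        CompProc *cp = &procs[blockIdx.x];
        for (int i = threadIdx.x; i < cp->input_len; i += blockDim.x) {
            cp->output[i] = cp->input[i] + cp->state[i];
            cp->state[i]  = cp->output[i];
        }
    }

    /* Host side: one block per compproc. */
    /* compproc_batch<<<num_procs, 128>>>(d_procs); */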
It also makes me a bit sad that the kernels in the CUDA examples are very simple, and I do not see any sophisticated way to schedule their execution besides writing a host function like runTest(); that means I will lose parallel execution across the GPU’s multiprocessors. <<< >>> seems to be the only way. So, at the moment, I won’t be able to decompose my compprocs into similar small kernels for quicker execution.