'Computations server' application design advice

I’m going to design a basic, fairly simple ‘server’-type application that may help me maximize the calculation performance of my main applications.

I plan for this ‘server’ application to run as a separate OS process, accepting calls via the operating system’s inter-process communication facilities.

This server basically consists of a single infinite loop that waits for requests. After a request is received, it schedules a ‘compproc’ (computation procedure) to be performed. Compprocs are usually identified by some type identifier, may accept one or more arrays of data (e.g. 100 int elements), and may produce one or more arrays of data as well. Besides that, each compproc is first initialized into a known state, and this state can be adjusted between compproc executions via a special ‘state switching’ function. The compproc itself also updates this state during its execution. This means that a compproc’s state is persistent between executions, which makes streamed processing possible.
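A minimal host-side sketch of what such a compproc interface could look like. All names (`CompprocState`, `compproc_init`, etc.) are illustrative assumptions, not an existing API; the point is only to show the persistent state that survives between executions:

```c
#include <string.h>

/* Hypothetical compproc state -- illustrative names only. */
typedef struct {
    int accumulator;  /* persists across executions, enabling streamed processing */
    int gain;         /* adjustable between executions via the state-switching call */
} CompprocState;

/* Initialize the compproc into a known state. */
void compproc_init(CompprocState *s) {
    s->accumulator = 0;
    s->gain = 1;
}

/* 'State switching': adjust the state between executions. */
void compproc_set_gain(CompprocState *s, int gain) {
    s->gain = gain;
}

/* One execution: consume an input array, produce an output array,
   and update the persistent state as a side effect. */
void compproc_run(CompprocState *s, const int *in, int *out, int n) {
    for (int i = 0; i < n; i++) {
        s->accumulator += in[i];
        out[i] = in[i] * s->gain + s->accumulator;
    }
}
```

Calling `compproc_run` twice on the same state continues the stream where the previous block left off, which is exactly the behavior the server would rely on.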

What is the best way to organize such calculations? I’m thinking of creating an additional ‘computation’ OS thread on the server which will execute scheduled CUDA kernels sequentially. Each kernel will process a single compproc (thus efficiency will depend on how the compproc is implemented - but that is not the question right now).

From the CUDA examples I have concluded that each <<< >>> invocation executes a kernel and waits for its execution to finish. Is there a way to know when execution finishes, i.e. perform some polling, or even get an OS event? Also, knowing the GPU architecture from the CUDA manual, I’m expecting to have at least 12 kernels (compprocs) running simultaneously on an 8800 GTS.

Is there a way to parallelize processing in some different way? For example, knowing that each compproc is basically unique and does not interfere with other compprocs’ data or state.

It also makes me a bit sad that the kernels in the CUDA examples are very simple, while I do not see any sophisticated way to schedule their execution besides writing a host function like runTest() - and that means I will lose parallel execution on the GPU’s multiprocessors. <<< >>> seems to be the only way. So, at the moment I won’t be able to decompose my compprocs into similar small kernels for quicker execution.

As you said, the kernel invocation is a blocking operation. You have to implement the signaling to the pool controller manually, via semaphores or similar, in the host thread that calls the CUDA kernel.
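A sketch of that signaling scheme with pthreads: a worker thread makes the blocking call, then flags completion through a condition variable. Here `run_kernel()` is just a placeholder standing in for a real `func<<<grid, block>>>(...)` launch:

```c
#include <pthread.h>

/* Worker thread performs the blocking 'kernel' call, then signals the
   server loop. run_kernel() is a stand-in for the real CUDA launch. */
static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv = PTHREAD_COND_INITIALIZER;
static int done   = 0;
static int result = 0;

static int run_kernel(int x) {        /* placeholder for func<<< >>>(...) */
    return x * 2;
}

static void *worker(void *arg) {
    int r = run_kernel(*(int *)arg);  /* blocks until the 'kernel' finishes */
    pthread_mutex_lock(&lock);
    result = r;
    done = 1;
    pthread_cond_signal(&done_cv);    /* notify whoever is waiting */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* The server loop can wait here (or poll 'done' with trylock instead). */
int wait_for_result(void) {
    pthread_mutex_lock(&lock);
    while (!done)
        pthread_cond_wait(&done_cv, &lock);
    int r = result;
    pthread_mutex_unlock(&lock);
    return r;
}
```

The same pattern scales to one worker thread per scheduled compproc, with the condition variable replaced by whatever event mechanism the server’s OS offers.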

That won’t work. You can only have one kernel in flight. Or did you mean you have 12 blocks?


What is the point of CUDA then? If one kernel == one function… There are too few applications that can express their problem as a single function, even if that function is divided into several parts. I need several different functions working in parallel, independently.

You can run the kernels sequentially. You don’t need to download intermediate results, so data access is from device memory, which is much faster than main memory. Or you run two different kernels on two cards if you don’t need huge bandwidth between them.

The main goal of CUDA is to provide a computation facility for massively parallel problems - that is, if you need to run the same routine on a large dataset. If you don’t have a lot of data, or your problem has a deep pipeline structure, CUDA is not for you (like any other GPU approach).


That’s a pity. And no, I do not want to use two cards to perform a couple of calculations.

I think there is a problem with CUDA - not my attitude or my desires.

It has been proven that what I want can be accomplished given a somewhat different programming architecture.

For example, there are DSP plug-ins (VST plug-ins for pro audio environments) that run even on older graphics cards, giving enormous performance for free. And these plug-ins are exactly independently parallel. The problem with their implementation is that they use shaders, which are hard to program and which have a limited code size.

I wonder now what the CUDA developers were thinking. In almost any environment (be it pro audio or video production) we need independently parallel computations - there is little need for solving a single problem. I.e. these calculations are not exactly scientific in their proportions.

Why not just run 12*N parallel functions on the processors of the multiprocessors, even sacrificing the individual SIMD architecture of the processors, but getting independence of calculations? It is still a lot of FLOPS.

But that is not the initial problem description you gave in your first post. The DSP plugins are massively parallel operations on data. That will work just fine with CUDA. You should be able to write much better routines using CUDA instead of Cg.

Btw, these effect chains do run sequentially on GPUs to emulate a pipeline.


You should keep in mind that NVIDIA (in this context) is in the business of making graphics chips, and CUDA is a balance between an efficient architecture that does graphics and one that also makes general-purpose computation possible. The goal of CUDA is not to be the best multithreaded architecture. For that, you should take a look at something like the Niagara series of CPUs that Sun has been working on:


(Although the T1 only has 1 FPU for all 8 cores, they are definitely going in the direction of more.)

Don’t think of CUDA as N independent CPU cores, but closer to N parallel floating point units. SIMD architectures on normal CPUs can operate on 4 or 8 floats at once, but GeForce 8800 GTX can operate on 128 at once. That by itself has the potential to be very useful for some problems. CUDA actually goes one step further, and bundles floating point units together in groups (8 on the current cards) with the ability to communicate within the group as well. This opens up a whole other set of possible scatter-gather algorithms which would be very awkward on a pure SIMD architecture. So maybe the best description of CUDA is “a compromise between a SIMD coprocessor and a multicore CPU.”

For me and many others, CUDA has been a huge benefit, because for the right kind of problem, CUDA has more FLOPS per $ than anything else I’ve seen. For the wrong kind of problem, CUDA is no help, as you have found. It depends on your needs, really.

I do not see how DSP plug-ins can be implemented with CUDA. The ‘server’ approach I described initially can be used to process blocks of signal data, and in fact I had in mind the possibility of performing the streamed DSP computations which plug-ins do. From your replies I’ve concluded that such an architecture/approach is impossible to implement with CUDA, at least with any substantial performance boost. (Running a single thread on the GPU, sequentially, won’t make any difference - it would be more cost-effective to upgrade from my current 2-core CPU to a 4-core one.)

I actually hoped that CUDA could be usable for DSP plug-ins, and I wonder why it is not so (especially considering that older shader programs work just fine).

I did not mean effect chains. Effects run in parallel - at least from the programmer’s perspective (I do not know how the gfx driver handles shader jobs, but it seems they are executed in parallel).

Sure thing - specialized CPUs can be better. But they are not cheap, while CUDA, as you say, is ‘cheap GFLOPS’. So I do not see how my request contradicts Nvidia’s goals. They want a wider market. I can envision them getting hundreds of thousands of pro users just by producing a USABLE number cruncher. Science (and “password cracking”) is a rather small market for such solutions - Sun will do better there. Nvidia should direct their product at audio and video production, which has a very wide user base. I do not see any ‘wide’ benefit behind CUDA otherwise (while it’s advertised as being ‘wide’).

So you think this graphics architecture is not usable for video processing :no:

Maybe you should revise your system design.


It is a real-time system design. It is in no way a ‘run and wait’ type of calculation. Besides that, I would like to see a way to transform shader-like programs to CUDA efficiently. They can be easily converted, of course, but I do not see a way to execute them efficiently, considering that kernel execution is blocking and is only internally (dependently) parallel.

Also, it would be ideal if CUDA offered kernel execution inside device functions, and if device functions could be executed directly from the host (via an API call, from a C/C++ application), without blocking the host code flow, with results returned via some OS event mechanism (including pointers to the resulting data). This wouldn’t break much of the CUDA ideology (i.e. no guarantee of the order of execution of threads/functions). In the worst case, CUDA could run these function calls sequentially.

I agree that a non-blocking API for CUDA would be very handy. When I want non-blocking behavior, I have to spawn a thread to call the blocking CUDA kernels. This would be just a minor annoyance, but the host thread calling CUDA spins at 100% CPU, which would be a problem if I did not have a dual-core CPU. (Fortunately I do.)

I can’t really say what NVIDIA’s design constraints were, but I suspect that it had something to do with transistor count. :)

Two things:

  1. You CAN execute multiple functions concurrently with CUDA. All you have to do is include the code for all the functions in one kernel file and use conditionals (based on thread/block ID) to choose where each particular function gets executed. There will be no performance penalty as long as all threads in a given warp execute the same code (i.e. do not diverge). So the CUDA restriction is that you must dedicate at least one warp per function if you want maximum performance (see Section of the Programming Guide).

  2. Since every OS for which CUDA is supported provides threading, there’s no real reason to have non-blocking kernel invocations. If you want to overlap CPU and GPU work, use a thread for each. Even with a single-core CPU, the CPU thread will be swapped in by the OS, since the CUDA thread will block (as opposed to going into a busy spin-loop) waiting for the kernel to return. Single-threaded programming is quickly becoming antiquated with the proliferation of multi-core CPUs anyway.
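Point 1 can be illustrated with a serial CPU emulation of the dispatch idea. Each iteration of the loop below stands in for one CUDA block (on the device they would run in parallel and the conditional would read `blockIdx.x`); `funcA`/`funcB` are hypothetical stand-ins for two unrelated functions compiled into one kernel:

```c
/* Serial CPU emulation of block-ID dispatch: two different 'functions'
   selected per block. Keeping the branch at block (or at least warp)
   granularity is what avoids intra-warp divergence on the GPU. */
#define NUM_BLOCKS 4

static void funcA(int *out, int i) { out[i] = i + 100; }  /* first compproc  */
static void funcB(int *out, int i) { out[i] = i * -1; }   /* second compproc */

void dispatch_all(int *out) {
    for (int blockId = 0; blockId < NUM_BLOCKS; blockId++) {
        if (blockId < NUM_BLOCKS / 2)  /* in CUDA: if (blockIdx.x < ...) */
            funcA(out, blockId);
        else
            funcB(out, blockId);
    }
}
```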


What if I do not even know what kind of function is going to be executed? My ‘server’ will accept cubin files ‘as is’, with some additional header data. I would like to execute functions from these cubin files - I cannot recombine them into a single kernel. Besides that, the solution you are offering is absolutely awkward from a software design perspective (it is impossible to design efficient software without knowing how it is REALLY going to be executed). CUDA’s existing approach works ONLY for problems formulated as a single iterative function. Everything else is going to be awkward and give unexpected performance results.

Sure, this is possible as well. But some say it raises core load by 100%??? How can I run 10 CUDA threads then? Besides, I’m not talking about single-threaded programming - I even meant fiber-based multi-threaded programming. I just do not want to do things CUDA could do itself (schedule function execution and invoke an OS event on completion - that would be close to programming with fibers).

Warp-aligned conditional divergence only gets you so far, since there doesn’t appear to be a way to synchronize sub-block groups of threads. (After the warps diverge, they must have exactly the same synchronization pattern, which might not be the case for two arbitrary “functions”.)

I think it would be great if there were some mechanism to synchronize both in smaller groups and across the entire block…


I think it would be better if Nvidia added to CUDA arbitrary device function calls from the host (via the device API), and the ability to execute kernels from device functions. That way we could get both the flexibility of calling separate functions and the benefit of massive dependently-parallel calculations.

It could be even better if kernel execution could be nested. Since it seems that kernel execution is basically the parallel execution of tens of copies of the same C function (hence Nvidia’s reference to SIMD architecture), each copy receiving its own set of ‘index’ parameters, such nesting could be implemented. (But I do not get why Nvidia made everything so hard to understand - those blocks/warps/threads - index assignment could probably be optimized as well, and made a bit more ‘human’ in appearance.)

I’m pretty sure such extensions wouldn’t be much of a pain to implement.

added: To be a bit more specific about the ‘human’ appearance. I would like to formulate functions used as kernels as being single-loop, double-loop, triple-loop, quadruple-loop, etc. I do not see why the ‘hardware’ block/warp/thread subdivision was used. Given such a formulation, one would not even need to define for() constructs. The loop nesting count could be meta information for the function, and loop indices could be accessed via a hard-coded index array: index[ 0 ] would provide the index of the outermost loop, index[ 1 ] the index of the next loop level, etc. (Loop counts - count[ 0 ], count[ 1 ], etc. - are provided as arguments when kernel execution is requested.)
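The proposed ‘loop-index’ view can be sketched as a small mapping from one flat index (which, under the hood, could be derived from block/thread IDs) back to nested loop indices. The names `index` and `count` follow the post and are purely illustrative:

```c
/* Recover nested loop indices from a single flat (row-major) index,
   given per-level loop counts. index[0] is the outermost loop level,
   index[levels-1] the innermost. */
void flat_to_loop_indices(int flat, const int *count, int levels, int *index) {
    for (int level = levels - 1; level >= 0; level--) {
        index[level] = flat % count[level];
        flat /= count[level];
    }
}
```

For example, with count[ 0 ] = 3 and count[ 1 ] = 4, flat index 7 corresponds to the outer index 1 and inner index 3, exactly as a double for() loop would enumerate it.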

However, if a function requires additional processing before the start/after the end of each sub-loop, this should be organized as a kernel with the following structure: 1) do initial processing, 2) process the sub-kernel, 3) do additional processing. So, for simplicity, I think kernel functions could be defined inline as well, without a reference to some symbol like ‘func<<< >>>’.

Since nested kernel execution should block execution of the kernel code it was called from, the processor executing this code should be ‘released’, and after the nested kernel finishes, code execution should continue at the point following the nested kernel part. This way all processors can be utilized, and those used to execute base kernels will be used to execute nested kernels (at least this is how I understand the G80 architecture). So kernel nesting won’t create any ‘waiting’ processors - it’s just a problem of processor allocation.

And a follow-up question: is the ‘warp’ size of 32 a G80 architectural choice, or a compiler/driver choice? I mean, if a multiprocessor has 8 processors, does that mean each multiprocessor can run 8 different instruction sequences, each consisting of 4-operand SIMD instructions? That is, is it possible to lower the warp size to 4 so that the processors are maximally independent of each other? If not, what is the reason for mentioning 8 processors in a multiprocessor when they do not each have their own ‘instruction pointer’?

(Sorry if some of that is laughable - I just want to be helpful, as I’m very interested in offering software solutions based on CUDA - this is going to be a major leap for me and my customers.)

bump ;)

The warp size is a property of the hardware. The GeForce 8800 GTX is composed of 16 multiprocessors. Each multiprocessor has 1 instruction decoder and 8 scalar processors (note they aren’t 4 component vector processors like previous GPUs). All 8 scalar processors must execute the same instruction each clock cycle, but they can operate on different data. Due to the pipelining of the processors, most instructions take 2 clock cycles (like multiply-add). Additionally, the processors run at twice the clock frequency of the multiprocessor. So, to keep everything busy, the warp size is set to 8 x 2 x 2 = 32. Thus, for every 2 clock cycles of the multiprocessor, 32 threads can be serviced.

The individual processors on the G80 are unusual because they are more flexible than a SIMD unit from a normal CPU. They have local registers and can scatter and gather data in ways that something like SSE would not allow. However, they are not independent execution cores with separate instruction pointers. Divergent code paths in threads running on the same multiprocessor will destroy some of the parallelism, as the thread scheduler will only be able to run threads on the same code path in parallel. But if divergence occurs only at the multiprocessor level, there is no performance penalty, since each multiprocessor has its own instruction pointer.

One could argue that calling them “processors” is a little misleading, but I’m not sure what you would call them, since they are more than just standard FPUs.