'Computations server' application design advice

Sure, ‘processor’ is a misleading term then, as is the term ‘thread’ used throughout CUDA. ‘Arithmetic unit’ would be a better term. I take ‘processor’ to be more or less a synonym for ALU (arithmetic logic unit), so free instruction-pointer branching is a required attribute of a ‘processor’. At least, I think this is the widespread understanding (i.e. there are no processors without branching, and an ALU is usually expected to have branching instructions that follow its logic instructions).

Anyway, NVIDIA is the master here, so with any request I can only hope. :)

After a bit of thought, I think the best term for the current ‘processor’ would be ‘co-processor’, because it basically works like the FPU in old x86 processors.

So now I can think of the 8800 GTS as a graphics card with 12 processors, each having 8 co-processors.

Whether or not “processor” is the right name for each individual processing unit, we think “multiprocessor” is a good name for each group of processors that execute a warp in parallel.


Fine, what about the main questions raised in this thread?

I think you will get the best results by sticking to the data-parallel programming model, since the GPU is a data-parallel processor. Attempting to shoehorn other programming models onto it is counter-productive. Expressing your computations in a data-parallel form also allows them to scale more easily to future hardware.

Work queues and other task-parallel constructs result in non-uniform parallel execution and are therefore problematic on a machine that is effectively SPMD. A work pipeline, on the other hand, could be very efficient on a GPU as long as each stage is highly data parallel.
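As a rough illustration of the pipeline shape described above, here is a minimal CPU-side sketch in plain Python. It models only the structure, not GPU execution: `launch`, `scale_stage`, `offset_stage`, and `run_pipeline` are hypothetical names invented for this example, and the loop inside `launch` stands in for what a GPU would run as many parallel threads.

```python
# CPU-side model of a data-parallel pipeline. All names are illustrative
# assumptions; none of them are part of CUDA or any real API.

def launch(kernel, n, *args):
    """Run `kernel` once per index 0..n-1, the way one thread per element would.

    A GPU would execute these calls in parallel; this loop only models the
    fact that every index runs the same code on different data.
    """
    for i in range(n):
        kernel(i, *args)


def scale_stage(i, src, dst, factor):
    # Stage 1: every element does identical work, with no data-dependent branching.
    dst[i] = src[i] * factor


def offset_stage(i, src, dst, offset):
    # Stage 2: consumes the complete output of stage 1, again uniformly.
    dst[i] = src[i] + offset


def run_pipeline(data, factor, offset):
    """Apply the two stages in order, one 'launch' per stage over all elements."""
    n = len(data)
    scaled = [0.0] * n
    result = [0.0] * n
    launch(scale_stage, n, data, scaled, factor)
    launch(offset_stage, n, scaled, result, offset)
    return result
```

Each stage touches every element in the same way, so the amount of parallel work grows with the data size rather than with the number of distinct tasks. That is the property that separates this pipeline from a work queue, where different workers would be running different code at the same time.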

We are confident that the GPU is a highly capable parallel processor and that a programming model that exposes the hardware for what it is provides savvy programmers with the tools they need to get good performance out of it.


OK, but do you ever plan to allow parallel execution of 12-16 different kernels? This is the only feature I really need (after a bit of further analysis on my side).

Also, kernel invocation overhead could be lowered a bit. (I’ve read a post here saying that CUDA allows only something like 50,000 kernel invocations per second, and that seems to be without any computation.) This may limit my ability to use the processing power efficiently, since I need to solve a lot of ‘local’ problems, each of which requires a separate kernel.
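To put the quoted figure in perspective, here is a back-of-the-envelope sketch. It assumes the 50,000 invocations per second number is accurate and that launch overhead does not overlap with computation; both are assumptions, not measurements. The function names are made up for this example.

```python
def launch_overhead_us(launches_per_second):
    """Per-launch overhead in microseconds implied by a maximum launch rate."""
    return 1_000_000 / launches_per_second


def compute_fraction(work_us, overhead_us):
    """Fraction of wall time spent computing when each launch does `work_us` of work.

    Assumes launches run back to back and overhead is not hidden behind computation.
    """
    return work_us / (work_us + overhead_us)


# 50,000 launches per second implies 20 microseconds of overhead per launch.
overhead = launch_overhead_us(50_000)
```

Under these assumptions, a kernel that does 5 microseconds of useful work per launch spends only a fifth of its time computing (`compute_fraction(5, overhead)` is 0.2), while a launch that does 80 microseconds of work reaches 0.8. So the overhead matters mostly when each ‘local’ problem is very small.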