Tesla Concerns & mass processing Concerns for data throughput management

WilfredGuerin · July 19, 2007, 12:29am

I am highly concerned that the published design for Tesla will have severe problems with data access and vector modeling processes, despite the PCIe interface to other systems.

The dominant concern is that even with a gigabyte of memory per processor, any external data must first be transported into the shared memory space before it can be used.

When analysing diverse sensor data sets, which these GPU units are good for, there is a major problem with fetching the data that each pass of an iterative analysis asks for.

I will fabricate a few examples:

Reverse rendering: assume a bank of GPUs has generated one large image, and many others from different vantage points, from a static data set. Even with known camera vectors, the correlating pixel data must be compared to fabricate potential 3D surface locations. When the number of generated images is vast, the first iteration of processing has no known space partitioning of the image data. To check the integrity of your rendering method, you must reverse model the scene and identify the numerical error between the render and the original 3D data. With the number of processes needed being an exponent of the render process, fetching pseudo-random image data from a storage medium (maxing out PCIe) means the processors must wait on data. Is this a design flaw, or a predictable latency that we have not found specs for?

When rendering a large number of concurrent visualization terminals, such as core distributed video feeds for classroom education, there is high affinity among the requested primitive data elements to be rendered (they are easily stored in local memory and thus shared). However, the access of user-centric data from a realtime core data repository, and more importantly the export of frame data (and sequence compression thereof), requires a backplane pipe out of the rendered-to local memory that is significantly larger than the bandwidth required to render simple 3D elements on a mostly blank background. This implies that when operating at full speed, with the maximum number of renderable terminals, the exit bandwidth will prevent a large percentage of frames from ever exiting the system, let alone being piped through to the appropriate protocol handler and out to wire.
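The exit-bandwidth concern above can be put into rough numbers. The following is only a back-of-envelope sketch, not a measurement: the link figure (about 4 GB/s in one direction, the theoretical peak of a PCIe 1.x x16 slot), the 1024x768 resolution, 4 bytes per pixel, and 30 frames per second are all assumptions chosen for illustration, and the function names are hypothetical. It ignores compression, protocol overhead, and contention with inbound data.

```python
# Back-of-envelope frame export budget. All figures are illustrative assumptions.

# Assumed: ~4 GB/s theoretical peak in one direction (PCIe 1.x x16).
LINK_BYTES_PER_SEC = 4_000_000_000


def frame_export_bytes_per_sec(terminals, width, height, bytes_per_pixel, fps):
    """Uncompressed bytes per second that must leave the card for all terminals."""
    return terminals * width * height * bytes_per_pixel * fps


def max_terminals(link_bytes_per_sec, width, height, bytes_per_pixel, fps):
    """Largest whole number of terminals whose raw frames fit through the link."""
    per_terminal = frame_export_bytes_per_sec(1, width, height, bytes_per_pixel, fps)
    return link_bytes_per_sec // per_terminal


def delivered_fraction(link_bytes_per_sec, demand_bytes_per_sec):
    """Fraction of frame data that can exit when demand exceeds the link."""
    if demand_bytes_per_sec <= link_bytes_per_sec:
        return 1.0
    return link_bytes_per_sec / demand_bytes_per_sec


if __name__ == "__main__":
    limit = max_terminals(LINK_BYTES_PER_SEC, 1024, 768, 4, 30)
    demand = frame_export_bytes_per_sec(100, 1024, 768, 4, 30)
    print("terminals that fit:", limit)
    print("fraction delivered at 100 terminals:",
          delivered_fraction(LINK_BYTES_PER_SEC, demand))
```

Under these assumed figures, about 42 uncompressed 1024x768 feeds at 30 frames per second would saturate the link, so any render capacity beyond that point produces frames that cannot leave the card in time. Compressing on the GPU before export would change the picture considerably.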

An additional concern, also related to distributed student education and collaboration, is the ability to process their interface images in realtime. We have units that handle imaging analysis at camera frame rate, but for mass collaboration (conference environments) it is faster to pipe video to a centralized analytic processor, which also happens to generate the visualization after processing, than to distribute the entire thing. Even with multicast of common data and best-case processing (CPU stacks on HTT with Fibre Channel optical links between cluster and MPP units), it is a problem to handle enough throughput to correlate human interface, knowledge management, and mass representation without specialized processing. The GPU option handles most of the issue, and we are using “many” of them now, but getting data in and out of their isolated memory and to wire efficiently is a major problem.

It appears the current designs also have these shortcomings, as indicated by the design of Tesla and the current chips and boards.

Please advise on the situation. I hope others who have explored these topics will comment in this forum as well.

-Wilfred L. Guerin
WilfredGuerin@gmail.com

Jeff_hagen · July 19, 2007, 5:23pm

I was under the impression from other forum posts that an upcoming release of CUDA would allow data transfer while a kernel is running. This ability, in conjunction with the locks in compute capability 1.1 cards, would solve this to the extent that all of the PCIe bandwidth would be available without stalling the processing elements.
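To make the effect of overlapping transfer and compute concrete, here is a minimal timing model, assuming the data is split into equal chunks and that transfer and compute each take a fixed time per chunk. It is a sketch of the general double-buffering idea, not a description of how CUDA implements it; the function names and the example times are hypothetical.

```python
# Simple two-stage pipeline model: transfer a chunk, then compute on it.
# Times are in arbitrary units; all values are illustrative assumptions.


def serial_time(chunks, transfer_time, compute_time):
    """Total time when each chunk is transferred and then computed, one at a time."""
    return chunks * (transfer_time + compute_time)


def overlapped_time(chunks, transfer_time, compute_time):
    """Total time when the next chunk is transferred while the current one computes.

    The first transfer and the last compute cannot be hidden; every step in
    between is paced by the slower of the two stages.
    """
    if chunks == 0:
        return 0
    steady_state = (chunks - 1) * max(transfer_time, compute_time)
    return transfer_time + steady_state + compute_time


if __name__ == "__main__":
    # Compute-bound case: transfers hide behind the kernel.
    print(serial_time(10, 2, 3), overlapped_time(10, 2, 3))
    # Transfer-bound case: the processors still wait on the link.
    print(serial_time(10, 3, 2), overlapped_time(10, 3, 2))
```

In this model, overlap removes the stall only when the compute time per chunk is at least the transfer time per chunk. If the link is the slower stage, the processing elements still wait on data, which is the case the original post worries about.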

When rendering a large number of concurrent visualization terminals, such as core distributed video feeds for classroom education, there is high affinity among the requested primitive data elements to be rendered (they are easily stored in local memory and thus shared). However, the access of user-centric data from a realtime core data repository, and more importantly the export of frame data (and sequence compression thereof), requires a backplane pipe out of the rendered-to local memory that is significantly larger than the bandwidth required to render simple 3D elements on a mostly blank background. This implies that when operating at full speed, with the maximum number of renderable terminals, the exit bandwidth will prevent a large percentage of frames from ever exiting the system, let alone being piped through to the appropriate protocol handler and out to wire.

Memory bandwidth has always been the major problem with any kind of SIMD system. SIMD gives you a higher computational bang for the buck, at the expense of a complex programming model and slower memory accesses per processing element.

Tesla does not appear to be designed to do large numbers of concurrent renders. This problem would be better handled with smaller (cheaper) rendering cards at the client side, especially in a classroom environment where final-quality renders would not be required in the normal case.

I’m not sure what you mean here. By realtime, do you mean “hard realtime” as the term is used in embedded computing (like aircraft control surfaces), or “soft realtime” as the term is used in multimedia computing?

Soft realtime is pretty easy, and can be done with the current toolset with some creative programming. Hard realtime requires at the very least a realtime OS and driver model. I highly doubt that the nVidia driver set is (or will ever be) realtime…

WilfredGuerin · July 20, 2007, 12:16pm

Thank you for the response, Jeff.

The memory control should be a separate circuit, so it is plausible that the processor would need only limited control over it, a go-ahead message at most. Will review.

I was originally going to advise a test case that used xterm or VNC to distribute frame buffers (the maximum possible), because the data management process is about the same weight as the scheduling engine needed to populate the processing queues. With today’s news on “Chinook”, I would suggest not only this, but also loading the basic graphics render (or even a text HTML form) and using an external mirror of their final-set database to test a very simple and quick process management when … it is basically an exact stereotype case for this type of processing array.

Moreover, the Chinook model should be identical for both local (graphics card) and more advanced systems, and an implementation would set a precedent for multi-user game programming education.

(No less, Chinook is literally a 1989 machine with few upgrades.)

Either way, a VNC terminal server is within the processing capabilities of these chips, and would be expected anyway as an oversight and management interface when handling multiple tasks.

Thoughts?

-Wilfred

WilfredGuerin · July 21, 2007, 12:50am

Cross reference to technical design comments:

http://www.gpgpu.org/forums/viewtopic.php?t=4598