Mass Processing Limitations: Case examples of design concerns in GPUs

I am highly concerned that the published design for Tesla will have severe problems with data access and vector modeling processes, despite the PCIe interface to other systems.

The dominant concern is that even with a gigabyte of memory per processor, any external data must first be transported into that shared memory space before it can be used.

When analysing diverse sensor data sets, which these GPU units are good for, there is a major problem with getting the data that each step calls for to the processors during iterative analysis.

I will fabricate a few examples:

Reverse rendering: assume a bank of GPUs has generated one large image, and many others from different vantage points, from a static data set. Even with known camera vectors, the correlating pixel data must be compared to fabricate potential 3D surface locations. When the number of generated images is vast, the first iteration of processing has no known space partitioning of the image data. To check the integrity of your rendering method, you must reverse model the scene and identify the numerical error between the render and the original 3D data. With the number of processes needed growing as an exponent of the render process, fetching pseudo-random image data from a storage medium (maxing out PCIe) means the processors must wait on data. Is this a design flaw, or a predictable latency that we have not found specs for?
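
To put a rough number on "the processors must wait on data", here is a minimal back-of-envelope sketch in Python. It is my own idealized model, not anything taken from the Tesla documentation: the function name `stall_fraction`, its parameters, and every figure in the example call are assumptions chosen for illustration. It assumes each fetch overlaps perfectly with the previous task's compute, which is the best case.

```python
def stall_fraction(bytes_per_task, compute_seconds_per_task, link_bytes_per_second):
    """Fraction of time a processor idles waiting on data (idealized model).

    Assumption: each task's fetch overlaps perfectly with the previous
    task's compute, so the steady-state period per task is the larger of
    the transfer time and the compute time.
    """
    transfer_seconds = bytes_per_task / link_bytes_per_second
    period = max(transfer_seconds, compute_seconds_per_task)
    if period == 0:
        return 0.0  # no work and no data: nothing to wait for
    idle = max(0.0, transfer_seconds - compute_seconds_per_task)
    return idle / period


# Illustrative numbers only (not measured): 8 MB fetched per task,
# 1 ms of compute per task, and an assumed 4 GB/s link.
print(stall_fraction(8_000_000, 0.001, 4e9))
```

With these made-up figures the transfer takes twice as long as the compute, so the processor would sit idle about half the time. Pseudo-random fetches from storage are unlikely to reach the peak link rate, so the real stall would probably be worse than this model suggests.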

When rendering for a large number of concurrent visualization terminals, such as core distributed video feeds for classroom education, the requested primitive data elements have high affinity (they are easily stored in local memory) and can thus be shared. However, accessing user-centric data from a realtime core data repository and, more importantly, exporting frame data (and the sequence compression thereof) requires a backplane pipe out of the local memory being rendered to that is significantly larger than the bandwidth required to render simple 3D elements on a mostly blank background. This implies that when operating at full speed, maxing out the number of renderable terminals, the exit bandwidth will prevent a large percentage of frames from ever exiting the system, much less being piped effectively through the appropriate protocol handler and out to wire.
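
The exit-bandwidth argument can be checked with simple arithmetic. The sketch below is only an estimate under stated assumptions: the function name `frame_export_budget`, the terminal count, resolution, pixel size, and frame rate are all hypothetical, frames are assumed to leave uncompressed, and the 4 GB/s figure is the nominal one-direction rate of a first-generation PCIe x16 link, which I am assuming is the relevant pipe.

```python
def frame_export_budget(terminals, width, height, bytes_per_pixel, fps,
                        link_bytes_per_second):
    """Return (required bytes/s, fraction of frames the link can carry).

    Assumption: every terminal receives full uncompressed frames and the
    link is used for nothing else, so this is an optimistic upper bound.
    """
    required = terminals * width * height * bytes_per_pixel * fps
    if required == 0:
        return 0, 1.0  # nothing to export, so nothing is dropped
    deliverable = min(1.0, link_bytes_per_second / required)
    return required, deliverable


# Hypothetical load: 100 terminals at 1024x768, 4 bytes per pixel, 30 fps,
# over an assumed 4 GB/s link.
required, deliverable = frame_export_budget(100, 1024, 768, 4, 30, 4e9)
print(f"required: {required / 1e9:.2f} GB/s, deliverable: {deliverable:.0%}")
```

Under these assumptions the frames need roughly 9.4 GB/s and only about 42% of them could leave the board. Compressing on the GPU before export would change the picture, but that compression then competes with rendering for the same processors.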

An additional concern, also related to distributed student education and collaboration, is the ability to process their interface images in realtime. We have units that handle imaging analysis at camera frame rate, but for mass collaboration (conference environments) it is faster to pipe video to a centralized analytic processor, which also generates the visualization after processing, than to distribute the entire thing. Even with multicast of common data and best-case processing (CPU stacks on HTT with Fibre Channel optical between cluster and MPP units), it is a problem to handle enough throughput to correlate human interface, knowledge management, and mass representation without specialized processing. The GPU option handles most of the issue, and we are using “many” of them now, but getting data in and out of their isolated memory and onto the wire efficiently is a major problem.

It appears the current designs also have these shortcomings, as indicated by the design of Tesla and the current chips/boards.

Please advise on the situation. I hope others who have explored these topics will comment in this forum as well.

-Wilfred L. Guerin