Hardware Suggestions: Random thoughts and ideas for nVidia

Sorry if this post rambles on a bit, but I had a few hardware-related ideas I wanted to share with nVidia (and see what everyone else thought as well). Here goes…

  1. Bandwidth-doubling add-in card: What about producing a small add-on card that would fit into an x16 slot and have an SLI connector? The idea here would be to connect this card to your CUDA-enabled GPU (for example, a GTX 295), which would then be able to transfer data over SLI to/from the add-on card. The benefit here would be that a single GPU could have access to (theoretically) x32 bandwidth. In some cases, people may have kernels which are highly bandwidth-intensive but not very compute-intensive (since some algorithms/code don’t parallelize as well as others). By simply throwing in this add-on card, they could effectively double (+/-) their computing speed. Perhaps the hardware engineers could find a way to daisy-chain these things, so that a single CUDA card (again, let’s use the GTX 295 as an example) could be connected to two of these cards, giving it a theoretical x48 bandwidth.

I’m not a hardware engineer (though I did have some computer engineering classes in school), but I don’t think this would even be a particularly complicated card to make…just a simple microprocessor to handle communications between the add-on card and the PCIe host, and between the add-on card and the GPU. Perhaps throw in a single memory chip to act as a transfer buffer. The cost should be less than $100 for something like this (and that’s new…once they’ve been out for a while, I would think it could be much less than that).

This would also be useful to non-CUDA folks (e.g. gamers), since I imagine that some games are fairly intensive when it comes to transferring textures back and forth to the graphics card, and a cheap add-on like this that doubled the transfer bandwidth would be a great way to take a lead over “that other company” (since they were/are already behind in that department).
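To put rough numbers on the idea above, here is a back-of-envelope sketch. The per-lane figure is an assumption: roughly 500 MB/s of usable bandwidth per lane, per direction, for PCIe 2.0 (real-world throughput would be lower).

```python
# Back-of-envelope PCIe bandwidth arithmetic for the add-in-card idea.
# Assumes PCIe 2.0: roughly 500 MB/s usable per lane, per direction.
MB_PER_LANE = 500

def pcie_bandwidth_gb(lanes):
    """Theoretical one-direction bandwidth in GB/s for a given lane count."""
    return lanes * MB_PER_LANE / 1000

print(pcie_bandwidth_gb(16))  # a single x16 slot: 8.0 GB/s
print(pcie_bandwidth_gb(32))  # two slots bridged together: 16.0 GB/s
print(pcie_bandwidth_gb(48))  # daisy-chained, as suggested: 24.0 GB/s
```

These are ceiling figures; whether the SLI link could actually carry the second slot’s traffic is a separate question.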

  2. Mobile Tesla card: I don’t know how far off double-precision support is for ‘normal’ mobile GPUs, but over the past year or so, I’ve seen a lot of people asking when it will be available. CUDA users in highly technical fields (e.g. engineering, finance) really need double-precision support, and thus mobile CUDA sometimes doesn’t ‘cut it’ for them.

What I’m thinking of is a mini PCIe board (picture) with a GPU and some memory, so it would basically be a mobile Tesla card. As I mentioned, putting double-precision support in this would make it a killer app, especially if nVidia and ATI aren’t planning to have DP available in their normal mobile GPUs in the near future. I’d say take one of the higher-end 9M or GT100 series chips, put 1GB of RAM on it, and allow the clock to be dynamically controlled by the user (to save battery power and keep it from overheating)…and you’d have a bunch of people who would like to upgrade their older laptops, use CUDA in their new slim laptops/netbooks, or even those who have a higher-end mobile GPU already (in a larger notebook) but simply want more mobile CUDA power. For DP support, use the GT200b from the new GTX 260-216 and clock it way, way down (to something like 20% speed).

If you haven’t noticed…lots of newer laptops/netbooks are coming with multiple mini PCIe slots for things like WWAN, WLAN, and so forth, and unless you’ve totally maxed out (i.e. purchased every available option) on your new computer, chances are you have an available slot. If not, and you can deal with an external wireless card, you could remove the internal WLAN card to make room for the mobile Tesla.

Honestly, the reason I haven’t bought a new laptop lately is that it seems impossible to find a reasonably small one (13.3" or 14.1" screen) with a decent GPU to do CUDA programming on and demonstrate GPU-accelerated applications. Being able to buy a GT200b-equipped mobile Tesla and just pop it into one of these things would really be ideal for me, and, it seems, for many others as well.

I think the problem here is that the SLI bridge is not a particularly high bandwidth link. However, now that I look around, it seems to be impossible to figure out what the real bandwidth of the bridge is. Whatever it is, there’s no reason to believe it is as fast as the PCI-Express bus. You might get some bandwidth improvement here, but not necessarily 2x.

Bandwidth again is the killer here. :) The Mini PCIe interface is x1, and might not even be PCI-Express 2.0 speed yet. That means the host-to-device bandwidth will be ~250 MB/sec, which will limit the utility of such a product.
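To put that ~250 MB/s in perspective, a rough transfer-time sketch (the per-lane rates below are nominal PCIe 1.1/2.0 figures, not measurements):

```python
# Rough transfer-time arithmetic for a Mini PCIe x1 link.
def transfer_seconds(megabytes, mb_per_sec):
    """Time in seconds to move a buffer at a given sustained rate."""
    return megabytes / mb_per_sec

# Copying the proposed card's full 1 GB of RAM across the link:
print(transfer_seconds(1024, 250))   # PCIe 1.1 x1 (~250 MB/s): ~4.1 s
print(transfer_seconds(1024, 500))   # PCIe 2.0 x1 (~500 MB/s): ~2.0 s
print(transfer_seconds(1024, 8000))  # desktop PCIe 2.0 x16: ~0.13 s
```

So kernels that stream lots of data from the host would be an order of magnitude slower to feed than on a desktop slot; compute-heavy kernels that keep data resident on the card would suffer much less.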

Re: 1):

The SLI bridge connector is relatively low bandwidth - it’s only designed to transmit digital video between cards.

The fundamental problem is that memory bandwidth is limited by the chip - the width of the memory bus (i.e. the number of pins) and the memory clock - even if you had some external card with additional memory, there would be no way to get the data into the GPU any faster.
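That width-times-clock point can be made concrete with a small sketch. The GTX 280 figures here (512-bit bus, 1107 MHz GDDR3, double data rate) are used purely for illustration:

```python
# Peak memory bandwidth is fixed by the chip's pinout and memory clock:
#   bandwidth = (bus width in bytes) x (effective transfer rate)
def mem_bandwidth_gb(bus_width_bits, clock_mhz, data_rate=2):
    """Peak bandwidth in GB/s; data_rate=2 for DDR-style memory."""
    return (bus_width_bits / 8) * (clock_mhz * 1e6 * data_rate) / 1e9

# GTX 280 for illustration: 512-bit bus, 1107 MHz GDDR3.
print(mem_bandwidth_gb(512, 1107))  # ~141.7 GB/s
```

Nothing on an external card changes either factor, which is the point above: the ceiling is set by the GPU package itself.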

Interesting ideas, though!

A mobile CUDA card would be very cool. Particularly if it had driver support such that the Intel Integrated stuff would only be copying a framebuffer :).

PCIe seems like a bunch of standards overhead; PCIe 3.0, according to Wikipedia, is only twice as fast as PCIe 2.0. Also, it seems like nVidia could be at the mercy of Intel and AMD if the chipset licensing wars continue (any updates on the Core i7 license?).

On-GPU memory seems like the way to go, though another layer of memory hierarchy could be useful, e.g. 16GB of DDR2 (which is cheap, and could be plenty fast once freed from CPU motherboard restrictions).

Actually, I interpreted the suggestion as a way to double the host-to-device bandwidth by effectively using two PCI-e slots. A pair of PCI-e 2.0 x16 slots could come pretty close to saturating the host memory bus on a Core i7 system.
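The arithmetic behind “pretty close to saturating” might look like this; the triple-channel DDR3-1066 figure is an assumption about a typical Nehalem Core i7 setup:

```python
# Two PCIe 2.0 x16 slots vs. the Core i7 host memory bus (rough numbers).
pcie2_gb_per_lane = 0.5               # nominal GB/s per lane, per direction
dual_x16 = 2 * 16 * pcie2_gb_per_lane

# Triple-channel DDR3-1066 on a Nehalem Core i7: 3 channels x ~8.53 GB/s.
host_mem = 3 * 8.53

print(dual_x16)             # 16.0 GB/s aggregate over the two slots
print(host_mem)             # ~25.6 GB/s of host memory bandwidth
print(dual_x16 / host_mem)  # ~0.63, a sizeable fraction of the bus
```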

Ah right, I misunderstood the original suggestion. Still, the same argument applies - current chips only have a single PCI-E interface, so adding another connector wouldn’t help. That said, we’re always looking at new hardware interface technologies, and new features like zero-copy memory are already helping reduce the bandwidth limitation.