Tesla S1070 server hardware trying to process 100s of GBs of data

I’ve been poring over documentation on the Tesla S1070 and code samples for CUDA for the past week. I’m trying to put together a server system that can leverage the power of a Tesla S1070 with an application written with CUDA. This application will have to churn through 100s of GB of data per day. Each day that amount will increase by 100GB until it reaches a maximum size of 25TB. What we’re unable to ascertain is how the Tesla S1070 will or will not be limited by the host server’s CPU and storage system. I know that it’s recommended that we supply a single CPU core for each GPU. What I don’t understand is why.

In trying to allow the S1070 to do its work against 100GB - 25TB of data, where should we be looking for bottlenecks?


This is to be able to start a new kernel as soon as the last one is finished. If you handle 2 GPUs with 1 CPU core, you might introduce some extra latency in starting the next kernel. Whether this is a problem depends mostly on the running time of a single kernel call.

BTW, it is 1 CPU core per GPU, so with a quad-core you need only 1 CPU per Tesla.

I think you will find your bottleneck either in getting the data to the PC connected to the Tesla, or in the bandwidth between PC & Tesla.

This is not an easy question, and its answer depends entirely on the specifics of your algorithm.

For starters, is I/O a bottleneck when your application runs on the CPU?

When CUDA accelerates the processing part of your algorithm by an order of magnitude, it moves the strain onto the other subsystems (meaning disk and network). If you are constrained there already, it will grow into a critical issue.

To answer the question in more detail, you will have to analyze your algorithm’s access pattern. If your algorithm simply loads a small part of the dataset (say, 4GB) into GPU RAM, works on it, then unloads it, then processing 25TB/day translates simply to 290MB/s. This will require a robust RAID array of fast drives (and an interconnect faster than 1-Gig Ethernet), but is manageable. If the access pattern is less ideal, you may have problems.
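The 290MB/s figure above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1TB = 10^12 bytes, 1GB = 10^9 bytes) and a perfectly even load across 24 hours, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the sustained rate needed for 25TB/day.
# Assumptions (not from the thread): decimal units, load spread evenly over 24h.

SECONDS_PER_DAY = 24 * 60 * 60
GB = 10**9
TB = 10**12


def sustained_rate_mb_per_s(bytes_per_day):
    """Average rate in MB/s needed to get through bytes_per_day in one day."""
    return bytes_per_day / SECONDS_PER_DAY / 1e6


rate = sustained_rate_mb_per_s(25 * TB)

# With 4GB chunks, how many chunks per day and how long may each one take?
chunks_per_day = (25 * TB) / (4 * GB)
seconds_per_chunk = SECONDS_PER_DAY / chunks_per_day

print(f"required rate:    {rate:.1f} MB/s")
print(f"chunks per day:   {chunks_per_day:.0f}")
print(f"budget per chunk: {seconds_per_chunk:.3f} s (load + compute + unload)")
```

The per-chunk budget is the useful number here: disk read, host-to-GPU copy, kernel time and copy-back for each 4GB chunk all have to fit inside it, unless they are overlapped.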

This is not an important factor, but with the abundance of multicore CPUs and CUDA’s requirement for code to be (mostly) single-threaded, this is a logical suggestion. (This was more a concern in pre-2.1 versions of CUDA.)

Feel free to contact me by email if you’d like to discuss specifics.

Even with the CU_CTX_SCHED_YIELD bug fixed, having one CPU core per GPU is important for CUDA just to reduce latency. Plus, you can always partition work between CPU and GPU to maximize throughput.

Next question:

Will CUDA, or for that matter the Tesla S1070, run on Windows Server 2003 x64? I’ve found some old posts saying “yes” because Windows XP64 uses the same kernel. They were old, and it’s late so nobody is going to pick up the phone.

Any experts know the answer to this one?


CUDA works on x64 just fine.
However, for the S1070 there are no Windows drivers released yet, and no official ETA, AFAIK.


So that means we’d have to go with a server that can take 4 C1060 cards …

Alex’s post is most informative and you really need to address his points. You’re jumping to huge conclusions, since putting so much GPU power into one box would strain the PC’s own limitations immensely.

Alex is completely correct. WHAT IS YOUR BOTTLENECK? Is it disk speed? Network speed? RAM? Raw FLOP processing?

WHAT IS YOUR APPLICATION? If you’re doing something like scanning emails for keywords, GPU power isn’t likely your problem, that’s all about data bandwidth over disk and bus pipes.

If you’re processing seismic data from sensor runs, you may indeed have a processing bottleneck, but perhaps you could partition the data to use multiple simpler machines.

Cramming 4 C1060 cards in one machine is awesome but could easily be throwing your effort and money into solving the wrong problem.


I feel like I just got kicked in the stomach :-) I am not the programmer for this application so I may not be able to give you all the details. It’s worth mentioning that the project has changed a little but we’re still after the Tesla.

Instead of processing 100GB files once per day, we’re going to be listening to the data stream across our network. Within that data will be keywords and numbers we need to look for. Certain keywords and numbers will trigger an application that is going to be running on the host system. Once triggered, we want the application (written in CUDA) to leverage the processing power of the Tesla GPUs. It will run various algorithms on the data and trigger that same application to send data back across the network. The data it sends back will be small, 32-64k.

So now we’re listening to data come across a network at no greater speeds than 1Gbe, processing certain parts, and sending it back. To me, it sounds like the stress moves to the NIC which means we need to have free CPU cycles to handle it. As for handing it off to the C1060 (we’re thinking 2 per box, 2 boxes) I assume it’s going to be a PCI bus issue from NIC → RAM → GPU.

Does that sound better?

And if Tesla is the solution, you can buy a prebuilt workstation with the 4 card configuration:


It will run you about $8.5k. (Tower configuration, not rackmount.)

If the stress moves to the NIC, then that means the CPU might be often free, waiting for data.

The speed of GigE is roughly 0.1 GB/s. The speed of RAM and PCIe is 5-10 GB/s.
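To put those two numbers side by side, here is a rough sketch of how busy the PCIe link would be if the GPU is fed only by what one GigE port delivers. The 5 and 10 GB/s values are the range quoted above, not measurements, and the function name is just for illustration.

```python
# How much of the host-to-GPU link does a saturated GigE feed actually use?
# The PCIe figures are the 5-10 GB/s range from the post, not benchmarks.

GIGE_BITS_PER_S = 1e9
gige_bytes_per_s = GIGE_BITS_PER_S / 8  # raw line rate, before protocol overhead

PCIE_LOW_BYTES_PER_S = 5e9
PCIE_HIGH_BYTES_PER_S = 10e9


def bus_utilisation(feed_rate, bus_rate):
    """Fraction of bus bandwidth consumed by a feed arriving at feed_rate."""
    return feed_rate / bus_rate


busy_low = bus_utilisation(gige_bytes_per_s, PCIE_LOW_BYTES_PER_S)
busy_high = bus_utilisation(gige_bytes_per_s, PCIE_HIGH_BYTES_PER_S)

print(f"GigE raw rate: {gige_bytes_per_s / 1e9:.3f} GB/s")
print(f"PCIe busy at 5 GB/s:  {busy_low:.2%}")
print(f"PCIe busy at 10 GB/s: {busy_high:.2%}")
```

Under these assumptions the bus is idle well over 95% of the time, which is the point being made: with a single GigE feed, the network runs out long before RAM or PCIe does.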

Consider a TOE adapter (TCP Offload Engine). It handles TCP entirely in hardware, relieving the CPU of this task.

You can write a filter driver that looks at the data passing through and triggers the GPU application…


If you think the scanning is still a problem, how about an FPGA card that takes the network traffic in, runs it through state machines, identifies the keywords, and triggers the CPU to run a GPU application on some data? This means you need to have a lot of network-related logic on your FPGA.


Have you checked the on-stream paradigm from Netezza? www.netezza.com They give you the ability to run custom SQL queries, which are executed in an MPP setup. Thus compute and storage are brought closer together, and this could be a cool thing too… but yeah, it could be costly for your application.

Best Regards,

What I heard here at my company is that people stay away from these because they lessen CPU load, but also decrease throughput.

Maybe using UDP is an option, as it is faster, as I understand it.

Sarnath, Riedijk …

All good info. One thing I failed to mention is that the data will be coming over UDP so TOE is out of the question. I thought about doing the same thing until I heard about the UDP part.

Alex …

The fact that data can’t come to us faster than 1Gbe is promising when it comes to the “real-time” processing we’re after. This was a driving force behind us moving away from end of day, 100GB of data approach. We’re not 100% committed to it yet, but things look good.

I wonder if it would be reasonable to have 1 NIC devoted to receiving data and 1 NIC devoted to sending it back out. Assuming that NIC_1 is pegged at 1Gbe, NIC_2 could be responsible for sending it back out. NIC_2 wouldn’t be sending near the load NIC_1 is receiving. With dual quad core CPUs in the machine we’re at a 1:1 for the core/GPU (2 x C1060 per machine) with 6 cores to spare. That leaves plenty of processing power for each NIC.

Ideas? Thoughts?

Thanks for acknowledging our answers.

This may be of help (just my 2 cents):

Make sure you put the network cards under different host-PCI bridges so that they don’t contend over the PCI bus for data… (Ultimately they would contend for the system bus anyway, but the system bus could be a lot faster than the PCI bus, so the wait time could be shorter…)

This is just my intuitive guess based on some old hardware knowledge. Not sure whether these things still hold true for modern-day hardware.

EDR, thanks for the info. Could be helpful some time.

Maybe TOE works well when there are a lot of small TCP packets, because the (frame overhead / data) ratio will be high in these cases.

When packet sizes are quite big, it probably does NOT matter at all.
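The overhead-to-data ratio mentioned above is easy to put numbers on. The sketch below says nothing about how a TOE adapter actually performs; it only shows how the ratio shrinks as payloads grow. It assumes standard Ethernet framing (38 bytes on the wire per frame, counting preamble, header, FCS and inter-frame gap), IPv4 and TCP headers without options (20 bytes each), and an 8-byte UDP header. Padding of very short frames is ignored, and the names are made up for illustration.

```python
# Per-packet overhead relative to payload, for small vs. large packets.
# Assumptions: standard Ethernet framing, IPv4 and TCP without options,
# no padding of undersized frames.

ETHERNET_ON_WIRE = 8 + 14 + 4 + 12  # preamble+SFD, header, FCS, inter-frame gap
IPV4_HEADER = 20
TCP_HEADER = 20
UDP_HEADER = 8


def overhead_ratio(payload_bytes, transport_header):
    """Overhead bytes divided by payload bytes for a single packet."""
    overhead = ETHERNET_ON_WIRE + IPV4_HEADER + transport_header
    return overhead / payload_bytes


for payload in (64, 512, 1460):
    tcp = overhead_ratio(payload, TCP_HEADER)
    udp = overhead_ratio(payload, UDP_HEADER)
    print(f"payload {payload:>4} B: TCP overhead {tcp:.1%}, UDP overhead {udp:.1%}")
```

With 64-byte payloads the headers outweigh the data, while near a full 1500-byte frame they are around 5%, which fits the guess that per-packet costs matter mostly for streams of small packets.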

I don’t have first-hand or even 2nd- or 3rd-hand info on the performance of TOE adapters.

Gigabit Ethernet NICs can send and receive 1Gb/s simultaneously, so you don’t need two of them.

For GigE you do not need TOE, and the fact that you’re using UDP only helps the situation. (UDP by its nature is lightweight, unless the UDP data packets themselves implement a heavy TCP-like protocol.)

What actually happens when you fire off your CUDA code? What do the algorithms entail? How much data are they working on at one time? How long does it take for them to run on the CPU? How often are they triggered per second, and do you need their results right away?


Thanks for the detailed info!

Best Regards,