GPU Communication Protocol

Hello,

I’m a researcher at University of São Paulo, and our involves the external parallel communication with CPU, and then the usage of the GPU for computation, however the time from getting the input data, and starting the gpu computing is to high, we are looking for a way to reduce it, it’s been suggested that we interface the PCI-express directly into the FPGA that provides the data.

Where, if available, I find the communication protocol to interface the GPU, and who, if not, I contact to obtain it through legal manners?

Thank
Carlos

I’m not aware of anyone who has access to this information outside of NVIDIA, and I suspect this is deliberate. You could email NVIDIA directly to ask.

(Not sure where the best place is. Perhaps the developer feedback form? http://developer.nvidia.com/object/contact_us.html )

So your intent is to try to implement a direct copy mechanism from an FPGA to a particular NVIDIA device memory? First of all, why do you think that the current implementation is inefficient? Say that you could stream the data at full speed from the FPGA to the GPU (at the full 8GB/s provided by 16x PCIe). How would this be any better than streaming at 8GB/s from the FPGA to the CPU then from the CPU to the GPU? If you performed the transfers serially, with no overlap, you would ideally get 4GB/s. By overlapping the copies you could do even better. Is the 2x (or less) increase in bandwidth really going to make your application benefit from CUDA? Given that NVIDIA’s implementation usually hits around ~6GB/s, do you really think that you could do better, dealing with signalling and protocol issues? And if you could, would it matter?

Finally, before you go trying to implement something like this yourself, regardless of the legal issues, consider the expected benefits compared to the expected effort required. PCIe is going to limit your bandwidth and latency, if you want to do better, you would need to move to another interconnect requiring physical, protocol, signalling, etc design…

We have a real-time system and face the same problem. We capture data from the sensor, crunch on it in the GPU and send it to the control. If we have to send data from the sensor to the CPU’s memory, then from there to the GPU, then from the GPU to the CPU’s memory, and from the CPU’s memory to the control, we have incurred a substantial amount of latency in the copying to and from the CPU’s memory. Ideally, we would DMA data direcly from the sensor to the GPU and from the GPU to the Control. The issue is latency not just throughput.

Marc Reinig

UCO Lick Observatory

Laboratory for Adaptive Optics

How much data are you transferring? Are you 1) transferring a few bytes and then kicking off a GPU kernel, or 2) streaming a series of samples to the GPU and batch processing them?

If 2) is the case then you can usually hide the latency via software pipelining and overlapping memory copies with kernels as long as the compute density of the kernel is relatively high.

We transfer 128 KB from a CCD to the GPU, process it, and transfer 64 KB to a deformable mirror. We operate at a frame rate of 2KHz and cannot tolerate a latency of larger than 500 microseconds. Pipelining only adds to the latency. The extra overhead of the transfers to and from the CPU’s memory is a significant problem.

This isn’t my field, by here’s an interesting paperwhich shows that roundtrip PCIe latencies are about 1000ns for 2K transfers. That’s a faster latency than I expected.

If GPUs exposed device to device bus transfers, it’d be very interesting to see how it’d be used. There’s a new $120 Ethernet board… it’s interesting in that it has its own embedded CPU onboard, running Linux and open source control software, and therefore you can do a lot of processing or data routing without having to involve the CPU. You could imagine modifying the onboard software to talk to a GPU too… giving you effectively an ethernet port access to a GPU without a CPU latency layer.

Did you try zero-copy access?, it basically skips the whole memory copy to the device, but instead copies it inside the kernel. It might shave off some latency in your case, as with that small amount of data, you probably do a lot of processing on the data, so the latency of copying it over the PCIe bus will be probably hidden, same thing for the writing to CPU memory.

Isn’t zero-copy only used on systems that have an integrated GPU that shares memory with the CPU? I’m using C1060s.

Also, there is no way of hiding latency. Latency is measured from the time the input is ready till the time the output is done. You can overlap memory transfer of new data with the processing of the previous data to increase throughput, but that doesn’t reduce the latency incurred.

Latency improvements are something I’ve been working on, and you’ll see the fruits of that work show up around 3.2.

I don’t know how much I can do on C1060, but there are significant improvements coming for GF100.

So let’s see. You have 500us to transfer 128KB from the CCD to some processor, process it, and send 64KB to a mirror. Each of these is a uni-directional transfer so you only get 1/2 of the available PCIe bandwidth.

Let’s assume ideal, full bandwidth over PCIe plus SPworley’s 1us latency.

The transfer from the CCD to the processor: 1us + 128KB/(4*2^30) = 31.5us

The transfer from the processor to the mirror: 1us + 64KB/(4*2^30) = 16.5us

So you ideally could have ~450us to play with on the processor. On a 3Ghz CPU with an IPC of 1 (pessimistic unless you are highly memory bound) you get about 1.3 million dynamic instructions. First question is how many do you need? If it is less than this, stop here.

On a Tesla C1060, assuming 30 1Ghz cores, you could get around 90 million dynamic instructions in the absolute best case including startup/teardown over PCIe (again assuming that you are not memory bound). Would this make a difference?

If your measured CCD-CPU or CPU-mirror latencies are significantly worse than my estimates, you might be fine just tuning your PCIe controller on your CCD and mirror and using the CPU for processing. If the CPU is the bottleneck, and switching to the GPU incurs a severe performance hit via PCIe you might consider using something like an FPGA, where you have the ability to modify the PCIe controller yourself. Just realize what you are getting yourself into before hand. My opinion is that if I started working on something like that now, I would finish sometime after NVIDIA releases a card based on the successor to PCIe :) .

I think you’re missing the point. I wasn’t asking how to evaluate the situation to decide whether to use a CPU, a GPU or give up developing the instrument. I was complaining that because of not being able to DMA to and from the GPU, we can’t use it. Instead we are developing an FPGA implementation.

The inability to DMA between the GPU and non memory devices is a significant limitation for hard real-time processing. This may not be a large market segment for NVIDIA, but it will not be addressed unless we bring it up as a limitation.

It’s a good point that such DMA control would give GPUs a big advantage in such latency limited situations.

But Greg’s question and musing is also interesting. The GPU may have too high latency now… it’s definitely challenging. But how far off is the GPU from what you need? Is it less than your 500 us maximum latency, wildly greater and therefore impractical, or small enough but has other issues?

It’d probably be interesting to at least see how far off the GPU really is by just coding a dummy tool that simply does a placeholder simple workload like summing all the data, and actually measuring what latency you get.

There’s no question a GPU tool would be, without exaggeration, an order of magnitude easier to design, test, and implement than an FPGA. Costs would also likely be an order of magnitude less. However it’s also true that an FPGA could clearly optimally minimize latency since you’re including the controller directly connected to the processor, and that’s can’t be beat.

So I’d think it’d be worthwhile to spend at least a day or two to verify that the GPU really would be too high latency… if it ends up being feasible, it’d save you man-years of work.

(*) Note. I am biased because the FPGA work I once experimented with was a terrible failure, it seemed wonderfully promising but even simple features took enormous effort. I was inexperienced, though, but it’s still such a huge contrast to GPU coding!

Perhaps you are right, I have certainly heard this argument before. However, the statement that you are making, along with what I have heard in the past, is qualitative and I have never seen any data to back it up. Most of the data that I have seen actually contradicts this point. For example, SPWorley’s source for 1us point to point transfer times of 2KB transfers over PCIe suggest that PCIe transfers can be very low latency for single hops. This link seems to corroborate his result http://cp.literature.agilent.com/litweb/pdf/5989-4076EN.pdf (single direction transfers see to be around 400-500ns for single-word packets). With no DMA, you are forced to pay for two transfers, but is 2us really so much different than 1us?

I will admit that I am not an expert in this area; I have never run into a problem that had hard-real requirements. However, that lack of DMA should only be costing you ~2us out of ~500us. I don’t mind being told that I am wrong, but I would at least like to know why.

People often arrive at this conclusion, and based on intuition it seems reasonable, but I am not convinced.

I wrote an application that measures the average latency of a memcpy to the gpu.

include <cstdio>

#include <cassert>

int main(int argc, char** argv)

{

		int * data;

		char temp = 0;

		unsigned int iterations = 1000000;

		cudaSetDevice(1);

		assert(cudaMalloc((void**)&data, 1) == cudaSuccess);

		cudaEvent_t start;

		cudaEvent_t stop;

		cudaEventCreate(&start);

		cudaEventCreate(&stop);

		cudaEventRecord(start);

		for(unsigned int i = 0; i < iterations; i++)

		{

				cudaMemcpy(data, &temp, 1, cudaMemcpyHostToDevice);

				assert(cudaSuccess == cudaThreadSynchronize());

		}

		cudaEventRecord(stop);

		cudaEventSynchronize(stop);

		float time = 0;

		cudaEventElapsedTime(&time, start, stop);

		time = (time / (float)iterations) * 1000.0f;

		printf("Latency %fus\n", time);

		return 0;

}

The result is 9.97us on a C1060… well under your 500us limit. For a 128KB transfer, I get 83.93 us.

And on GTX 480 / linux x86_64 / CUDA 3.0, I get 6.98us.

I wonder if the OP is running windows?

Same system with the new 3.1 beta: 6.08us :)

No, zero copy on non integrated GPUs means that you access the CPU memory over the PCIe bus in your kernel, we do use it with C1060’s and with better results than anticipated. That probably does reduce latency (it did in our case), as you skip the cudaMemcpy’s, and the more you need to calculate per data-element, the better the result (your latency will then be very close to the latency of the kernel itself, otherwise it will be close to the sum of the cudamemcpy’s (then you get your calculation for free))