PCIe impact: give some examples of how PCIe affects your applications

Hi all,

I’m working on a project that aims to make CPU→GPU and GPU→CPU memory transfers faster and easier to manage in user source code.

To quantify the potential gains of this project, I would appreciate your help: please share your experiences with data-communication problems.

I’m trying to determine whether this is still a real problem, or whether it has been largely eliminated by newer GPU features such as transfer/compute overlapping, and whether it is confined to a few very specific applications.

For example, could you answer the following:

  • What domain was the program from?
  • Was it an industrial application?
  • Were you able to overlap and hide all data communications?
  • How much time did the GPGPU project take you compared to the CPU version?

If you didn’t hide all the data communications:

  • How much did data communications affect the performance of your application?
  • What would the gains be if the need for transfers were completely removed?
  • Were the communications a bottleneck?

If you have a real example of your own, or a paper dealing with this subject, any help would be greatly appreciated.

Thank you all.

I looked into this a little bit at one point. The conclusion was basically that most existing CUDA applications are written such that they can tolerate the latency and bandwidth limitations of PCIe. Furthermore, the bulk-synchronous programming model encourages developers to offload very heavyweight tasks that have a lot of potential for latency hiding. The result is that existing CUDA applications would see no benefit from improving the bandwidth/latency of PCIe.

I wrote up some of the findings informally here: http://www.gdiamos.net/papers/cudaLatency.pdf

Of course there may also be some applications that are simply not ported to CUDA because they are latency sensitive, and I haven’t come up with a good way of studying these without actually going out and writing them by hand…

Thank you for your response.

I’m very interested in your study on this.

Is there any way you could share the source code, the detailed models, or the modifications you made for this?

I’ve often heard GPGPU described as “a technological demonstration, more a very specific accelerator than a reliable industrial solution”.

Maybe you’re right, and the reason is that the applications offloaded today were chosen precisely because they aren’t PCIe bandwidth/latency dependent.

But I really want to know whether you or anybody here has had bad experiences with PCI Express.

Did you write any by hand?

As an anecdotal point, I had a problem when loading geometric models of over 50 million vertices into the GPU, simply because of the limited device memory.
I experimented with various compression and compaction methods. One of the last methods I tried was the simplest… leave the whole database in host memory and access it via zero-copy PCIe. I expected that it’d be horrifically slow, since the data is accessed so intensely… but was pleasantly shocked to find that on Fermi, speeds were barely affected (less than 1%). On GT200, speeds were hit by about 35% (still far far better than I expected.)

I now store my base geometry data on the host and use zero copy even when there’s room on the device! There was no penalty so why not make the robust method the default?
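
For reference, the whole zero-copy setup is only a few lines. Here’s a minimal sketch with illustrative names, sizes, and a dummy kernel standing in for my real geometry work:

// Minimal zero-copy sketch (illustrative names, sizes, and kernel).
// The vertex buffer is pinned + mapped host memory, so the kernel reads it
// directly over PCIe instead of from a copy in device memory.
#include <cuda_runtime.h>

__global__ void touchVerts(const float4* verts, float* out, size_t n) {
    // Grid-stride loop so the launch stays within old grid-size limits.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = verts[i].x + verts[i].y + verts[i].z;   // placeholder for real geometry work
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);         // must come before context creation on older toolkits

    const size_t n = 50 * 1000 * 1000;             // ~50M vertices, ~800 MB as float4
    float4* hostVerts = 0;
    cudaHostAlloc((void**)&hostVerts, n * sizeof(float4), cudaHostAllocMapped);
    // ... fill hostVerts with the model ...

    float4* devVerts = 0;                          // device-side alias of the same host buffer
    cudaHostGetDevicePointer((void**)&devVerts, hostVerts, 0);

    float* devOut = 0;
    cudaMalloc((void**)&devOut, n * sizeof(float));

    touchVerts<<<1024, 256>>>(devVerts, devOut, n);
    cudaDeviceSynchronize();

    cudaFree(devOut);
    cudaFreeHost(hostVerts);
    return 0;
}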

I give credit to NV’s thread-level parallelism design, which can hide such outrageous latencies (likely multiple thousands of clocks for PCIe reads).
And further credit to GF100’s design, specifically the on-chip caches that effectively eliminated the latencies altogether.

Of course of course of course you can have tasks which are completely limited by PCIe, and those tasks may even be common. But I haven’t hit one yet.

I was just talking at GTC with Mr. Anderson about HOOMD’s multi-GPU support, where PCIe communication was a limiting factor in coordinating work.

There are also plenty of common examples of trivial problems, like adding two giant vectors, which are PCIe limited and are often one of the first CUDA programs people try. Each time a new programmer does that, they are left with an immediate negative impression, since their GPU code is slower than the same code on the CPU purely because of these overheads.
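
As a rough illustration of that first-impression effect, here’s the kind of timing breakdown I mean (sizes are illustrative, and the commentary is a ballpark expectation rather than a measurement):

// Sketch of why plain vector addition is PCIe-bound; the point is the ratio
// between transfer time and kernel time, not the exact numbers.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void add(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];              // one add per 12 bytes of bus traffic
}

int main() {
    const int n = 1 << 26;                               // 64M floats, 256 MB per vector
    const size_t bytes = (size_t)n * sizeof(float);

    float* ha = (float*)malloc(bytes);
    float* hb = (float*)malloc(bytes);
    float* hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0, 0);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // 256 MB over PCIe
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);   // 256 MB over PCIe
    cudaEventRecord(t1, 0);
    add<<<1024, 256>>>(da, db, dc, n);
    cudaEventRecord(t2, 0);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // 256 MB back
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);

    float h2d, kernel, d2h;
    cudaEventElapsedTime(&h2d, t0, t1);
    cudaEventElapsedTime(&kernel, t1, t2);
    cudaEventElapsedTime(&d2h, t2, t3);
    // The two transfer times typically dwarf the kernel time, which is why this
    // "hello world" leaves such a bad first impression.
    printf("H2D %.1f ms  kernel %.1f ms  D2H %.1f ms\n", h2d, kernel, d2h);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}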

The models in this paper used Ocelot to capture and timestamp all of the transactions between the CUDA runtime and the GPU device, and then replay them in a network simulator.

The trace capturing function in Ocelot is defined here: http://code.google.com/p/gpuocelot/source/…ngCudaRuntime.h . I haven’t tried it in about a year, but ideally you should be able to use it to capture a trace of all messages that the CUDA runtime sends to the GPU and how much slack (CPU time) you have between each call.

I personally haven’t. I usually experiment with CUDA benchmarks such as Parboil and Rodinia, but obviously these aren’t right for this kind of study. The only applications that I end up writing deal with specific projects that my group is working on; none of them so far have been PCIe bound, and they can usually be offloaded completely to the GPU.

You might also want to take a look at this: http://www.ece.ubc.ca/~aamodt/papers/hwong.wddd2009.pdf . It is a limit study of porting existing CPU benchmarks to a CPU/GPU system without modification, assuming that you could arbitrarily move all live data to the GPU, do some computation, and move all results back. They show that the right place to split the program changes with the interconnect performance. However, this approach does not consider rewriting the existing applications.
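
As a crude way to think about that result (my paraphrase of the trade-off, not the exact model in the paper): offloading a region only pays off when t_cpu > bytes_in/BW + t_gpu + bytes_out/BW, so as the interconnect bandwidth BW drops, the profitable split points move toward larger, coarser-grained regions.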

Just my 2 cents:

How good or bad PCIe latency is depends on how much time you spend crunching data on the GPU. I have seen the PCIe issue pop up in “real-time” applications, where frames of data keep arriving: you move a frame to the GPU, compute on it, and move it back again… That’s where it gets dirty.
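
When the frames are independent, the usual trick is to double-buffer so the upload of frame i+1 overlaps with the processing of frame i. A minimal sketch (names, sizes, and the per-byte work are illustrative):

// Double-buffered frame pipeline sketch: overlap the transfers for one frame
// with the kernel working on the other. Names and sizes are illustrative.
#include <cuda_runtime.h>

__global__ void processFrame(const unsigned char* in, unsigned char* out, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = 255 - in[i];                       // placeholder per-byte work
}

int main() {
    const int frameBytes = 1920 * 1080 * 4;
    const int numFrames  = 100;

    unsigned char *hIn[2], *hOut[2], *dIn[2], *dOut[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        // Pinned host buffers: required for cudaMemcpyAsync to actually overlap.
        cudaHostAlloc((void**)&hIn[b],  frameBytes, cudaHostAllocDefault);
        cudaHostAlloc((void**)&hOut[b], frameBytes, cudaHostAllocDefault);
        cudaMalloc((void**)&dIn[b],  frameBytes);
        cudaMalloc((void**)&dOut[b], frameBytes);
        cudaStreamCreate(&stream[b]);
    }

    for (int f = 0; f < numFrames; ++f) {
        int b = f & 1;                              // ping-pong between the two buffers
        cudaStreamSynchronize(stream[b]);           // wait until buffer b is free again
        // ... copy the next incoming frame into hIn[b], consume hOut[b] ...
        cudaMemcpyAsync(dIn[b], hIn[b], frameBytes, cudaMemcpyHostToDevice, stream[b]);
        processFrame<<<1024, 256, 0, stream[b]>>>(dIn[b], dOut[b], frameBytes);
        cudaMemcpyAsync(hOut[b], dOut[b], frameBytes, cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();                        // drain both streams (cleanup omitted)
    return 0;
}

Even then it only helps when the kernel time is comparable to the transfer time; if the GPU does very little work per frame, the bus still dominates.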

Like SPWorley, I used zero copy in a computation-heavy kernel with great success:
In a raytracer whose kernel runs for about 15 ms per call, I can upload two 800x600x4-byte images basically “for free” (no measurable time increase).

I was less successful when trying asynchronous memcpy for large data transfers (it was never faster than the blocking memcpy, and always caused a measurable slowdown).
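
For what it’s worth, my understanding is that an async memcpy only genuinely overlaps when a few conditions are met, and I may simply have missed one of them. A minimal checklist-style sketch (buffer names are made up):

// Conditions for cudaMemcpyAsync to genuinely overlap with kernel work
// (as far as I understand them); buffer names are made up for this sketch.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // 1. The device must be able to overlap copies with kernels at all.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("deviceOverlap = %d\n", prop.deviceOverlap);

    // 2. The host buffer must be page-locked; with plain malloc() memory,
    //    cudaMemcpyAsync is effectively not asynchronous.
    const size_t bytes = 64 << 20;
    float* hBuf = 0;
    float* dBuf = 0;
    cudaHostAlloc((void**)&hBuf, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&dBuf, bytes);

    // 3. The copy must go into a non-default stream, otherwise it is
    //    serialized with kernel launches in stream 0.
    cudaStream_t copyStream;
    cudaStreamCreate(&copyStream);
    cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, copyStream);
    // ... kernels launched in another stream could overlap with this copy ...

    cudaDeviceSynchronize();
    cudaStreamDestroy(copyStream);
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return 0;
}

If any of those conditions is missing, or the kernel simply needs the data being copied before it can start, the async version ends up no faster than the blocking one.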

Sorry for the late response, but I’ve heard about an NVCC feature that could explain this.

Indeed, NVCC optimizes your code, so do you think NVCC might have packed all your data for you?
