PCIe impact: give some examples of how PCIe affects your applications

Hi all,

I’m working on a project that aims to make CPU→GPU and GPU→CPU memory transfers faster and easier to manage in user source code.

To quantify the potential gains of this project, I would appreciate your help: please share your experiences with data-communication problems.

I’m trying to determine whether this is still a real problem, or whether it has been largely eliminated by newer GPU features such as transfer/compute overlapping, and whether it is confined to a few very specific applications.

For example, could you answer the following:

  • What domain was the program from?
  • Was it an industrial application?
  • Were you able to overlap and hide all data communications?
  • How much time did the GPGPU project take you compared to the CPU version?

If you didn’t hide all the data communications:

  • How much did data communications affect the performance of your application?
  • What would the gains be if the need for transfers were completely removed?
  • Were the communications a bottleneck?

If you have a real example of your own, or a paper dealing with this subject, any help would be greatly appreciated.

Thank you all.

I looked into this a little bit at one point. The conclusion was basically that most existing CUDA applications are written such that they can tolerate the latency and bandwidth limitations of PCIe. Furthermore, the bulk-synchronous programming model encourages developers to offload very heavyweight tasks that have a lot of potential for latency hiding. The result is that existing CUDA applications would see no benefit from improving the bandwidth/latency of PCIe.

I wrote up some of the findings informally here: http://www.gdiamos.net/papers/cudaLatency.pdf

Of course there may also be some applications that are simply not ported to CUDA because they are latency sensitive, and I haven’t come up with a good way of studying these without actually going out and writing them by hand…

Thank you for your response.

I’m very interested in your study on this.

Is there any way you could share the source code, the detailed models, or the modifications you made for this?

I’ve often heard GPGPU described as “a technological demonstration, more a very specific accelerator than a reliable industrial solution”.

Maybe you’re right, and the reason is that the applications offloaded today were chosen precisely because they aren’t PCIe bandwidth/latency dependent.

But I really want to know whether you or anybody here has had bad experiences with PCI Express.

Did you write any by hand?

As an anecdotal point, I had a problem when loading geometric models of over 50 million vertices into the GPU, simply because of the limited device memory.
I experimented with various compression and compaction methods. One of the last methods I tried was the simplest… leave the whole database in host memory and access it via zero-copy PCIe. I expected that it’d be horrifically slow, since the data is accessed so intensely… but was pleasantly shocked to find that on Fermi, speeds were barely affected (less than 1%). On GT200, speeds were hit by about 35% (still far far better than I expected.)

I now store my base geometry data on the host and use zero copy even when there’s room on the device! There was no penalty so why not make the robust method the default?
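
For reference, the whole zero-copy setup is only a few lines. Here’s a minimal sketch with illustrative names, sizes, and a dummy kernel standing in for my real geometry work:

// Minimal zero-copy sketch (illustrative names, sizes, and kernel).
// The vertex buffer is pinned + mapped host memory, so the kernel reads it
// directly over PCIe instead of from a copy in device memory.
#include <cuda_runtime.h>

__global__ void touchVerts(const float4* verts, float* out, size_t n) {
    // Grid-stride loop so the launch stays within old grid-size limits.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = verts[i].x + verts[i].y + verts[i].z;   // placeholder for real geometry work
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);         // must come before context creation on older toolkits

    const size_t n = 50 * 1000 * 1000;             // ~50M vertices, ~800 MB as float4
    float4* hostVerts = 0;
    cudaHostAlloc((void**)&hostVerts, n * sizeof(float4), cudaHostAllocMapped);
    // ... fill hostVerts with the model ...

    float4* devVerts = 0;                          // device-side alias of the same host buffer
    cudaHostGetDevicePointer((void**)&devVerts, hostVerts, 0);

    float* devOut = 0;
    cudaMalloc((void**)&devOut, n * sizeof(float));

    touchVerts<<<1024, 256>>>(devVerts, devOut, n);
    cudaDeviceSynchronize();

    cudaFree(devOut);
    cudaFreeHost(hostVerts);
    return 0;
}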

I give credit to NV’s thread-level parallelism design, which can hide such outrageous latencies (likely multiple thousands of clocks for PCIe reads).
And further credit to GF100’s design, specifically the on-chip caches that effectively eliminated the latencies altogether.

Of course of course of course you can have tasks which are completely limited by PCIe, and those tasks may even be common. But I haven’t hit one yet.

I was just talking at GTC with Mr. Anderson about HOOMD’s multi-GPU support, where PCIe communication was a limiting factor in coordinating work.

There are also plenty of common examples of trivial problems, like adding two giant vectors, which are PCIe limited and are often one of the first CUDA programs people try. Each time a new programmer does that, they are left with an immediate negative impression, since their GPU code is slower than the same code on the CPU purely because of these overheads.
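
As a rough illustration of that first-impression effect, here’s the kind of timing breakdown I mean (sizes are illustrative, and the commentary is a ballpark expectation rather than a measurement):

// Sketch of why plain vector addition is PCIe-bound; the point is the ratio
// between transfer time and kernel time, not the exact numbers.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void add(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];              // one add per 12 bytes of bus traffic
}

int main() {
    const int n = 1 << 26;                               // 64M floats, 256 MB per vector
    const size_t bytes = (size_t)n * sizeof(float);

    float* ha = (float*)malloc(bytes);
    float* hb = (float*)malloc(bytes);
    float* hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0, 0);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // 256 MB over PCIe
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);   // 256 MB over PCIe
    cudaEventRecord(t1, 0);
    add<<<1024, 256>>>(da, db, dc, n);
    cudaEventRecord(t2, 0);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // 256 MB back
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);

    float h2d, kernel, d2h;
    cudaEventElapsedTime(&h2d, t0, t1);
    cudaEventElapsedTime(&kernel, t1, t2);
    cudaEventElapsedTime(&d2h, t2, t3);
    // The two transfer times typically dwarf the kernel time, which is why this
    // "hello world" leaves such a bad first impression.
    printf("H2D %.1f ms  kernel %.1f ms  D2H %.1f ms\n", h2d, kernel, d2h);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}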

The models in this paper used Ocelot to capture and timestamp all of the transactions between the CUDA runtime and the GPU device, and then replay them in a network simulator.

The trace capturing function in Ocelot is defined here: http://code.google.com/p/gpuocelot/source/…ngCudaRuntime.h . I haven’t tried it in about a year, but ideally you should be able to use it to capture a trace of all messages that the CUDA runtime sends to the GPU and how much slack (CPU time) you have between each call.

I personally haven’t. I usually experiment with CUDA benchmarks such as Parboil and Rodinia, but obviously these aren’t right for this kind of study. The only applications that I end up writing deal with specific projects that my group is working on; none of them so far have been PCIe bound, and they can usually be offloaded completely to the GPU.

You might also want to take a look at this: http://www.ece.ubc.ca/~aamodt/papers/hwong.wddd2009.pdf . It is a limit study of porting existing CPU benchmarks to a CPU/GPU system without modification, assuming that you could arbitrarily move all live data to the GPU, do some computation, and move all results back. They show that the right place to split the program changes with the interconnect performance. However, this approach does not consider rewriting the existing applications.
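
As a crude way to think about that result (my paraphrase of the trade-off, not the exact model in the paper): offloading a region only pays off when t_cpu > bytes_in/BW + t_gpu + bytes_out/BW, so as the interconnect bandwidth BW drops, the profitable split points move toward larger, coarser-grained regions.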

Just my 2 cents:

How good or bad PCIe latency is depends on how much time you spend crunching data on the GPU. I have seen the PCIe issue pop up in “real-time” applications, where frames of data keep arriving: you move a frame to the GPU, compute on it, and move it back again… That’s where it gets dirty.
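
When the frames are independent, the usual trick is to double-buffer so the upload of frame i+1 overlaps with the processing of frame i. A minimal sketch (names, sizes, and the per-byte work are illustrative):

// Double-buffered frame pipeline sketch: overlap the transfers for one frame
// with the kernel working on the other. Names and sizes are illustrative.
#include <cuda_runtime.h>

__global__ void processFrame(const unsigned char* in, unsigned char* out, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = 255 - in[i];                       // placeholder per-byte work
}

int main() {
    const int frameBytes = 1920 * 1080 * 4;
    const int numFrames  = 100;

    unsigned char *hIn[2], *hOut[2], *dIn[2], *dOut[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        // Pinned host buffers: required for cudaMemcpyAsync to actually overlap.
        cudaHostAlloc((void**)&hIn[b],  frameBytes, cudaHostAllocDefault);
        cudaHostAlloc((void**)&hOut[b], frameBytes, cudaHostAllocDefault);
        cudaMalloc((void**)&dIn[b],  frameBytes);
        cudaMalloc((void**)&dOut[b], frameBytes);
        cudaStreamCreate(&stream[b]);
    }

    for (int f = 0; f < numFrames; ++f) {
        int b = f & 1;                              // ping-pong between the two buffers
        cudaStreamSynchronize(stream[b]);           // wait until buffer b is free again
        // ... copy the next incoming frame into hIn[b], consume hOut[b] ...
        cudaMemcpyAsync(dIn[b], hIn[b], frameBytes, cudaMemcpyHostToDevice, stream[b]);
        processFrame<<<1024, 256, 0, stream[b]>>>(dIn[b], dOut[b], frameBytes);
        cudaMemcpyAsync(hOut[b], dOut[b], frameBytes, cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();                        // drain both streams (cleanup omitted)
    return 0;
}

Even then it only helps when the kernel time is comparable to the transfer time; if the GPU does very little work per frame, the bus still dominates.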

Like SPWorley, I used zero copy in a computation-heavy kernel with great success:
In a raytracer whose kernel runs for about 15 ms per call, I can upload two 800x600x4-byte images basically “for free” (no measurable time increase).

I was less successful when trying asynchronous memcpy for large data transfers (it was never faster than the blocking memcpy, and always caused a measurable slowdown).
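
For what it’s worth, my understanding is that an async memcpy only genuinely overlaps when a few conditions are met, and I may simply have missed one of them. A minimal checklist-style sketch (buffer names are made up):

// Conditions for cudaMemcpyAsync to genuinely overlap with kernel work
// (as far as I understand them); buffer names are made up for this sketch.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // 1. The device must be able to overlap copies with kernels at all.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("deviceOverlap = %d\n", prop.deviceOverlap);

    // 2. The host buffer must be page-locked; with plain malloc() memory,
    //    cudaMemcpyAsync is effectively not asynchronous.
    const size_t bytes = 64 << 20;
    float* hBuf = 0;
    float* dBuf = 0;
    cudaHostAlloc((void**)&hBuf, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&dBuf, bytes);

    // 3. The copy must go into a non-default stream, otherwise it is
    //    serialized with kernel launches in stream 0.
    cudaStream_t copyStream;
    cudaStreamCreate(&copyStream);
    cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, copyStream);
    // ... kernels launched in another stream could overlap with this copy ...

    cudaDeviceSynchronize();
    cudaStreamDestroy(copyStream);
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return 0;
}

If any of those conditions is missing, or the kernel simply needs the data being copied before it can start, the async version ends up no faster than the blocking one.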

Sorry for the late response, but I’ve heard about an NVCC feature that could explain this.

Indeed, NVCC optimizes your code, so do you think NVCC might have packed all your data for you?
