Petrel with Nvidia Grid K2

Dear All,

Would you please provide some help on Petrel-specific tuning using the GRID K2? I’m having a problem with GPU pass-through on XenDesktop 7 with large interpreted Petrel files. We have 48 GB of memory and a dedicated 8-core 2.6 GHz E5-2650 v2 CPU. During usage we see that there are still enough resources, and the network is fine.

We pinned all CPU cores to the same socket. We are fine with smaller interpreted files, but with large files it performs worse than a simple 3.2 GHz 4-core workstation with a basic NVIDIA FX320 card.

We know that Petrel does not scale well with cores, so 3.2 GHz may be better than 2.6 GHz, but we do not see any specific core peaking at 100%.

I saw some success stories at the 2013 GPU Technology Conference. If anybody can supply us with specific Petrel & Citrix settings, we would greatly appreciate it.

http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpu-accelerated-xendesktop-beyond-3d-designers.pdf

thanks

Serdar

Turkish Oil and Gas

Hi, Serdar:
The first thing I would look at, if the CPU is not the limitation, is how your data are being delivered with those large files. Is your disk I/O perhaps the limiting factor? Also, what are you using for the client – some sort of PC, a thin client, etc.? With large frames that can also be a bottleneck.

Tobias makes a great point, especially given that it works well with smaller files. Serdar, you said networking is fine but are you separating storage from other network traffic and can you see any issues there? Using jumbo frames at any point? Can you bring the large files adjacent to the XD deployment to see if that changes anything?

Yes, the storage connection – in addition to the implementation of the storage itself, including disk RPMs, SATA vs. SAS, type of RAID configuration, number of spindles, and connection type (iSCSI, HBA, NFS, etc.) – can all make a difference, as can of course how many VMs share the same SR and how the SR is carved out of a LUN. With jumbo frames I honestly have never seen more than perhaps a 10–15% improvement, but if they can be implemented, they’re certainly a bonus. As mentioned above, isolating the storage network from all others is very important.
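The spindle-count and RAID points above can be turned into a rough capacity number. A sketch of the standard back-of-the-envelope calculation – the per-disk IOPS figures and RAID write penalties are textbook rules of thumb, not measurements of any particular array:

```python
# Back-end I/Os generated per host write, by RAID level (textbook approximations).
WRITE_PENALTY = {0: 1, 1: 2, 5: 4, 6: 6, 10: 2}

def array_iops(spindles, per_disk_iops, raid_level, read_fraction):
    """Approximate host-visible random IOPS for a given read/write mix."""
    penalty = WRITE_PENALTY[raid_level]
    raw = spindles * per_disk_iops
    # Each host write costs `penalty` back-end I/Os; each read costs one.
    return raw / (read_fraction + (1 - read_fraction) * penalty)

# Example: 24 x 10k-RPM SAS disks (~140 IOPS each) in RAID 5, 70% reads:
# array_iops(24, 140, 5, 0.7) comes out to roughly 1,770 host IOPS.
```

The takeaway is that RAID 5 is noticeably punishing for write-heavy mixes, which is one reason spindle count and RAID level both matter when sizing an SR.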

Hi Serdar,

If you are still experiencing difficulty let us know. We’ve had great success in deploying NVIDIA GRID with XenDesktop for SLB Petrel over the past couple of years. Even with a ~2TB data-set (once I’d got it off the portable USB drive at about 25MB/s…). There will be a case study shortly on what we did to get it all running smoothly with HDX3DPro. Meanwhile there’s a snippet here:

http://360is.com/performance.html

The geologists were delighted.
As Tobias says there can be so many things that might be the cause of poor performance, and we deal with them all.

N.

I hate to say it, but we still use direct I/O (via open-iSCSI) for our more critical storage needs, where speed is essential. By bypassing the SR mechanism altogether you can get up to 6x the performance. I have not tested this yet on a Creedence alpha platform, but I suspect that with all the improvements made in Creedence I could get some impressive stats – maybe still not as good as bare metal or Brand X, but still way better than the “stock” system. While I have (obviously) not been able to get Storage XenMotion to work with the open-iSCSI mechanism, XenMotion itself does work. Maybe that isn’t even relevant, since at least for a VM tied to a vGPU, XenMotion is not currently functional anyway.

P.S. Hey, 360: Love your Web site and tongue-in-cheek comments interspersed all over the place!

Hi Tobias,

We’ve been working with XenServer since about 2006, so we are aware of the long journey those in search of performance have taken. Storage performance, particularly shared storage performance, needs work. What I really want to see is InfiniBand/SRP for storage so we can dump TCP/IP entirely… Maybe a client will pay us to work on it!

For what it’s worth, we’ve done a few NVIDIA/vGPU projects now, and they have worked pretty well in terms of both performance and stability, so long as the client has done their homework on hardware or allows us to specify it all.
Glad you like the new web site, it is still a work in progress!

Hi,

My company is a systems integrator in Ghana. We have just completed the deployment of an XD solution for a small oil-services company. However, Petrel performs badly with large files.

Our setup consists of:

HP DL380z G8 server
XenServer 6.2.2
4x GRID K2 GPUs (in pass-through mode)
Windows 7 SP1 64-bit
8 physical CPU cores per VM
64 GB RAM per VM
EMC SAN (we record 130 MB/s when transferring files between VMs)
XD 7.5
NVIDIA drivers on the VMs

As a result of the bad Petrel performance, we decided to change the hypervisor to VMware in case that was the problem; however, the performance issue persists.

Local Schlumberger engineers (who, by the way, are not enthused with our ‘virtualization’ approach) have just communicated to us that the GRID K2 is not supported by Petrel 2014. They have requested we use a Quadro K5000 or K6000. We have now placed an order for two K6000 cards with the intention of replacing the GRID K2 cards, but somehow I get the feeling the problem could be something else.

I would appreciate any suggestions on the matter.

Felix

I doubt the problem is with the K2 cards, especially since you are using pass-through mode, where Petrel can access all the GPU memory. Petrel with GRID is quite a new technology, so I can understand the Schlumberger guys being nervous about it; they have a lot more Quadro (also pass-through) sites.

It is hard to make helpful suggestions from just the information supplied. How bad is the performance versus a physical workstation with similar storage/network/CPU?

Are we talking 2x slower, 10x slower, more?

Hi threesixty,

Thanks for your answer.

I’m working on the same project with Felix. We ran the same data on a laptop that has an NVIDIA Quadro card, and it loaded the data about 10x faster. I would also like to mention that this is happening with seismic data (4 GB). We have good network storage on its own dedicated VLAN, and we also tried running from local drives, which didn’t make any difference.

What are we missing here?

Elom

Also, at the client side we are running Citrix Receiver on an HP EliteDesk 800 Series (i5 CPU, 4 GB RAM) connected to a 30" screen at a resolution of 2560 x 1600.

For the imaging technology of the VMs we use Machine Creation Services (MCS).

What are the IOPS for the respective implementations?

It’s possible that the storage fabric is introducing a bottleneck that manifests as slow loading of data.

Hi Jason,

Thanks for your response.

I checked the IOPS on the storage and attached the images below. The hypervisor accesses the storage through NFS and is able to handle peaks above 1000 IO/s. The file system itself also handles peaks above 400 IO/s. The network is also handling more than 22,000 packets/s.

The storage is a VNXe 3200 with SAS drives (25 x 600 GB, 10K RPM) in RAID 5; jumbo frames are disabled.

Note that when copying files from the storage, Windows reports speeds of up to 130 MB/s.

Do you think it is not enough? What should we do to improve?
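One sanity check worth doing with these numbers: at the throughput Windows reports, loading a data set of known size has a hard lower bound, and if Petrel takes far longer than that bound, the array itself is probably not the culprit. A small sketch (decimal units assumed throughout):

```python
def min_load_seconds(dataset_gb, throughput_mb_s):
    """Lower bound on load time given sustained sequential throughput.

    Uses decimal units: 1 GB = 1000 MB.
    """
    return dataset_gb * 1000 / throughput_mb_s

# A 4 GB seismic volume at 130 MB/s cannot load in less than ~31 seconds.
# If the virtualized load takes many minutes while a laptop is 10x faster,
# the gap is likely in the I/O path or rendering pipeline, not raw array speed.
```

Comparing the measured load time against this floor helps separate “storage is slow” from “something between the VM and the storage is slow.”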

NFS Throughput:

Network Throughput:

File System Throughput:

Are those results from inside the VM?

This is a good paper on what we’d be looking at in the VM as a potential bottleneck, especially when it comes to accessing large amounts of data.

http://www.atlantiscomputing.com/downloads/Windows_7_IOPS_for_VDI_a_Deep_Dive_1_0.pdf

Hi,

I’ve been working on a similar setup, but using VMware with direct-attached GPUs.

Our configuration is a 4:1 consolidation of VMs to physical servers:

6 cores per VM @ 3 GHz per core
48 GB memory per VM
Dedicated GRID K2 core per VM
Storage on a local FusionIO IOscale2 SSD

We also have it configured with two VMs per 10 GbE adapter through a vSwitch.

We use:

ESXi 5.5 U1
VMware Horizon 6.0.0
APEX 2800 offload card by Teradici, to offload PCoIP sessions to hardware

We benchmarked Petrel 2013.7 on the virtual machines against a T7610 workstation on the same storage across 10 GbE…

Seismic prefetch of a 20 GB (uncompressed) ZGY cube took:

84.3 seconds on the T7610 (approx. 235 MB/s)
92.5 seconds on the virtual machine (approx. 225 MB/s)

We didn’t test data sets higher than 20GB.

This is based on our VMDKs sitting on a FusionIO IOscale2 1.6 TB card with 10 GbE to the storage.

For the clients we used Dell Wyse P45 zero clients on a 1 GbE Base-T connection.

All drivers were at their latest versions and this was on Windows 7.

Storage was NetApp.

We also tested seismic compression in Petrel 2013.7, compressing 35 GB of ZGY at 45 dB, with the following results:

T7610 (16 cores @ 3.4 GHz, 192 GB memory, K4000): file creation time 281 s, CPU cores maxed out
Virtual workstation (6 cores @ 3 GHz, 48 GB memory, GRID K2): file creation time 663 s, CPU cores maxed out
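Those two compression runs can be normalized by raw compute to see how much of the gap is simply core count and clock speed. A sketch of the arithmetic – it assumes compression throughput scales with cores times clock, which (as noted earlier in the thread) Petrel only partially does:

```python
def core_ghz(cores, ghz):
    """Crude raw-compute proxy: core count times clock speed."""
    return cores * ghz

workstation = core_ghz(16, 3.4)  # 54.4 core-GHz, finished in 281 s
virtual     = core_ghz(6, 3.0)   # 18.0 core-GHz, finished in 663 s

compute_ratio = workstation / virtual  # ~3.0x more raw compute on the T7610
time_ratio = 663 / 281                 # ~2.4x faster wall-clock in practice

# The workstation's actual speedup (~2.4x) is smaller than its raw compute
# advantage (~3.0x), so per core-GHz the virtual machine got slightly more
# work done - the virtualization overhead is not the dominant factor here.
```

In other words, for the CPU-bound compression test the gap is mostly explained by the hardware difference, not by virtualization itself.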

I hope this gives you some baseline figures on what to expect. I’d be interested to know how XenServer and HDX perform in comparison.

There was no tweaking done to obtain these figures – just the standard off-the-shelf configuration for VMware Horizon.

I’ll say this: Petrel is not certified to run in a virtual environment, so support from Schlumberger would at best be on a best-effort basis.

If you’d like to talk more around the work we’ve done, send me a private message with your e-mail address.

Thanks
J

Hello Elom,

Have you tested PVS instead of MCS, with "cache in device RAM with hard disk overflow"? PVS is included in XenDesktop Enterprise/Platinum and works well with both XenServer and vSphere (5.5 Update 1).

You can read more about the new caching mode in this blog series:
http://blogs.citrix.com/2014/04/18/turbo-charging-your-iops-with-the-new-pvs-cache-in-ram-with-disk-overflow-feature-part-one/

It would be interesting to see the results you would get.