VDI click-to-photon with Raspberry Pi


I have click-to-photon results for my Raspberry Pi thin-client platform, inspired by #GRIDDays (see the blogs at https://gridforums.nvidia.com/default/topic/734/general-discussion/-griddays/).

Attention: The presented solution IS NOT Citrix HDX Ready Pi, ThinOS, ThinLinX, TLXOS, RPiTC, VTware, NoMachine, BerryTerminal, RDP, VNC, PARSEC … It is a raw h264 stream over the RTP/UDP protocol, displayed without an Xorg server (i.e. through OpenMAX directly), plus the usbredir protocol over TCP/IP; see below.

My best click-to-photon result is 60 ms (95 ms with the Aero compositor enabled).

The VDI runs at 1280x1024@30, i.e. the encoder receives a frame every 33 ms (7); HDMI runs at 1280x1024@60, i.e. a vsynced frame is sent over HDMI every 16 ms (13). The worst-case combined “vsync” wait is therefore 49 ms (= 33 + 16), so results of 60-109 ms are expected, and that is what I measured. I used an “HP LP1965” monitor with a measured input lag of about 5 ms (http://www.svethardware.cz/recenze-hp-lp1965-legenda-pokracuje/15859-3). The monitor has an embedded high-power USB hub that also powers the thin client (the Raspberry Pi and its USB peripherals; the Raspberry Pi is attached to the monitor stand) and the audio speakers.
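The arithmetic behind the expected 60-109 ms window can be written down as a minimal sketch. The function names are made up for illustration, and the frame periods are truncated to whole milliseconds the same way the post rounds them.

```c
/* Frame period in whole milliseconds, truncated as in the post
 * (1000/30 -> 33 ms, 1000/60 -> 16 ms). */
int frame_period_ms(int fps)
{
    return 1000 / fps;
}

/* Worst case: the click just misses the encoder frame (7) and the
 * decoded picture then just misses the HDMI vsync (13). */
int worst_vsync_wait_ms(int encoder_fps, int hdmi_fps)
{
    return frame_period_ms(encoder_fps) + frame_period_ms(hdmi_fps);
}

/* Upper end of the expected click-to-photon window, given the best
 * measured result (no vsync wait at all). */
int worst_click_to_photon_ms(int best_ms, int encoder_fps, int hdmi_fps)
{
    return best_ms + worst_vsync_wait_ms(encoder_fps, hdmi_fps);
}
```

With the numbers from this post, worst_vsync_wait_ms(30, 60) gives 49 and worst_click_to_photon_ms(60, 30, 60) gives 109.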

Some comments:

(1) I soldered wires directly to the left mouse button and attached them to my oscilloscope.

(2-4) The usbredir protocol is open source. It transparently connects any USB device to the remote VM (qemu emulates UHCI/EHCI USB 2).

(5) I used the PowerPoint black/white color button to change the slide background.

(6) There is NO additional VDI agent software installed in the DomU (only the vGPU driver). I am testing not only Windows but also Linux (CentOS, Debian and SteamOS) with the K1/K2 backend. There is no additional CPU or GPU load in the DomU (including NVENC) from VDI agent software! There are also no software collisions in the DomU (see some of the crash stories) and no DomU dependencies or driver incompatibilities (see the NVIFR/NVFBC problems).

(7) The rendered framebuffer lives in a PCI memory region shared with Dom0. It is available not only for accelerated output from k###q but also for the emulated VGA, which eases installation and maintenance of any OS (including the Windows recovery states). There are many NVidia bugs that have been unresolved for more than a year (such as DX11 fullscreen, out-of-order frame delivery, mouse pointer …).

(8-10) The H264 is encoded on a Maxwell Gen1 card (K2200), which is 3x faster than Kepler; the K2200 offers the best price/performance (5x 1280x1024@40 generated ~33% NVENC load according to "nvidia-smi pmon"). A separate encoder domain would not be needed if NVidia disclosed an API for direct access to NVENC without CUDA (CUDA has not worked in Dom0 for more than 5 years). The multithreaded "stream" executable is about 28 kB on disk (VmRSS 25 MB + GPU 25 MB, and 240 MB of GPU memory per encoded h264 stream) and no other processes are needed. Inter-domain sharing is based on xenstore.h, xen/evtchn.h, xen/gntalloc.h and xen/gntdev.h.

(11) The standard RTP/UDP protocol is used to transport the h264 video stream (and the audio PCM stream in a separate channel) to minimize software overhead. The Linux HTB queuing discipline shapes traffic to the Raspberry Pi (it has only 100 Mbit Ethernet input) on a separate secured VLAN. I also tested an OpenVPN tunnel to secure the VDI channel (not included in the benchmark).

(12-13) The Raspberry Pi OpenMAX IL modules (with accelerated h264) are used to decode and output video (and audio if supported) to HDMI. The peak load on the Raspberry Pi 2 is under 30%. The multithreaded "thin" executable is about 110 kB on disk (VmSize 165kB) and no other processes are needed (i.e. no Xorg, no window manager … only ssh for remote monitoring).

(14) Finally, the photon is emitted to the phototransistor, which is connected to my oscilloscope.
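For readers unfamiliar with the RTP transport mentioned in comment (11), here is a minimal sketch of parsing the 12-byte fixed RTP header as defined in RFC 3550. It is not taken from the "stream" or "thin" executables; the struct and function names are made up for illustration, and header extensions are deliberately not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields of the fixed RTP header (RFC 3550, section 5.1). */
struct rtp_header {
    uint8_t  version;       /* always 2 for RTP */
    uint8_t  padding;       /* 1 if padding bytes follow the payload */
    uint8_t  extension;     /* 1 if a header extension follows */
    uint8_t  csrc_count;    /* number of 32-bit CSRC entries */
    uint8_t  marker;        /* for video usually set on the last packet of a frame */
    uint8_t  payload_type;  /* dynamic types (96-127) are typical for H.264 */
    uint16_t sequence;      /* increments by one per packet */
    uint32_t timestamp;     /* media clock */
    uint32_t ssrc;          /* stream source identifier */
};

/* Parse the fixed header. Returns the offset of the first byte after the
 * CSRC list, or -1 if the packet is too short or not RTP version 2.
 * An optional header extension (extension == 1) is NOT skipped here. */
int rtp_parse(const uint8_t *buf, size_t len, struct rtp_header *h)
{
    if (buf == NULL || h == NULL || len < 12)
        return -1;

    h->version      = buf[0] >> 6;
    h->padding      = (buf[0] >> 5) & 1;
    h->extension    = (buf[0] >> 4) & 1;
    h->csrc_count   = buf[0] & 0x0f;
    h->marker       = buf[1] >> 7;
    h->payload_type = buf[1] & 0x7f;
    h->sequence     = (uint16_t)((buf[2] << 8) | buf[3]);
    h->timestamp    = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
                      ((uint32_t)buf[6] << 8)  |  (uint32_t)buf[7];
    h->ssrc         = ((uint32_t)buf[8] << 24) | ((uint32_t)buf[9] << 16) |
                      ((uint32_t)buf[10] << 8) |  (uint32_t)buf[11];

    if (h->version != 2)
        return -1;

    size_t offset = 12 + 4u * h->csrc_count;
    if (len < offset)
        return -1;
    return (int)offset;
}
```

A receiver can use the sequence number to detect lost packets and the marker bit to know when a complete frame can be handed to the decoder; whether the thin client does exactly that is an assumption.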


Nice work Martin - can you send me a private message with your email address?

Hi Martin,

A colleague passed these on and I thought you’d find interesting!

(it’s a wall of text, scroll down to the “Audio Video Sync” Section, and under there find “Method 2” – it should look familiar – it goes back to 2008 when BBC HD first started broadcasting).

measuring games console latency (where it’s a closed box that you can’t modify at a code level)…

Best wishes,

you might also like https://www.citrix.com/blogs/2014/07/30/introduction-to-video-capture-hardware-for-hdx-product-demonstrations/


The Raspberry Pi thin client can measure click-to-photon with its own digital inputs, without an oscilloscope.
I attached the mouse button (verified to be at 3.3 V and on the same ground, through a 680 Ohm safety resistor) to GPIO1 (=BCM GPIO 18) and the phototransistor collector (BPW42 with an 82k Ohm pull-up to 3.3 V) to GPIO2 (=BCM GPIO 27). Here is a simple measurement program:

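// compile:  gcc oscillo.c -o oscillo -lwiringPi -lpthread

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <wiringPi.h>

volatile int ts = 0;		// start of running measure in ms (0 = idle)
volatile int ts_min = 0;	// time of button press in ms (0 = not pressed)

void pin_mousebutton(void)
{
	int _milli = millis(); 		// measure ASAP
	if (ts) return;			// already started measure
	if (digitalRead(1) == LOW) {	// falling edge (pressed button)
		ts_min = _milli;	// start minimum press interval
		return;			// (line lost in the post, assumed) wait for release
	} else {			// rising edge (released button)
		if (!ts_min || ((_milli - ts_min)<20)) {
			// too short pressed (switching contact glitch), restart
			ts_min = 0;	// (lines lost in the post, assumed)
			return;
		}
	}
	ts = _milli;			// start measure
	struct timespec time;
	clock_gettime(CLOCK_REALTIME, &time);
	// The last printf argument was cut off in the post. Assumption: the
	// mark is derived from the current phototransistor level (collector
	// LOW = light = white screen, so the next transition is to black).
	printf("%d.%09ld %c ", (int)time.tv_sec, time.tv_nsec,
		(digitalRead(2) == LOW) ? 'B' : 'W');
}

void pin_phototransistor(void)
{
	int _milli = millis(); 		// measure ASAP
	if (!ts) return;		// measure not started
	printf (" = %d\n", _milli-ts);
	ts = ts_min = 0;		// mark measure ended
}

int main(void)
{
	if (wiringPiSetup() == -1) return 1;	// use wiringPi pin numbering

	pinMode (0, OUTPUT);	// LED output with 270 Ohm resistor
	pinMode (1, INPUT);	// mouse button connected over 680 Ohm resistor
	pinMode (2, INPUT);	// phototransistor collector with 82k Ohm pullup
	pullUpDnControl (1, PUD_OFF) ;
	pullUpDnControl (2, PUD_OFF) ;

	wiringPiISR (1, INT_EDGE_BOTH,  pin_mousebutton) ;
	wiringPiISR (2, INT_EDGE_BOTH,  pin_phototransistor) ;

	while (1) {		// simple "live" blinking led
		digitalWrite(0, HIGH); 
		delay(500);	// (delays lost in the post, period assumed)
		digitalWrite(0, LOW); 
		delay(500);
	}
	return 0;
}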

When I compare the Raspberry Pi results with the oscilloscope measurements, some minor corrections are needed because of the different oscilloscope measurement point and the asymmetric transition delay of the LCD crystals: for the white->black transition (marked B) add 2 ms to the raw result; for the black->white transition (marked W) subtract 5 ms. Raw results (without corrections) for the PowerPoint black/white color switch:

1459726772.711643608 B  = 75
1459726773.115839450 W  = 83
1459726773.552274891 B  = 67
1459726773.961169842 W  = 105
1459726774.489215596 B  = 80
1459726774.938764094 W  = 110
1459726775.366921888 B  = 69
1459726776.041165808 W  = 90
1459726776.536406859 B  = 82
1459726776.924494699 W  = 74
1459726777.323792682 B  = 95
1459726777.668361717 W  = 97
1459726778.069624802 B  = 65
1459726778.428420331 W  = 119
1459726778.819065407 B  = 82
1459726779.948055060 W  = 83
1459726780.352381475 B  = 65
1459726780.725309123 W  = 106
1459726781.114302847 B  = 69
1459726817.140396398 W  = 111
1459726818.712987579 B  = 75
1459726819.157220147 W  = 76
1459726819.617433373 B  = 70
1459726819.995031902 W  = 106
1459726820.472563234 B  = 98
1459726820.849051088 W  = 101
1459726821.311636447 B  = 59
1459726821.704164647 W  = 112
1459726822.142001337 B  = 79
1459726887.312982720 W  = 67
1459726887.789655512 B  = 76
1459726888.303486335 W  = 108
1459726888.777318557 B  = 71
1459726889.264999723 W  = 63
1459726889.739609132 B  = 108
1459726890.148505907 W  = 113
1459726890.552738157 B  = 78
1459726891.067034083 W  = 94
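The corrections described above can be applied to the raw log mechanically. The following is a minimal sketch, assuming the log lines keep exactly the format shown (timestamp, mark, " = ", milliseconds); the function names are made up for illustration.

```c
#include <stdio.h>

/* Apply the oscilloscope-derived correction from the post:
 * B (white->black) adds 2 ms, W (black->white) subtracts 5 ms. */
int corrected_ms(char mark, int raw_ms)
{
    if (mark == 'B') return raw_ms + 2;
    if (mark == 'W') return raw_ms - 5;
    return raw_ms;  /* unknown mark: leave the value unchanged */
}

/* Parse one log line such as "1459726772.711643608 B  = 75".
 * The timestamp is skipped; returns 0 on success, -1 on a malformed line. */
int parse_sample(const char *line, char *mark, int *raw_ms)
{
    int fields = sscanf(line, "%*[0-9].%*[0-9] %c = %d", mark, raw_ms);
    return (fields == 2) ? 0 : -1;
}
```

For the first two samples this turns the raw 75 ms (B) into 77 ms and the raw 83 ms (W) into 78 ms.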


I finally did a deep-dive measurement of a single frame for VDI click-to-photon with the Raspberry Pi (RPI).
I added real-time timestamps to the Dom0 vgpu encoder bridge, used tcpdump to capture packets on both the RPI and the XEN Dom0, and ran the RPI click-to-photon measurement program. Here is a sample result:

Some comments:

  • The numbered circles match those in the first post.
  • The VDI runs at 1280x1024@30, i.e. the encoder receives a frame every 33 ms (B) (7); HDMI runs at 1280x1024@60, i.e. a vsynced frame is sent over HDMI every 16 ms (D) (13).
  • The real-time timestamps (RPI and XEN Dom0) are shown at ms resolution with the seconds omitted (see the raw data). The RPI and XEN Dom0 real-time clocks differ by ~ +3 ms.
  • I am happy with the encoder (8)-(10). It is usually very fast; I measured somewhat more delay with large picture changes.
  • There is no additional network delay; I used only one switch.
  • I see a challenge in the USB Dom0<->DomU emulation (A). The emulation uses many Dom0 CPU cycles (up to ~20% CPU). There are also problems with USB isochronous (ISO) transfers (for example webcams). The Xen community is discussing extending USB (UHCI/EHCI/XHCI) with paravirtualized drivers (DomU additions), as already exist for disk and network.
  • Another delay is generated inside the VC4 acceleration coprocessor (C) (I guess ~15 ms). Better observability and further tuning are being discussed in the RPI community.
  • The actual "VSYNC" delays for this sample were estimated from an analysis of other samples (DomU->Dom0 delay (B) ~2 ms and RPI->LCD delay (D) ~12 ms).
  • Raw data for analysis:
    RPI click-to-photon measurement program (1) (13):
         1459803325.201227677 B  = 74
    vgpu encoder bridge in Dom0 (7) (8) (10):
    vgpu-7[7726]: surface_update: received new frame id 12522 1459803325.234834490
    vgpu-7[7726]: nvenc_encode: copy to encoder frame id 12522 1459803325.236695397
    vgpu-7[7726]: nvenc_thread: copy from encoder frame id 12522 1459803325.240962394
    Wireshark/tcpdump in RPI (3) (11):
    No.     Time                       Source                Destination           Protocol Length Info
          1 1459803325.206065          TCP      79     51857 > 20037 [PSH, ACK] Seq=1 Ack=1 Win=115 Len=25
          2 1459803325.206298        TCP      60     20037 > 51857 [ACK] Seq=1 Ack=26 Win=131 Len=0
          3 1459803325.237622        UDP      1450   Source port: 20034  Destination port: 48107
          4 1459803325.237625        UDP      152    Source port: 20034  Destination port: 48107
          5 1459803325.238319          UDP      58     Source port: 48107  Destination port: 20034
    Wireshark/tcpdump in Dom0 (3) (11):
    No.     Time                       Source                Destination           Protocol Length Info
          1 1459803325.209950          TCP      79     51857 > 20037 [PSH, ACK] Seq=1 Ack=1 Win=115 Len=25
          2 1459803325.209987        TCP      54     20037 > 51857 [ACK] Seq=1 Ack=26 Win=131 Len=0
          3 1459803325.241091        UDP      1450   Source port: 20034  Destination port: 48107
          4 1459803325.241119        UDP      152    Source port: 20034  Destination port: 48107
          5 1459803325.242197          UDP      60     Source port: 48107  Destination port: 20034
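The stage delays can be recomputed from the raw data with integer nanosecond arithmetic. This is a minimal sketch with made-up helper names; the clock offset passed to dom0_to_rpi_ns() is the rough ~3 ms estimate from the comments above (Dom0 ahead of the RPI), so any delta that crosses the two hosts is only approximate.

```c
#include <stdint.h>

/* Combine the seconds and nanoseconds parts of a log timestamp. */
int64_t to_ns(int64_t sec, int64_t nsec)
{
    return sec * 1000000000LL + nsec;
}

/* Elapsed time between two timestamps taken on the same clock. */
int64_t elapsed_ns(int64_t from_ns, int64_t until_ns)
{
    return until_ns - from_ns;
}

/* Shift a Dom0 timestamp onto the RPI clock. offset_ns is how far the
 * Dom0 clock runs ahead of the RPI clock (an estimate, not measured). */
int64_t dom0_to_rpi_ns(int64_t dom0_ns, int64_t offset_ns)
{
    return dom0_ns - offset_ns;
}
```

For frame id 12522, the time from "received new frame" to "copy from encoder" is elapsed_ns(to_ns(1459803325, 234834490), to_ns(1459803325, 240962394)), which is 6127904 ns, or about 6.1 ms. On the RPI clock alone, the usbredir packet left 4837323 ns (about 4.8 ms) after the click timestamp.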

I am looking forward to the #GTC16 replays, blogs or other disclosures about remoting protocols and their analysis:

  • S6209 - A Look at Real World Performance Capabilities of NVIDIA GRID™ 2.0 (click-to-photon @ 11min - live demo, :-) @ 19:30) http://on-demand.gputechconf.com/gtc/2016/video/S6209.html
  • S6198 - The Latest in High Performance Desktops with VMware Horizon and NVIDIA GRID™ vGPU (click-to-photon @ 35min) http://on-demand.gputechconf.com/gtc/2016/video/S6198.html
  • S6622 - Advances in Remoting Protocol Technology for 3D Graphics http://on-demand.gputechconf.com/gtc/2016/video/S6622.html
  • S6218 - TeamRGE.com - From the Fire Hose Series: Benchmarking and Scalability in Virtual Desktop Infrastructure (VDI) and Virtual Workstation Environments http://on-demand.gputechconf.com/gtc/2016/video/S6218.html
  • S6608 - Virtualize Linux 3D Applications with Citrix HDX 3D Pro http://on-demand.gputechconf.com/gtc/2016/video/S6608.html


super community contribution… bit beyond my technical skills but I know the staff at NVIDIA and customers will love this!

Really outstanding work and presentation, Martin!!! Thanks ever so much for sharing all this. BTW, brilliant idea to wire into the mouse button directly. :-)

Great work Martin best contribution from a community I have seen for a very long time.

/Thomas Poppelgaard

I had some spare time, along with some ideas and hints for better observability (https://www.raspberrypi.org/forums/viewtopic.php?f=70&t=148390), so I prepared some measurements of the video stream.
These are the collected latencies of the video stream from point (7) to (13)/(D) (see the first post). The video frame is received from the vGPU (expected every 33 ms), encoded to h264 (nvenc on a K2200 in a separate DomU), transferred to the RPI (UDP/RTP with HTB traffic shaping), decoded on the RPI (OpenMAX) and posted to HDMI (expected @ 60 FPS, vsynced every 16 ms). The following graphs show 500 frames (i.e. ~16.6 seconds @ 30 FPS) in different situations (idle 1280x1024@30, load 1280x1024@30, load 1920x1080@30).

  • I. Idle windows desktop 1280x1024@30

  • II. Idle windows desktop 1280x1024@30 (stacked latencies)

  • III. Windows with Unigine Heaven 1280x1024@30 OpenGL

  • IV. Windows with Unigine Heaven 1280x1024@30 (stacked latencies)

  • V. Windows with Unigine Heaven 1920x1080@30 OpenGL

  • VI. Windows with Unigine Heaven 1920x1080@30 (stacked latencies)

Some comments:

  • The vGPU (7) should deliver events (vmiop_dt_frame) at regular intervals (33 ms requested). The measured time between events is the "frame delta". The event delivery sometimes oscillates (a), is very unpredictable under vGPU load (g), and sometimes works fine (e).
  • The regular peaks in the measurement that repeat every 30 frames (b) are inserted H264 GOP/IDR frames (restart information for a full decoder resync). I am experimenting with the RTCP SR/RR feedback mechanism to insert a GOP/IDR on demand after a frame loss, or to repeat lost frames on RTCP NACK.
  • The H264 encoder (nvenc on K2200) is very fast and its "encode latency" (8)-(10) is constant (d) (~5 ms for 1280x1024 and ~9 ms for 1920x1080). Nvenc is set up with NV_ENC_PRESET_LOW_LATENCY_DEFAULT_GUID, NV_ENC_PARAMS_RC_CONSTQP and Q=28.
  • The H264 decoder (OpenMAX on RPI) is slower (c); its "decode latency" (12) is ~12 ms when idle at 1280x1024, ~20 ms under load at 1280x1024, and reaches the RPI limit of ~25 ms under load at 1920x1080.
  • The HDMI vsync wait, "vsync latency" (13)/(D), is measured with the RPI vc_dispmanx_vsync_callback(). The sawtooth waveform (f) comes from the different clock domains (vGPU @ ~30 FPS and HDMI @ ~60 FPS).
  • Under load there is additional latency in the receiver (h), the "receive latency" (11). The H264 encoded frames are large (see "frame size" and the right y-axis) and the RPI has only 100 Mbit/s Ethernet! The network delay (only one switch) is insignificant; RTCP SR/RR reports a low network round-trip latency (<1 ms).
  • The drop in traffic (i) is a scene switch in Unigine Heaven and is OK.
  • The drop in traffic (j) is a bug in the vGPU driver. Under load (OpenGL) the driver randomly repeats the last frame instead of delivering a new one. The situation is worse in DX9, where the driver sometimes sends older frames (like 1-2-3-4-5-3-4-5-9-10-....). The worst is DX11, which totally cripples full-screen frames, randomly freezes for many frames (effectively down to ~4 FPS) and has more problems.
  • The "stacked" graphs (II. + IV. + VI.) accumulate all measured latencies (7)-(13)/(D) (idle <35 ms, load 1280x1024 <50 ms and load 1920x1080 <65 ms). The complete "click-to-photon" latency additionally includes USB latency (~5 ms + ~10 ms), the vsync wait in the vGPU (0-33 ms) and the LCD monitor latency (input lag ~5 ms and LCD transition ~4 ms) (see the previous posts).
  • Latency can be lowered by overclocking the RPI (nearly 40% less decoding latency).
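As a rough cross-check of the budget in the comments above, the stacked stream latency can be combined with the other contributions, and the serialization time of a frame on the 100 Mbit/s link can be estimated. This is only a sketch: the function names are made up, the constants are the approximate values quoted in the comments, and the frame size in the usage note is a hypothetical round number, not a measured one.

```c
#include <stddef.h>

/* Upper bound of the complete click-to-photon latency in ms, given the
 * stacked stream latency (7)-(13)/(D) read from the graphs. */
int click_to_photon_upper_ms(int stream_ms)
{
    const int usb_ms = 5 + 10;          /* USB latency, ~5 ms + ~10 ms */
    const int vgpu_vsync_max_ms = 33;   /* vsync wait in vGPU, 0-33 ms */
    const int lcd_ms = 5 + 4;           /* input lag + LCD transition */
    return stream_ms + usb_ms + vgpu_vsync_max_ms + lcd_ms;
}

/* Time in ms needed just to push one encoded frame through a link of
 * the given speed (protocol overhead ignored). */
double wire_time_ms(size_t frame_bytes, double link_mbit_per_s)
{
    return (double)frame_bytes * 8.0 / (link_mbit_per_s * 1000.0);
}
```

With the stacked limits from the graphs this gives upper bounds of 92 ms (idle), 107 ms (load 1280x1024) and 122 ms (load 1920x1080). A hypothetical 125000-byte frame would occupy the RPI's 100 Mbit/s link for about 10 ms, which illustrates why large frames show up as "receive latency".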