Hi nvidia devs, users and friends.
Recently, I have been (further) investigating performance issues / hacking on nvidia to improve the driver for PREEMPT_RT_FULL use (i.e. 'realtime Linux').
Through a combination of tests and googling, the source of the issue comes down to the use of the wbinvd instruction: WBINVD (Write Back and Invalidate Cache).
WBINVD flushes the internal cache, then signals the external cache to write back current data, followed by a signal to flush the external cache. When the nvidia driver issues wbinvd, it ends up invalidating the caches of ALL CPUs, forcing them to flush their caches and read everything back in. This literally stalls all of the CPUs, leading to fairly substantial latencies / poor performance on a system that otherwise should be quite deterministic (and is deterministic when that call is not made).
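To illustrate what that looks like (a minimal sketch of my own, not nvidia's actual code path): a kernel module that wants every CPU's caches written back typically broadcasts the instruction with on_each_cpu(), which is essentially what the kernel's wbinvd_on_all_cpus() helper does, and every core sits in the flush until it finishes:

#include <linux/smp.h>

/* WBINVD writes back and invalidates the executing CPU's entire cache
 * hierarchy; it is serializing, cannot be interrupted, and can take
 * milliseconds on large caches. */
static void do_wbinvd(void *unused)
{
    asm volatile("wbinvd" ::: "memory");
}

static void flush_caches_on_all_cpus(void)
{
    /* Run do_wbinvd() on every online CPU and wait for completion;
     * this is the point where the whole machine stalls. */
    on_each_cpu(do_wbinvd, NULL, 1);
}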
On a (vanilla) Linux kernel the problem is less apparent, but still present. On a PREEMPT_RT_FULL system (where latency is critical) nvidia ends up choking the entire system. Here is the patch I am using to get around the issue:
diff -Naur a/nv-linux.h b/nv-linux.h
--- a/nv-linux.h 2013-12-03 23:24:48.484495874 +0100
+++ b/nv-linux.h 2013-12-03 23:27:44.684030888 +0100
@@ -392,8 +392,13 @@
#endif
#if defined(NVCPU_X86) || defined(NVCPU_X86_64)
+#if 0
#define CACHE_FLUSH() asm volatile("wbinvd":::"memory")
#define WRITE_COMBINE_FLUSH() asm volatile("sfence":::"memory")
+#else
+#define CACHE_FLUSH()
+#define WRITE_COMBINE_FLUSH() asm volatile("sfence":::"memory")
+#endif
#elif defined(NVCPU_ARM)
#define CACHE_FLUSH() cpu_cache.flush_kern_all()
#define WRITE_COMBINE_FLUSH() \
diff -Naur a/nv-pat.c b/nv-pat.c
--- a/nv-pat.c 2013-12-03 23:24:33.987007640 +0100
+++ b/nv-pat.c 2013-12-03 23:26:57.615744800 +0100
@@ -34,7 +34,9 @@
{
unsigned long cr0 = read_cr0();
write_cr0(((cr0 & (0xdfffffff)) | 0x40000000));
+#if 0
wbinvd();
+#endif
*cr4 = read_cr4();
if (*cr4 & 0x80) write_cr4(*cr4 & ~0x80);
__flush_tlb();
@@ -43,7 +45,9 @@
static inline void nv_enable_caches(unsigned long cr4)
{
unsigned long cr0 = read_cr0();
+#if 0
wbinvd();
+#endif
__flush_tlb();
write_cr0((cr0 & 0x9fffffff));
if (cr4 & 0x80) write_cr4(cr4);
As you can see, I have it disabled (when building for RT kernels) in my build of the nvidia driver. From what I am told, the Intel OSS driver on Linux 3.10 also had this problem; however, they have since removed that code / corrected the problem. (I am using linux 3.12.1 + the rt patch.)
I am wondering if anyone at nvidia could tell me whether this might be a reasonable workaround (maybe even suitable for inclusion for -rt users?), and/or whether there is any work being done in this area. Are the nvidia Linux devs aware that this is a (potential) problem on mainline Linux, and that it leads to horrible performance on any Linux system with 'hard-realtime' requirements?
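For what it's worth, here is a rough sketch of how the nv-linux.h hunk could be keyed off the -rt config option instead of a blanket #if 0, so that non-rt builds keep the original behaviour (this assumes CONFIG_PREEMPT_RT_FULL is visible where nv-linux.h is compiled, which I have not verified):

#if defined(NVCPU_X86) || defined(NVCPU_X86_64)
#if defined(CONFIG_PREEMPT_RT_FULL)
/* On -rt kernels, skip the global wbinvd; it stalls every CPU. */
#define CACHE_FLUSH()
#else
#define CACHE_FLUSH() asm volatile("wbinvd":::"memory")
#endif
#define WRITE_COMBINE_FLUSH() asm volatile("sfence":::"memory")
#endif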
It would be nice to get some feedback on this. For me, it appears safe: two days of torturing nvidia on linux-rt, no real problems.
[Well, with one exception: the semaphore code in nvidia, when used on linux-rt, does lead to some scheduling bugs (but is otherwise non-fatal). That said, the semaphores can be replaced by mutexes, but that is a different topic altogether.]
If anyone wants to look at the patch(es) or test to verify, you can download my Archlinux package, extract the patches needed and apply them to your own nvidia driver (*requires a PREEMPT_RT_FULL kernel). The patches apply over nvidia-331.20, but the wbinvd problem exists in ALL versions of nvidia. Package/tarball here: https://aur.archlinux.org/packages/nvidia-l-pa/
You will need to apply these two patches:
- nvidia-rt_explicit.patch (sets PREEMPT_RT_FULL)
- nvidia-rt_no_wbinvd.patch (disables wbinvd for PREEMPT_RT_FULL).
- cd into /kernel (sub-folder of nvidia driver/installer)
- apply the (2) above patches
- make IGNORE_PREEMPT_RT_PRESENCE=1 SYSSRC=/usr/lib/modules/"${_kernver}/build" module
- install the compiled binary
- Don't ask me for distro-specific help; I only use Archlinux (which I DO package for).
You can verify what I am talking about by using a tool that can measure latency. I use cyclictest, which is part of the 'rt-tests' suite for linux-rt (see Cyclictest - RTwiki). You will see huge latency spikes when launching videos (on youtube, for example) and possibly when using things like CUDA; disabling the calls results in no spikes.
It would be nice if nvidia found a way to avoid this call altogether, as the Intel OSS developers have done.
BTW: the last patch [nvidia-rt_mutexes.patch] has nothing to do with the wbinvd issue; it converts the semaphores in nvidia to mutexes, which I'm still testing (hence it isn't even enabled in my Archlinux package). It needs review, but I thought I would hit the linux-rt list to get help there, as I am not a programmer, though I do hack / understand some coding/languages to varying degrees.
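For context, that conversion is the usual semaphore-to-mutex swap. A minimal sketch of the pattern (my own illustration with made-up names, not the actual patch):

#include <linux/mutex.h>
#include <linux/semaphore.h>

/* Before: a binary semaphore used purely as a lock, e.g.
 *   struct semaphore lock; sema_init(&lock, 1); down(&lock); ... up(&lock); */

/* After: a mutex, which linux-rt substitutes with a priority-inheriting
 * rt_mutex, avoiding the scheduling trouble seen with the semaphores. */
static DEFINE_MUTEX(nv_example_lock);

static void touch_shared_state(void)
{
    mutex_lock(&nv_example_lock);
    /* critical section */
    mutex_unlock(&nv_example_lock);
}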
Any insights, help or feedback would be nice, as I would like to avoid wbinvd calls on linux-rt / see nvidia improve their driver.
cheerz
Jordan