nvidia using wbinvd < an x86 instruction > causes huge latency spikes

Hi nvidia devs, users and friends.

Recently, I have been (further) investigating performance issues / hacking on the nvidia driver, to improve it for PREEMPT_RT_FULL use. ie: ‘realtime linux’.

Through a combination of tests and googling, the source of the issue comes down to the use of the wbinvd instruction; http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc323.htm

WBINVD flushes the internal cache, then signals the external cache to write back current data, followed by a signal to flush the external cache. When the nvidia driver issues wbinvd, it does so on ALL CPUs, forcing every one of them to flush its caches and read everything back in. ~ this literally stalls all of the cpus -> leading to fairly substantial latencies / poor performance, on a system that otherwise should be quite deterministic. (and is deterministic, when not making that call).
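For context on why one driver call can stall every core: the instruction itself only acts on the CPU that executes it, so a global flush has to be broadcast via an IPI. The mainline kernel's own helper looks roughly like the sketch below (modeled on wbinvd_on_all_cpus() in arch/x86/lib/cache-smp.c; the binary nvidia driver presumably does something equivalent, but its exact code is not public, so treat this as illustrative only):

```c
/* Kernel-side sketch (not runnable in userspace): broadcasting wbinvd.
 * Modeled on the mainline helper; illustrative, not nvidia's code. */
static void __wbinvd(void *dummy)
{
    /* wbinvd is serializing: the core stops, writes back and invalidates
     * its entire cache hierarchy before doing anything else. */
    asm volatile("wbinvd" ::: "memory");
}

static void flush_cache_all_cpus(void)
{
    /* on_each_cpu() IPIs every online CPU and waits for completion,
     * so every core in the system stalls for the duration. */
    on_each_cpu(__wbinvd, NULL, 1);
}
```

On a machine with several cores and large caches, that wait can easily run into the hundreds of microseconds, which is exactly the kind of spike cyclictest reports.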

On a (vanilla) linux kernel the problem is less apparent, but still present. On a PREEMPT_RT_FULL system (where latency is critical) nvidia ends up choking the entire system. Here is the patch I am using to get around the issue;

diff -Naur a/nv-linux.h b/nv-linux.h
--- a/nv-linux.h	2013-12-03 23:24:48.484495874 +0100
+++ b/nv-linux.h	2013-12-03 23:27:44.684030888 +0100
@@ -392,8 +392,13 @@
 #if defined(NVCPU_X86) || defined(NVCPU_X86_64)
+#if 0
 #define CACHE_FLUSH()  asm volatile("wbinvd":::"memory")
 #define WRITE_COMBINE_FLUSH() asm volatile("sfence":::"memory")
+#else
+#define CACHE_FLUSH()
+#define WRITE_COMBINE_FLUSH() asm volatile("sfence":::"memory")
+#endif
 #elif defined(NVCPU_ARM)
 #define CACHE_FLUSH() cpu_cache.flush_kern_all()
 #define WRITE_COMBINE_FLUSH()   \
diff -Naur a/nv-pat.c b/nv-pat.c
--- a/nv-pat.c	2013-12-03 23:24:33.987007640 +0100
+++ b/nv-pat.c	2013-12-03 23:26:57.615744800 +0100
@@ -34,7 +34,9 @@
     unsigned long cr0 = read_cr0();
     write_cr0(((cr0 & (0xdfffffff)) | 0x40000000));
+#if 0
     *cr4 = read_cr4();
     if (*cr4 & 0x80) write_cr4(*cr4 & ~0x80);
+#endif
@@ -43,7 +45,9 @@
 static inline void nv_enable_caches(unsigned long cr4)
     unsigned long cr0 = read_cr0();
+#if 0
     write_cr0((cr0 & 0x9fffffff));
     if (cr4 & 0x80) write_cr4(cr4);
+#endif

As you can see, I have it disabled (when building for RT kernels) in my build of the nvidia driver. From what I am told, the Intel OSS driver on linux 3.10 also had this problem - however, they have since removed that code / corrected the problem. (I am using linux 3.12.1 + the rt patch)

I am wondering if anyone at nvidia could tell me whether this might be a reasonable workaround (maybe even suitable for inclusion for -rt users?), whether any work is being done in this area, and whether nvidia's linux devs are aware that this is a (potential) problem on mainline linux that leads to horrible performance on any linux system with ‘hard-realtime’ requirements.

It would be nice to get some feedback on this. For me, it appears safe - two days of torturing nvidia on linux-rt, no real problems.

—[Well, with one exception; the semaphore code in nvidia, when used on linux-rt, does lead to some scheduling bugs (but is otherwise non-fatal). That said, the semaphores can be replaced by mutexes, but that is a different topic altogether.]—

If anyone wants to look at the patch(es) or test to verify - you can download my Archlinux package, extract the patches needed and apply them to your own nvidia driver (*requires a PREEMPT_RT_FULL kernel). The patches apply over nvidia-331.20 - but the wbinvd problem exists in ALL versions of the nvidia driver. Package/tarball here; https://aur.archlinux.org/packages/nvidia-l-pa/

You will need to apply these two patches;

  • nvidia-rt_explicit.patch (sets PREEMPT_RT_FULL)
  • nvidia-rt_no_wbinvd.patch (disables wbinvd for PREEMPT_RT_FULL).
  1. cd into /kernel (sub-folder of nvidia driver/installer)

  2. apply the (2) above patches

  3. make IGNORE_PREEMPT_RT_PRESENCE=1 SYSSRC=/usr/lib/modules/"${_kernver}/build" module

  4. install the compiled binary

  • Don’t ask me for distro-specific help - I only use Archlinux (which I DO package for).

You can verify what I am talking about by using a tool that can measure latency; I use Cyclictest, which is part of ‘rt-tests’ for linux-rt; https://rt.wiki.kernel.org/index.php/Cyclictest - you will see huge latency spikes when launching videos (on youtube, for example) and possibly when using things like CUDA. Disabling the calls results in no spikes.

It would be nice if nvidia found a way to avoid this call altogether, as the Intel OSS developers have done.

BTW - the last patch [nvidia-rt_mutexes.patch] has nothing to do with the wbinvd issue; that one is for converting the semaphores in nvidia to mutexes -> which I’m still testing (hence it isn’t even enabled in my Archlinux package). It needs review, but I thought I would hit the linux-rt list to get help there - as I am not a programmer, but I do hack / understand some coding/languages to varying degrees.

Any insights, help, or feedback would be nice, as I would like to avoid wbinvd calls on linux-rt / see nvidia improve their driver.



No takers eh?

I still have yet to see any issues with the no_wbinvd patch enabled. I asked on the linux-rt-users list, but only got a bit of feedback from a user, NOT a developer - so that wasn’t exactly helpful. I hope someone at nvidia reads my post and looks into it / replies here.

OT: but here is the kind of thing you see on linux-rt (3.12.1-rt currently, but it happens all throughout the 3.x-rt series), when nvidia is using semaphores;

[197972.079574] BUG: scheduling while atomic: irq/42-nvidia/18410/0x00000002
[197972.079596] Modules linked in: nvidia(PO) snd_seq_midi snd_seq_midi_event snd_seq_dummy snd_hrtimer snd_seq isofs fuse joydev hid_generic usbhid snd_usb_audio snd_usbmidi_lib hid snd_rawmidi snd_seq_device wacom snd_hda_codec_hdmi forcedeth snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_timer snd soundcore edac_core edac_mce_amd k10temp evdev serio_raw i2c_nforce2 video wmi asus_atk0110 button processor drm i2c_core microcode ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic pata_acpi ahci libahci pata_amd ohci_pci ohci_hcd ehci_pci libata ehci_hcd firewire_ohci firewire_core crc_itu_t scsi_mod usbcore usb_common [last unloaded: nvidia]
[197972.079599] CPU: 3 PID: 18410 Comm: irq/42-nvidia Tainted: P        W  O 3.12.1-rt4-3-l-pa #1
[197972.079599] Hardware name: System manufacturer System Product Name/M4N75TD, BIOS 1701    04/14/2011
[197972.079601]  ffff8801efb5fa70 ffff8801efb5f910 ffffffff814ebb7d ffff88022fcd1b40
[197972.079602]  ffff8801efb5f920 ffffffff814e8f00 ffff8801efb5fa28 ffffffff814eeb49
[197972.079603]  0000000000011b40 ffff8801efb5ffd8 ffff8801efb5ffd8 0000000000011b40
[197972.079603] Call Trace:
[197972.079608]  [<ffffffff814ebb7d>] dump_stack+0x54/0x9a
[197972.079609]  [<ffffffff814e8f00>] __schedule_bug+0x48/0x56
[197972.079611]  [<ffffffff814eeb49>] __schedule+0x629/0x7e0
[197972.079613]  [<ffffffff814f0d4e>] ? _raw_spin_unlock_irqrestore+0xe/0x60
[197972.079615]  [<ffffffff810bba55>] ? task_blocks_on_rt_mutex+0x1f5/0x260
[197972.079616]  [<ffffffff814eed2a>] schedule+0x2a/0x80
[197972.079618]  [<ffffffff814efc2b>] rt_spin_lock_slowlock+0x177/0x2ac
[197972.079726]  [<ffffffffa1538401>] ? _nv014994rm+0x1395/0x25f4 [nvidia]
[197972.079732]  [<ffffffff814f08a5>] rt_spin_lock+0x25/0x40
[197972.079734]  [<ffffffff81083129>] __wake_up+0x29/0x60
[197972.079768]  [<ffffffffa17468dd>] nv_post_event+0xdd/0x120 [nvidia]
[197972.079807]  [<ffffffffa171e369>] _nv013270rm+0xed/0x144 [nvidia]
[197972.079843]  [<ffffffffa122fc7e>] ? _nv013107rm+0x9/0xb [nvidia]
[197972.079906]  [<ffffffffa1433868>] ? _nv005358rm+0xbe/0xe7 [nvidia]
[197972.079968]  [<ffffffffa1433b42>] ? _nv012422rm+0xdf/0xf8 [nvidia]
[197972.080035]  [<ffffffffa1433ac2>] ? _nv012422rm+0x5f/0xf8 [nvidia]
[197972.080107]  [<ffffffffa15721ec>] ? _nv009896rm+0xb0d/0xd40 [nvidia]
[197972.080182]  [<ffffffffa1572204>] ? _nv009896rm+0xb25/0xd40 [nvidia]
[197972.080235]  [<ffffffffa1607a73>] ? _nv011894rm+0x4df/0x709 [nvidia]
[197972.080286]  [<ffffffffa160618a>] ? _nv001242rm+0x21e/0x2a7 [nvidia]
[197972.080337]  [<ffffffffa1606570>] ? _nv011911rm+0x3d/0x14b [nvidia]
[197972.080372]  [<ffffffffa122fc7e>] ? _nv013107rm+0x9/0xb [nvidia]
[197972.080422]  [<ffffffffa160da98>] ? _nv011891rm+0x38/0x59 [nvidia]
[197972.080459]  [<ffffffffa172127a>] ? _nv000818rm+0xcd/0x133 [nvidia]
[197972.080492]  [<ffffffffa1725691>] ? rm_isr_bh+0x23/0x73 [nvidia]
[197972.080523]  [<ffffffffa1743a1b>] ? nvidia_isr_bh+0x3b/0x60 [nvidia]
[197972.080525]  [<ffffffff81055a89>] ? __tasklet_action.isra.11+0x69/0x120
[197972.080526]  [<ffffffff81055bfe>] ? tasklet_action+0x5e/0x60
[197972.080527]  [<ffffffff8105555c>] ? do_current_softirqs+0x19c/0x3a0
[197972.080529]  [<ffffffff810a6300>] ? irq_thread_fn+0x60/0x60
[197972.080530]  [<ffffffff810557be>] ? local_bh_enable+0x5e/0x80
[197972.080531]  [<ffffffff810a633b>] ? irq_forced_thread_fn+0x3b/0x80
[197972.080532]  [<ffffffff810a657f>] ? irq_thread+0x11f/0x160
[197972.080533]  [<ffffffff810a65c0>] ? irq_thread+0x160/0x160
[197972.080534]  [<ffffffff810a6460>] ? wake_threads_waitq+0x60/0x60
[197972.080536]  [<ffffffff81074f12>] ? kthread+0xb2/0xc0
[197972.080537]  [<ffffffff81074e60>] ? kthread_worker_fn+0x1a0/0x1a0
[197972.080539]  [<ffffffff814f8bac>] ? ret_from_fork+0x7c/0xb0
[197972.080540]  [<ffffffff81074e60>] ? kthread_worker_fn+0x1a0/0x1a0

…with mutexes replacing the semaphore code, you don’t see this kind of ugliness in the kernel ring buffer.
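For anyone curious, the conversion itself is mostly mechanical for semaphores that are only ever used as binary locks. A hedged sketch of the kernel APIs involved (not nvidia's actual code, which is obfuscated; the variable names are made up):

```c
/* Before: a semaphore used purely as a binary lock. On PREEMPT_RT this
 * gives the scheduler no priority-inheritance information. */
struct semaphore lock_sem;
sema_init(&lock_sem, 1);
down(&lock_sem);
/* ... critical section ... */
up(&lock_sem);

/* After: a mutex. On PREEMPT_RT, mutex_lock() becomes a priority-
 * inheriting rt_mutex, avoiding unbounded priority inversion. Only
 * valid where the lock is released by the same task that acquired it
 * and is never taken from interrupt context. */
struct mutex lock_mtx;
mutex_init(&lock_mtx);
mutex_lock(&lock_mtx);
/* ... critical section ... */
mutex_unlock(&lock_mtx);
```

The restrictions in the comment are exactly why the conversion needs review: any semaphore that is used for signalling between contexts, rather than as a lock, cannot be converted this way.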


Could someone from nvidia PLEASE! f-ing respond to my post!

I don’t expect any users will be able to clarify the wbinvd issue, and I would like some info from someone ‘in the know’… It’s freaking annoying that as a long-time nvidia (linux) user/customer - who buys nvidia for every PC and recommends nvidia to others - I can’t even get ONE post answered in these developer forums.

It really makes me question whether I should continue to take my hard-earned cash and spend it on nvidia, when I could be giving it all to Intel ~ who ARE helpful, DO respond to inquiries, etc…

(so far) thanks for nothing.

I’m sure your loyalty is appreciated but to get the sort of feedback you’re looking for is unheard of here – not just in the driver forum but across all of the dev forums. To get that kind of information you would need to engage an engineer that you know is already overworked and can’t spare the time to read these forums religiously and respond to every post. There are even serious bug issues that get reported here and don’t get any attention (publicly).

Maybe things are different at Intel but, in the end, I’m not sure that they’re making a better product.

I suggest that you might have more success getting an engineer’s attention through a non-public method, e.g. through a proxy.

ninez: Have you tested your patch with some of CUDA’s more unusual memory options? Namely, write-combined memory? I think CUDA may also support uncached host-side memory. Take a look at kernel/nv-vm.c: You’ll see uncached pages used all over the place (and calls to flush the CPU cache). I get the feeling that what you’re doing could be dangerous for some applications. Cache inconsistencies can sometimes be really hard to trigger/detect.

Do you have sense for what the driver is doing when it calls wbinvd? Is it in an interrupt handler? Tasklet? Workqueue item? System call (e.g., ioctl())?

@Arakageeta - I just saw your message (very helpful, thanks), but unfortunately it’s early morning here and I am just about to leave for work, so I don’t have time to get into this right now… I’ll have a (more detailed) look through nv-vm.c when I get home + try to get some other details together.


EDIT: this may take me a few days to get back to; forgot it was my Father’s b-day yesterday && the holiday season / shopping for all the kids in my family / extended family is not only cracking my piggie bank - but also taking up most of my time (for the next few days anyway). but I’ll free up some time soon, to sort this out.

I’ve applied the patch and I haven’t seen any difference on my system, a non-RT desktop with an AMD CPU + 650GTX. I’m also suffering from some terrible lag in certain applications, though it clearly appears to be unrelated to this cache-flush instruction.

ninez, you are not alone in your frustration with nVidia. They seem to have made it clear as of late that their priorities are elsewhere. I have old hardware (8800GTS or 9400GT) that works flawlessly and smoothly with old drivers. Then you get to Kepler, and it’s a terrible experience. The lag is even easy to reproduce:

  1. install wine
  2. install trial of Eve Online
  3. start it up, undock and “warp” your ship or just look around.

And the only difference between the smooth experience with the old video card and the jagged, unusable one is the video card + drivers. With older drivers, it even produced Xid driver errors 59, 8, and others. nVidia fixed those errors - it took them a year to address that. …

Is this a graphics or CUDA application?

Graphics. I suspect something finicky with some shaders, but that is speculation at present.

For computation, I don’t use CUDA, just simple OpenCL kernels (program) and I’ve had no problems with OpenCL.

Hey guys - Sorry for the extremely late reply. My holiday season was an absolute mess;

  • massive Ice storm, no power for several days over the holidays
  • huge amount of cleanup + helping sort out other friends/families issues
  • problems at work due to some nasty data loss (actually H/W failure).
  • (and now) i am sick as a dog; which ironically, should free me up some time, over the weekend / into next week. hopefully.

I’m @ work today - but I’m pretty sure i will be taking off several days next week, since i am almost ‘caught up’ @ work… and as the person who i got this bug from has been extremely sick for well over a week, i doubt i will be going into work (next week), if i can avoid it.

Anyway, I’ll try to set aside some time to delve further into this as soon as I can.

The CACHE_FLUSH macro is used in nv-vm.c, which contains this comment:

 * Cache flushes and TLB invalidation
 * Allocating new pages, we may change their kernel mappings' memory types
 * from cached to UC to avoid cache aliasing. One problem with this is
 * that cache lines may still contain data from these pages and there may
 * be then stale TLB entries.
 * The Linux kernel's strategy for addressing the above has varied since
 * the introduction of change_page_attr(): it has been implicit in the
 * change_page_attr() interface, explicit in the global_flush_tlb()
 * interface and, as of this writing, is implicit again in the interfaces
 * replacing change_page_attr(), i.e. set_pages_*().
 * In theory, any of the above should satisfy the NVIDIA graphics driver's
 * requirements. In practise, none do reliably:
 *  - most Linux 2.6 kernels' implementations of the global_flush_tlb()
 *    interface fail to flush caches on all or some CPUs, for a
 *    variety of reasons.
 * Due to the above, the NVIDIA Linux graphics driver is forced to perform
 * heavy-weight flush/invalidation operations to avoid problems due to
 * stale cache lines and/or TLB entries.

I’ll defer to the kernel experts, but my understanding is that this is required to avoid problems with cache consistency, which can be extremely difficult to track down. I’m sorry that this introduces problems with your -rt-patched kernel’s latency guarantees. This is part of why -rt kernels are not officially supported.

1st. Thanks for replying, aplattner. I was actually surprised to see the ‘elusive’ olive-green background of an nvidia dev’s comment ;)

Yes, I have come across that bit (in recent days, and I have been reading up a little on cache coherence / consistency). It’s interesting, because afaict, none of my systems have had issues removing that call. (All of my systems’ MOBOs support nvidia features like core calibration, SLI, etc., both are SMP (one 4-core, one 8-core), and both run PREEMPT_RT_FULL.) …and I’ve never had performance / determinism this smooth.

Yeah, I got that impression from reading the code comment - obviously, I had been running the patched nvidia (long) before seeing that bit ~ and hadn’t/haven’t noticed any problems. I wonder, do you have any suggestions as to how to go about tracking down problems with cache consistency? In the weeks(?) that I have been running nvidia without wbinvd, I haven’t experienced any odd behavior (aside from my system working as it should). I’ve run some (linux-related) diagnostic tools, cuda_memtest, all cuda examples/checks; my H/W-accelerated VMs in vmware work great, unigine-* GFX benchmarks (and others) work fine… It would really be nice to NOT have to go back to a driver that introduces severe latency spikes ;)

I wonder if using mutexes is having any impact here(?) (mutexes imply more than just synchronization: also memory barriers, correct ordering, etc., afaict/understand…). I also use a few other non-standard (linux) bits, like UKSM (in-kernel deduplication of memory), a few other tweaks like MAX_READAHEAD multiplied/increased (MM subsystem), and linking with ld.bfd instead of ld.gold for nvidia (and obviously no wbinvd, mutexes, etc.)… Maybe that is grasping at straws, but I would tend to think: if wbinvd is so critical, why haven’t I experienced any problems? (only benefits).

Anyway, it would be nice if you could defer it to someone who is more in the know. Maybe there is a better / different way to handle the caches that could be worth exploring… this I don’t know. But thanks regardless.

Hi, Arakageeta - CUDA on my system without wbinvd works just fine; all of the examples are smooth, I pass all tests, etc… I didn’t have a problem with any of them. :) I also downloaded some other cuda apps/demos from around the web to test; aside from the odd one that didn’t compile (code rot, most likely), all of them worked great. ~ the only observation I had was that certain cuda demos put a little ‘strain’ on Compiz, * but that only involved windows moving slightly slower - that’s it. (the stock/unpatched kernel/driver does the same thing for me).

I’ve found some tools for stress testing (including testing cache coherency, among other things). Right now I am using Google’s “Stressful Application Test”; http://code.google.com/p/stressapptest/ . On the cache coherency test(s) I get a PASSING grade, zero errors/issues. - I also went to the trouble of using some CUDA at the same time, then VMware after that (rerunning the test during each use) - still I get a ‘passing grade’ / my caches are fine…

Since I am sick and not working, I am going to use a nice chunk of the day to see if I can find any test / debugging mechanism in the kernel that will actually report an issue with cache coherency or consistency.

EDIT: I’ve added rdtsc usage and cachegrind (part of valgrind) to my list of tools. So far so good (in any tests that I have run). I believe Intel’s VTune should potentially be useful too, except that while I do have their compiler suite installed, I don’t think I ever actually got VTune. (on my list of things to do today).

@aplattner: This has been NVIDIA’s policy for quite some time, but it may have to change: SteamOS (at least the beta) runs the PREEMPT_RT Linux kernel. (NVIDIA could easily not fix ninez’s bug and still claim full support—this is a latency issue, not a functional issue.)

@ninez: I’m glad that your patch seems stable. However, “exhaustive” testing such as this can only give you warm-fuzzies about the patch on your particular software/hardware system. How do you know what you’re doing is correct for other CPU models, each operating at different speeds, and with different clock ratios between the CPU and various buses? Cache errors may be the most vile and hard-to-diagnose race conditions out there. I think what you’re doing is great. I’m glad you’re sharing your code and experience with everyone. But as for a general solution, merely removing wbinvd sounds very dangerous. Is there anything that can replace the instruction instead of removing it? I’ll grep around the Linux kernel to see how they handle the situation…

The code comments aplattner posted from nv-vm.c talk about change_page_attr(). Digging into arch/x86/mm/pageattr.c (3.0.x kernel), we find this function calls cpa_flush_*():

    /*
     * On success we use clflush, when the CPU supports it to
     * avoid the wbindv. If the CPU does not support it and in the
     * error case we fall back to cpa_flush_all (which uses
     * wbindv):
     */
    if (!ret && cpu_has_clflush) {
        if (cpa.flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) {
            cpa_flush_array(addr, numpages, cache,
                    cpa.flags, pages);
        } else
            cpa_flush_range(baddr, numpages, cache);
    } else
        cpa_flush_all(cache);

Seems like this code is sensitive to CPU capabilities. It appears that cpa_flush_range/array() call a lighter-weight method of cache invalidation: clflush (see arch/x86/include/asm/system.h) instead of wbinvd.

If we assume Linux is working properly, then there is no need to flush the cache in nv-vm.c::nv_flush_cache(). NVIDIA says it doesn’t always work, hence the wbinvd. Unfortunately, nv-vm.c doesn’t give us any more information.

Here are the code comments in nv-vm.c from an older 27x-era driver. They’re a little different:

 * Cache flushes and TLB invalidation
 * Allocating new pages, we may change their kernel mappings' memory types
 * from cached to UC to avoid cache aliasing. One problem with this is
 * that cache lines may still contain data from these pages and there may
 * be then stale TLB entries.
 * The Linux kernel's strategy for addressing the above has varied since
 * the introduction of change_page_attr(): it has been implicit in the
 * change_page_attr() interface, explicit in the global_flush_tlb()
 * interface and, as of this writing, is implicit again in the interfaces
 * replacing change_page_attr(), i.e. set_pages_*().
 * In theory, any of the above should satisfy the NVIDIA graphics driver's
 * requirements. In practise, none do reliably:
 *  - some Linux 2.4 kernels (e.g. vanilla 2.4.27) did not flush caches
 *    on CPUs with Self Snoop capability, but this feature does not
 *    interact well with AGP.
 *  - most Linux 2.6 kernels' implementations of the global_flush_tlb()
 *    interface fail to flush caches on all or some CPUs, for a
 *    variety of reasons.
 * Due to the above, the NVIDIA Linux graphics driver is forced to perform
 * heavy-weight flush/invalidation operations to avoid problems due to
 * stale cache lines and/or TLB entries.

Here, the comments state that the 2.6 kernel only needs a TLB flush. This implies to me that commenting out the call to the CACHE_FLUSH() macro in nv_flush_cache() should be safe.* I think this is a better solution than changing CACHE_FLUSH() into a noop. Why did the comments in nv-vm.c change? Did an engineer get overly zealous in cleaning up comments when AGP or 2.4 kernel support was dropped? Did NVIDIA learn of other instances where the 2.6 kernel also needed a cache flush? We’ll never know.

I think the best that you can do is register a bug with NVIDIA and hope that they task an engineer to reevaluate the situation. This is such a low-level and fundamental part of memory management that I could see NVIDIA being very (extremely) hesitant to make any official changes unless wbinvd starts to create serious problems for important customers (AAA games on Linux). I’m not surprised that you’ve hit a bug that relates to latency: the Linux driver hasn’t really had to support low-latency operations until recently. Hopefully SteamOS will help motivate a change.

In the meantime, I think the best that you can do is test your system with CACHE_FLUSH() commented out from nv_flush_cache() and hope for the best. I’ll test it out on my system as well.

  * It should be safe, assuming that nv_flush_cache()’s callers invoke it because they changed memory attributes via change_page_attr() or set_pages_*(), and not for some other reason. This appears to be the case: we can limit the scope of code that needs to be reviewed to nv-vm.c, since nv_flush_cache() is a static function.


Agreed, simply making it a noop wasn’t necessarily the best plan - but for getting to testing, it certainly has been helpful. ;)

Over the last couple of days (since last posting), I have yet to find any problems on any of my machines - but obviously, as you point out, that doesn’t mean it would work across the board / on any system. (It does appear to be fine here on my machines, which for me is the most important thing.) That is why I hit the dev forums: I would like a solution that works for everyone, and obviously, given my limited experience, I wanted/needed more input from external sources (nvidia, you, etc).

I will give your ‘CACHE_FLUSH()-commented-out-of-nv_flush_cache()’ idea a try (probably in a few minutes), but I’ll get back to you tomorrow or maybe the next day, just so I can get some rigorous testing in.

As far as reporting a bug: I can’t really, as aplattner pointed out that linux-rt is ‘unsupported’. Plus, we have confirmed with nvidia (via him) that they are aware of the problem ~ so hopefully (publicly or not) they can look into it too.



I tested out your idea (for nv-vm.c); it does reintroduce some of the latency spikes, but to a lesser degree - meaning nv-vm.c is not the source of all pain (since wbinvd is used in other places).

I’ve also upgraded to 331.38, so I had to re-test the whole thing + update my nvidia-rt_mutexes.patch against that driver (a few bits have changed here and there; 331.38 seems pretty good though).

Anyway, I think next I want to try using an alternative to wbinvd, as was discussed before. So I guess that is probably the next logical step. Maybe this weekend I can get to that part…

EDIT: btw, if you have any input / thoughts, feel free to interject - I am finding your insights quite useful / helpful to my own learning/understanding of this stuff ;)


I also tried out your changes on 331.38 with CACHE_FLUSH() commented out in nv-vm.c. I haven’t done any rigorous testing though; I just ran a handful of programs from the CUDA examples. Everything appears to be okay. I’m on a dual-socket Westmere system with Quadro K5000s.

I’m interested in testing UVM in CUDA 6.0 with your patch. As I understand it, UVM does some tricky things with CPU/GPU page tables to create a unified memory space. I suspect that UVM will hit the code paths (or similar paths) that we have discussed very hard. Looping over UVM-managed memory from both the CPU and GPU simultaneously may induce quite a bit of stress. It may also be a good test-case for exposing the latencies that you have reported.


I’m curious about CUDA 6.0 as well --(well, my CUDA driver is 6.0)-- but I guess we’ll have to wait until the 6.0 Runtime is available. I wonder if there are any 5.5 UVM examples/tests around? (If any of the cuda samples use it, the only ones I couldn’t run were the compute capability 3.5 samples, IIRC.) ~ which is also why I’m thinking about picking up a newer nvidia card soonish, if I can find a good deal at the right time… My GT 440 works really well, but I could probably pick up a current mid-range card that would let me test some newer features.

cuda’s deviceQuery;

$ '/opt/cuda/samples/bin/x86_64/linux/release/deviceQuery' 
/opt/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 440"
  CUDA Driver Version / Runtime Version          6.0 / 5.5
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1023 MBytes (1072889856 bytes)
  ( 2) Multiprocessors, ( 48) CUDA Cores/MP:     96 CUDA Cores
  GPU Clock rate:                                1620 MHz (1.62 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce GT 440
Result = PASS

Regardless, for now, this card gets the job done on this particular system. --(no issues whatsoever on linux-rt, well, not when i’m using my patched version anyway.)–.

Just to add some more random noise: I’ve used the patch mentioned here on my GTX 660 and 331.38 for some weeks now and it runs fine. The card is mostly used for gaming with wine, no CUDA.