Computing with GeForce CUDA cards

After reading many threads, I've decided to venture here and attempt to find a straight answer.
I am trying to use a GeForce GTX 570 purely as a CUDA device to run simulations. By now I've figured out:

  • Windows 7's WDDM feature resets my card if my kernel takes too long to execute
  • You can change the "watchdog" timeout via registry keys
  • GeForce cards are not TCC capable.
  • The BIOS will not power on my card if I use the integrated motherboard graphics, and the card then shows up as unrecognized in Device Manager
  • Having two CUDA cards installed may or may not work.

My questions are:

  • Should kernels naturally be short processes? (i.e., loop kernel calls instead of one big kernel call that would take a few minutes)
  • Why would NVIDIA advertise their GeForce cards as CUDA-capable cards if there's no way to simulate for longer than 5 seconds?
  • If I indeed had two cards, how would you set it up so that WDDM wouldn't "watch" the CUDA card?

Answers would be greatly appreciated!

Thanks!

Should kernels naturally be short processes? (i.e., loop kernel calls instead of one big kernel call that would take a few minutes)

There’s no particular reason to make this a general statement unless you are trying to work around the WDDM TDR mechanism (watchdog).
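If you do need to work around the watchdog, the usual pattern is to split one long-running kernel into a host-side loop of many short launches. A minimal sketch, where the kernel, the update rule, and the step count are all made up for illustration:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical simulation step: advances the state array by one iteration.
__global__ void simStep(float *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        state[i] = 0.5f * (state[i] + state[i] * state[i]); // placeholder update
}

int main()
{
    const int n = 1 << 20;
    const int totalSteps = 10000;   // instead of one kernel looping 10000 times...
    float *d_state;
    cudaMalloc(&d_state, n * sizeof(float));
    cudaMemset(d_state, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    for (int s = 0; s < totalSteps; ++s)   // ...launch 10000 short kernels
        simStep<<<grid, block>>>(d_state, n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(err));
    cudaFree(d_state);
    return 0;
}
```

Each individual launch completes well inside the watchdog limit, so the display driver never needs to reset, and the end result is the same as one multi-minute kernel.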

Why would NVIDIA advertise their GeForce cards as CUDA-capable cards if there's no way to simulate for longer than 5 seconds?

You can simulate for longer than 5 seconds if you really want to: disable the TDR watchdog. Under Linux it's relatively straightforward to use GeForce cards for arbitrary-length simulations. The issue is really a limitation specific to Windows/WDDM.
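For reference, the watchdog is controlled by registry values under the GraphicsDrivers key. A .reg fragment like the following (merge it and reboot) raises the timeout to 60 seconds instead of disabling detection outright; setting TdrLevel to 0 disables detection entirely:

```
Windows Registry Editor Version 5.00

; TdrDelay: seconds a GPU packet may run before the watchdog fires (default is 2).
; TdrLevel: 3 = default recovery behavior, 0 = TDR detection disabled.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
"TdrLevel"=dword:00000003
```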

If I indeed had two cards, how would you set it up so that WDDM wouldn't "watch" the CUDA card?

It’s difficult to do if both cards are GeForce cards. If one of the cards is a Quadro or Tesla GPU, put that card in TCC mode, and use the other (GeForce) card as your display.

Thanks for the reply, txbob!

  • Kernel size is arbitrary. Check!
  • This question pertains to the cudaErrorLaunchTimeout error code. On Windows this is triggered if execution takes longer than 5–7 seconds (according to my research). I don't know a thing about Linux and develop using VS2010.
  • I do not have the resources to purchase a Quadro or Tesla GPU. My research is to prove that it is worthwhile.
  • There might be an option in your BIOS that permits booting from the Intel iGPU and installing NVIDIA drivers under Windows for your GeForce card – I've certainly done this using an ASUS motherboard… not sure what specific board you have.

    Something to note… for reasons that are not immediately clear to me, a compute-bound CUDA port of a double-precision inverse imaging algorithm I am using runs at least twice as fast under Ubuntu 13.10 x64 on a Quadro K6000 as under Windows 7 x64 in TCC mode – same hardware. I first noticed this while running the same algorithm on a different Linux box that has a GTX Titan. So be aware that even O/S choice can make quite a performance difference. That said, it may be that this is related to how the code was compiled rather than O/S selection… but that's a subject for another thread…

    Disable the TDR watchdog.

    See https://devtalk.nvidia.com/default/topic/535264/cuda-programming-and-performance/kernel-runs-fine-on-osx-linux-crashes-on-windows-/post/3762516/#3762516 for a way to disable TDR. You can also do the same thing via an NVIDIA Windows app called Nsight Monitor:


    [Screenshot of the Nsight Monitor TDR settings – taken from an older version, but recent ones look similar.]
    You’ll have to reboot for the changes to take effect.

    vacaloca:
    Thank you for the response and information. My system has a Dell BIOS which is very limited. I've looked into it but don't see anything like what you were mentioning.

    I have, however, disabled the watchdog timer.

    Correct me if I'm wrong, but it seems my only option to run CUDA on a GeForce card is to make the system unresponsive until the kernel completes. How would one justify the purchase of, say, a GTX Titan, which is designed for compute, for a Windows system? Is there no way to actually force Windows to run a GeForce card purely as a CUDA device?

    In practice CUDA applications usually use kernels with run times that are only fractions of a second. In other words, CUDA-enabled applications typically make many calls to short-running kernels. The longest running kernel in a real-life application I have worked on runs about two seconds on a K20 (same performance class as Titan); this does not mean there may not be real-life applications that run longer kernels.

    For practical applications you could look at the many distributed computing projects that support and utilize NVIDIA GPUs on Windows. One example would be Folding@home, which runs across 100K+ machines, resulting in total computational performance in the tens of petaFLOPS. On my Windows 7 machine at home, which has a low-end sm_21 graphics card, running the FAH client makes the GUI sluggish, but otherwise it works just fine.

    Has anyone suggested simply running the compute GPU without a monitor (headless)?

    That seems to work fine under Windows 7.

    allanmac,

    I have tried that setup, but the BIOS (or Windows) will not power the card unless it is configured as the primary video output. I have tried using integrated graphics and using two cards. If there is a way, I'm all ears.

    So even if you run the latest Windows 64-bit driver installer and reboot it doesn’t recognize your GPUs?

    If you run GPU-Z you should be able to see both connected and headless GPUs. The CUDA Samples should all run fine.

    If you can’t see your GPU(s) then I’m not sure what’s going on.

    What’s your motherboard and BIOS revision?

    [ As an example, I have an Ivy Bridge ITX board with the Intel gfx as its display and a headless GTX 750 Ti for dev. Seems to work fine under Win7/x64 after a little effort to get Windows to recognize which device was active. I temporarily plugged cables into the IGP and GPU and then into the same monitor and it booted fine. I’ve never had to do that since. ]

    Yes, I have the latest driver from NVIDIA.

    On the IGP setup:

    When booting, the BIOS alerts me that there is a dedicated GPU and suggests I use that as primary. If I continue using the IGP as the display processor (headless GPU), then the card is unrecognized in the Windows Device Manager under display devices. It is also unrecognized by cudaGetDevice. The BIOS for this setup is a Dell version.

    On the multiple-GPU setup:

    One card will throw a TDR error and reset the driver, while using the other card will throw a cudaErrorLaunchTimeout.
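    For what it's worth, that timeout surfaces on the host as the cudaErrorLaunchTimeout code, so you can at least detect it explicitly after a launch. A fragment along these lines – the kernel name and its arguments are hypothetical, and cstdio/cuda_runtime.h are assumed to be included:

    ```cuda
    // After launching a kernel, check whether the WDDM watchdog killed it.
    myKernel<<<grid, block>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();  // forces completion, surfaces async errors
    if (err == cudaErrorLaunchTimeout) {
        fprintf(stderr, "Kernel hit the display watchdog (TDR); "
                        "split the work into shorter launches.\n");
    } else if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    ```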

    It sure sounds like the Dell BIOS is at fault. You should ask Dell to fix it (good luck!). My workstation has 5 GPUs and 3 are headless and the BIOS doesn’t complain – I think that’s the normal situation for most mobos.

    One option might be to try booting with both GPUs connected to a monitor or use a dummy plug.

    If you can get past the BIOS then running headless might work.

    Just a guess…

    @K_launcher,

    Unless you can update your BIOS and enable the functionality, I suspect you will have to add a secondary GPU to drive your display. I am assuming this Dell system is quite old… most of the older Dells do not have the capability to use a PCI-E video card unless you plug the display source into it. I recall it alerts you upon POST that it sees an external card and that you cannot plug the video into the IGP… or some similar verbiage. Those are primarily older Core 2 Duo machines, IIRC.

    You might need a newer system to be able to use the IGP and have your NVIDIA card as secondary, or another NVIDIA or ATI card to serve as the primary display, leaving the other as secondary… however, this last option may not be feasible, as I am assuming your system is old and probably has only one PCI-E slot. If you can find an old PCI video card and have the BIOS boot from it while keeping your PCI-E slot occupied by the NVIDIA card, then maybe that would work.

    Just one thing to remember when talking about kernel lengths and WDDM – it adds a pretty significant overhead to any kernel launch. Whereas before WDDM kernel launches took single-digit milliseconds, now on one of my bits of work it is spending as much time setting up kernels as running them! This means each kernel launch has, say, 50 ms of overhead before it even starts (it is not quite that simple, with batching being done, etc.). This becomes a right pain when trying to do image processing involving lots of short processing kernels mixed with cuFFT calls.

    Hopefully Nvidia will finish the device-callable FFT library soon …

    May I suggest that you break up your long simulation loop into chunks, saving/restoring the current state to/from global memory after/before each iteration.

    The extra overhead shouldn't be too bad, and by tuning the size of your chunks you control how smoothly the display reacts while computing. You can also display a fine-grained progress indication, because your host code knows how far into the simulation you are. It is also possible to display key metrics (intermediate results, or a live preview) during the simulation.
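    That suggestion might look like the following on the host side – the state array simply stays resident in global memory between launches, so each chunk picks up where the last one stopped (simChunk, d_state, and the iteration/chunk counts are illustrative, not from the original post):

    ```cuda
    const int totalIters    = 1000000;
    const int itersPerChunk = 1000;  // tune so one chunk stays well under the TDR limit

    for (int done = 0; done < totalIters; done += itersPerChunk) {
        // d_state persists in device global memory across launches; no
        // save/restore copies are needed unless the host wants to inspect it.
        simChunk<<<grid, block>>>(d_state, n, itersPerChunk);
        cudaDeviceSynchronize();     // chunk boundary: the display can update here

        printf("\rprogress: %.1f%%", 100.0 * (done + itersPerChunk) / totalIters);
        fflush(stdout);
    }
    ```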

    As far as the BIOS problems with the IGP and unrecognized NVIDIA cards are concerned: would a BIOS update fix the issues? Have you tried reinstalling the NVIDIA driver (possibly in "clean install" mode) while the NVIDIA card shows up as not recognized?

    Christian