CUDA Error - 4x 1080 Ti - Redshift Rendering issue

After about 5-10 minutes of using Redshift for C4D, I’m getting a complete Windows 10 freeze, followed by the Redshift error below:

ASSERT FAILED File GPUComputing_CUDA.cpp Line 2402 MemCpy unnamed failed (CUDA_ERROR_LAUNCH_TIMEOUT).

It completely crashes and freezes the system. Sometimes it comes back to life and only Cinema 4D crashes; sometimes it takes Windows 10 down completely.

It all used to work fine, but now rendering with Redshift is completely unusable. I recently upgraded from 3x 1080 Ti cards to 4x. Is this a GeForce driver issue?

GeForce drivers: 398.82 - WHQL
Using Redshift 2.6.19 and Cinema 4D R19.

SPECS:
Windows 10 Home Edition
Intel Core i7-6800K Broadwell Extreme 3.4GHz
ASUS X99-E WS ECC-Ready Intel X99 Workstation ATX
64GB Corsair Vengeance LPX DDR4 3000MHz
4x 11GB NVIDIA GeForce GTX 1080 Ti GDDR5X
1500W Corsair AXi Fully Modular Digital PSU
250GB Samsung 960 Evo V-Nand PCI-e m.2 Drive

A launch timeout usually means that a kernel ran longer than the roughly 2 seconds normally allowed by the WDDM subsystem.

If you have a GPU that can be used in TCC mode, that would probably help, but I don’t know whether Redshift can recognize and use such a GPU, and your 1080 Ti GPUs don’t support TCC mode anyway.

Alternatively, you could try increasing your WDDM TDR timeout. If you just google “WDDM TDR timeout” you’ll find many writeups of how to do it.
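As a rough illustration of what those write-ups describe: the timeout is normally controlled by a `TdrDelay` DWORD value (in seconds) under the `GraphicsDrivers` registry key. The sketch below only builds the text of a `.reg` file for a chosen delay; the helper name `tdr_reg_file` is made up for this example, and it assumes the standard key location on Windows 10.

```python
# Minimal sketch: build the text of a .reg file that sets the WDDM TDR delay.
# Assumption: the TdrDelay value lives under the GraphicsDrivers key below.
# The function name tdr_reg_file is hypothetical, chosen for this example.

TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_file(delay_seconds: int) -> str:
    """Return .reg file text that sets TdrDelay to delay_seconds."""
    if not 0 < delay_seconds <= 0xFFFFFFFF:
        raise ValueError("delay_seconds must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        # .reg files store DWORD values as 8 hexadecimal digits.
        f'"TdrDelay"=dword:{delay_seconds:08x}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the file contents for a 60 second timeout.
    print(tdr_reg_file(60))
```

Saving the printed text as a `.reg` file and importing it with administrator rights should apply the setting; a reboot is generally needed before it takes effect. Back up the registry first.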

Finally, this issue should probably be addressed with the software developer. A launch timeout is not a fundamental defect or misconfiguration of your setup, nor does it represent any sort of hardware or driver problem. It is a software development issue.

Note that the “win10 freeze” may indicate a more serious software problem (hung kernel, never-ending kernel) than just a kernel that ran slightly longer than ~2s. However, even in that case, it represents a software development issue (a defect in Redshift) that would have to be addressed by the software developer. Apart from modifying your WDDM TDR timeout, there wouldn’t be anything you could do to try to address this, and as already stated, a timeout, by itself, does not represent a hardware or driver problem.

General principles apply, of course. You probably want to make sure your system is delivering adequate power to your GPUs, and you want to make sure the GPUs are not overheating. The recent upgrade from 3->4 GPUs suggests to me a reasonable possibility that your power supply is overtaxed.
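One way to keep an eye on temperature and power while rendering is to poll `nvidia-smi`. The sketch below parses its CSV output and flags GPUs that look hot or close to their power limit. The thresholds (`temp_limit_c`, `power_margin`) and the sample readings are made-up values for illustration, not vendor limits, and the parser assumes every field is reported as a number.

```python
import subprocess

# Query fields understood by nvidia-smi's --query-gpu option.
QUERY = "index,temperature.gpu,power.draw,power.limit"


def read_gpu_stats_text() -> str:
    """Run nvidia-smi and return its raw CSV output (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_gpu_stats(text: str) -> list:
    """Parse 'index, temp, draw, limit' CSV lines into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        idx, temp, draw, limit = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(idx),
            "temp_c": float(temp),
            "power_w": float(draw),
            "limit_w": float(limit),
        })
    return rows


def flag_gpus(rows, temp_limit_c=83.0, power_margin=0.95):
    """Return indices of GPUs at or above the (hypothetical) thresholds."""
    return [
        r["index"] for r in rows
        if r["temp_c"] >= temp_limit_c
        or r["power_w"] >= power_margin * r["limit_w"]
    ]


if __name__ == "__main__":
    # Made-up sample readings; call read_gpu_stats_text() on a real system.
    sample = "0, 82, 248.10, 250.00\n1, 65, 120.50, 250.00\n"
    print(flag_gpus(parse_gpu_stats(sample), temp_limit_c=80.0))
```

Running this in a loop during a render would show whether the crash coincides with a temperature or power spike on a particular card.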

Thank you for the fast reply, much appreciated.

I’ve been told that 1500W should be more than enough to power 4x GPUs; perhaps it’s not functioning correctly. The GPUs do heat up to over 80C, but I’ve read this is normal for rendering.
I’ll pass this on to the Redshift devs and see if there is a solution available.
I will also try the WDDM TDR timeout.

I wouldn’t call it “more than enough”, I would call that “cutting it uncomfortably close”. It’s basically one GTX 1080 Ti too many.

Rule of thumb for rock-solid systems, based on my considerable experience in overloading and destroying power supplies: The total sum of nominal wattage of all system components should not exceed 60% of the nominal wattage of the power supply. I consider 80 PLUS Platinum rated power supplies ideal for workstations, and 80 PLUS Gold the minimum standard.

GTX 1080 Ti is specified for 250W, times four is 1000W. Your CPU has a TDP of 140W. 64 GB of DRAM comes in at 25W (maybe more because it is highly clocked). The SSD is maybe 5W, say another 10W for motherboard components, networking etc. Grand total of 1180W. So ideally, you’d be using a roughly 1970W power supply. Your current PSU load factor is 78.7%, high enough to almost guarantee occasional hiccups when running GPU-accelerated applications.
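The arithmetic above can be reproduced with a short sketch. The component wattages are the estimates from this post; the function name `psu_budget` and the dictionary labels are made up for the example, and the 60% target is the rule of thumb stated earlier.

```python
# Minimal sketch of the PSU budget arithmetic, using the post's estimates.
# psu_budget is a hypothetical helper name, not part of any tool.

def psu_budget(component_watts: dict, psu_watts: float, target_load: float = 0.60) -> dict:
    """Sum nominal component wattage and relate it to the PSU rating."""
    total = sum(component_watts.values())
    return {
        "total_w": total,
        # Fraction of the PSU's nominal rating used by the nominal load.
        "load_factor": total / psu_watts,
        # PSU rating at which the same load would sit at the target fraction.
        "recommended_psu_w": total / target_load,
    }


components = {
    "4x GTX 1080 Ti": 4 * 250,
    "CPU (TDP)": 140,
    "64 GB DRAM": 25,
    "SSD": 5,
    "motherboard, networking etc.": 10,
}

budget = psu_budget(components, psu_watts=1500)

if __name__ == "__main__":
    print(f"total: {budget['total_w']} W")
    print(f"load factor: {budget['load_factor']:.1%}")
    print(f"PSU for 60% load: {budget['recommended_psu_w']:.0f} W")
```

The exact quotient is about 1967W, which the post rounds up to 1970W. Swapping in three cards instead of four (750W for the GPUs) shows how much the load factor drops with one card fewer.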

War story: I once operated a workstation with an 80% power supply loading (I hadn’t been able to find a properly sized PSU quickly). Every time I tried to start a particular CUDA application, the machine would reboot.

One reason you want a fair amount of headroom in your power supply is because the wattage ratings of components are not an indication of short-term peak usage. They refer to averaged power: terms like TDP = “thermal design power” are used. Brief (< 50 ms) peaks of 20% over the nominal power are not unusual. If the power supply does not have sufficient reserves, temporary voltage drops (“brown outs”) can occur which negatively impact the reliability of the system.
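To put a number on that for the system in this thread, here is a small sketch. It assumes, as a worst case that is not claimed in the post, that every component peaks 20% over nominal at the same instant; the 1180W nominal total and the 1500W rating are the figures from this thread, and `peak_load` is a made-up helper name.

```python
# Minimal sketch: worst-case short-term peak versus the PSU rating.
# Assumption: all components overshoot their nominal power simultaneously.

def peak_load(nominal_watts: float, psu_watts: float, overshoot: float = 0.20):
    """Return (peak watts, peak as a fraction of the PSU rating)."""
    peak = nominal_watts * (1.0 + overshoot)
    return peak, peak / psu_watts


peak_w, peak_fraction = peak_load(nominal_watts=1180, psu_watts=1500)

if __name__ == "__main__":
    print(f"peak draw: about {peak_w:.0f} W")
    print(f"share of PSU rating: about {peak_fraction:.0%}")
```

Under that assumption the brief peaks land at roughly 1416W, or about 94% of the 1500W rating, which leaves very little reserve.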

Running power supplies at high utilization will cause them to operate at higher temperatures, which will age their electronic components faster, eventually leading to faster failure of the power supply.

Really appreciate the responses, thank you both.

The power supply may very well be the issue: once I drop down to 3x GPUs, I’m finding it a lot harder to crash the system using Redshift and the latest NVIDIA drivers. It’s been very stable, with the odd moment where Windows will completely freeze for 10 or so seconds before coming back to life.

With 4 cards it could be overheating (which would have to happen very quickly, as with 4 cards I get the BSOD within 5-10 minutes of rendering), or it’s the power supply issue njuffa described from his experience.
I’m definitely leaning towards a power issue after the war story just told!

I increased the TDR to 60 seconds, but it didn’t stop the BSOD crashes with 4 GPUs.

A TDR watchdog timer event would not lead to a BSOD. Once the watchdog timer kicks in after two seconds of no GUI updates, it will trigger a reset of the graphics subsystem, destroying any CUDA contexts as part of that.

The screen typically goes black for a few seconds while the graphics stack re-initializes. After that, the GUI will be functional again. Any previously running CUDA apps will now be inoperable because their CUDA context is gone. If the apps perform proper CUDA error checking, they will detect the resulting CUDA error and terminate.

A BSOD is a much more serious event that takes down the entire operating system. By all means check the power supply and the thermals of both CPU and GPU. There may be a system log that captures the approximate reason for the BSOD (e.g. a machine-check error). I have not dealt with BSODs and Windows system logs in many years, so you might have to search the internet for what to look for.