AI programs on a Razer eGPU box with an NVIDIA GT 1030 stop with a CUDA error

My Hardware:
PC: Intel NUC (NUC7i7BNH) with a TB3 port
eGPU box: Razer Core X (used)
eGPU: ASUS NVIDIA GT 1030, 2 GB VRAM

My Software:
OS: Windows 10 22H2 / Linux (Ubuntu 22.04 + KDE, kernel version 6.2.0-39-generic). Both Windows and Linux were clean-installed just yesterday.
Software: AUTOMATIC1111 stable-diffusion-webui / oobabooga text-generation-webui / some 3D games

nvidia-bug-report.log.gz (118.9 KB)

Situation 1: Windows 10 + AI (image or text)

I can run AI programs (SD webui or text-generation-webui) to generate pictures or chat with an AI model.
With only 2 GB of VRAM on the eGPU, I can generate images of about 512*768, or a little bigger, without an “Out of Memory” error.
But when I generate pictures continuously, Windows freezes after about 1 to 3 hours. Once it is frozen I cannot get any error message, either through the Microsoft Remote Desktop app or by logging in with keyboard and mouse.

Situation 2: Linux (Ubuntu 22.04 + KDE) + AI (image or text)

I can also generate images with the AI programs on Linux, just as I do on Windows.
After running continuously for about 1 to 3 hours, the AI program reports an error. This does NOT crash Linux, but it stops Stable Diffusion.
The error message is:

RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When this error occurs, running nvidia-smi returns: No devices were found.
But the NVIDIA 1030 (as a VGA device) and GP108 (as an audio device) are still listed by “lspci”.
“nvtop” shows only the Intel Core GPU. (Before the error, it also showed the NVIDIA 1030 as GPU 0.)
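The symptom described here (the card is still enumerated on the PCIe bus, but the driver no longer sees it) can be told apart from a card that has dropped off the bus entirely. The following is a minimal sketch, not a definitive tool: `read_cmd` and `classify_gpu_state` are hypothetical helper names, and it assumes `lspci` names the card with the string "NVIDIA" and that `nvidia-smi -L` prints one line starting with "GPU " per visible device.

```python
import shutil
import subprocess


def read_cmd(cmd):
    """Return stdout+stderr of a command, or "" if the program is not installed."""
    if shutil.which(cmd[0]) is None:
        return ""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def classify_gpu_state(lspci_text, nvidia_smi_text):
    """Classify the eGPU state from the text of `lspci` and `nvidia-smi -L`."""
    # Assumption: the card shows up in lspci with "NVIDIA" in its description.
    on_bus = any("nvidia" in line.lower() for line in lspci_text.splitlines())
    # Assumption: `nvidia-smi -L` lists each visible device as "GPU <n>: ...".
    driver_sees_gpu = any(
        line.startswith("GPU ") for line in nvidia_smi_text.splitlines()
    )
    if on_bus and driver_sees_gpu:
        return "ok"
    if on_bus:
        return "on bus, lost by driver"
    return "not on bus"
```

On the affected machine this could be called as `classify_gpu_state(read_cmd(["lspci"]), read_cmd(["nvidia-smi", "-L"]))`, for example from a loop that logs the result once a minute so that the moment of failure is recorded.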

Situation 3: Windows 10 + some 3D games

No error.
I ran a 3D game for more than 10 hours. The usage shown in Windows Task Manager confirms that the eGPU was in use, and it stayed stable the whole time.
While the game is running, the “3D” graph on the Task Manager “GPU” tab shows high usage (90% to 100%), and “Cuda” is almost 0%.

————

I think the cause might be one of these:
* My Intel NUC is too old? (Its TB3 firmware cannot even be updated anymore.)
* The used Razer eGPU box?
* The used NVIDIA 1030?
* Compatibility issues with Linux?
* Am I deluding myself by running AI programs on 2 GB of VRAM?

But which one? Please help.

————
By the way, my Linux system language is set to Chinese, and I live in Japan.
The “Date:” field in the nvidia-bug-report.log file shows the date and time in Japanese style.

Thanks, and sorry for my poor English.

You’re getting a lot of PCIe errors after a while, which leads to the connection to the eGPU breaking down. In contrast to gaming, AI pushes a lot of data over the TB connection. So I guess either the TB chip in your NUC or its counterpart in the Razer enclosure is overheating after longer usage.

Thank you for your reply.
I have watched the temperature myself: the graphics card temperature shown in Windows Task Manager, even when running games, is 65 to 70 degrees Celsius (149 to 158 °F).

So, is there any command or program that can read the temperature of the TB chip in the NUC or of the chip in the Razer eGPU box?

In other words, is there any way (error log, etc.) to confirm which side of the hardware is overheating?
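On Linux, the temperatures the kernel knows about are exposed under `/sys/class/thermal`, with values in millidegrees Celsius. Whether the Thunderbolt controller in the NUC or in the enclosure appears there at all is not known from this thread; on many machines only the CPU package and a few board sensors are listed. With that caveat, a minimal sketch for dumping every available zone (the function names are hypothetical) could look like this:

```python
from pathlib import Path


def read_thermal_zones(root="/sys/class/thermal"):
    """Return {"<zone>:<type>": degrees Celsius} for every readable thermal zone."""
    readings = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone vanished or is unreadable; skip it
        readings[f"{zone.name}:{kind}"] = millideg / 1000.0
    return readings


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0
```

Logging `read_thermal_zones()` at intervals during a long generation run would at least show whether any sensor on the NUC side climbs steadily before the failure. It cannot say anything about the enclosure if the enclosure exposes no sensor.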

I plan to spend a small amount of money to replace some devices and test them, but first I want to know which device is faulty.

(Three additional points)
(1) I checked the dmesg output on my system. There are indeed a lot of PCIe errors logged there.
(2) In this discussion thread, it was also mentioned that the NUC7 and Intel’s TB controller produce frequent errors.

(3) I recall that some product reviews of the Razer Core X Chroma eGPU box mentioned that it was very unstable when used as a network data connection. (The Chroma is a newer generation than the box I’m using.)

I guess you should start by checking whether PCIe errors are also occurring while the GPU is still available, to check for a general issue, e.g. a flaky cable. Regarding overheating, I’d rather suspect the NUC due to its small size. Maybe just open it, clean it, and then leave it open, possibly even pointing an external fan at its interior. Then check whether you can run prolonged workloads.
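One way to do that check is to count the PCIe error lines in the kernel log at intervals while the GPU is still working. The sketch below assumes the usual kernel AER message format, `PCIe Bus Error: severity=..., type=...`; `count_pcie_errors` is a hypothetical helper name, and the text passed in would come from `dmesg` or `journalctl -k`.

```python
import re
from collections import Counter

# Matches kernel AER lines such as:
#   pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
AER_LINE = re.compile(
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def count_pcie_errors(log_text):
    """Count PCIe bus errors in kernel log text, keyed by (severity, layer)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            key = (match["severity"].strip(), match["layer"].strip())
            counts[key] += 1
    return counts
```

If the count of corrected errors keeps growing during normal operation, long before the GPU disappears, that points to a marginal link (cable, port, or controller) rather than a sudden one-off failure.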

After a period of observation, I think I have solved the instability of AI programs on the eGPU (at least on Linux).

Some posts mention that data transmission over the TB3 port is unstable under Linux:

https://forum.proxmox.com/threads/pcie-bus-errors-with-sonnet-twin-10g-sfp-thunderbolt-3-and-intel-nuc-12-extreme.134932/

https://gitlab.freedesktop.org/mesa/mesa/-/issues/7340

https://askubuntu.com/questions/1394924/35-gb-day-of-pcie-bus-error-severity-corrected-type-data-link-layer-in-sy

Some users mentioned that the following kernel parameter can avoid the PCIe transmission errors:

pcie_aspm=off

Afterwards, I asked ChatGPT what the parameter means and how to set it to off. ChatGPT told me that it is a power-saving setting (PCIe Active State Power Management), and that leaving it on may indeed cause unstable transmission.

I did this (actually I did two things: (1) upgraded Kubuntu (Ubuntu + KDE) to 23.10, (2) set pcie_aspm=off).
After restarting, I ran AI (stable-diffusion-webui) continuously, and so far it has kept producing images for about 12 hours (although my GT 1030 with 2 GB of VRAM still only manages one image every 3 minutes). It seems sustainable and stable, and there are no PCIe transmission errors in dmesg.
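The post does not say how the parameter was set. A common way on Ubuntu is to append `pcie_aspm=off` to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, run `sudo update-grub`, and reboot. Since a typo there fails silently, it is worth confirming after the reboot that the running kernel really received the parameter. A minimal sketch, with hypothetical helper names, that checks `/proc/cmdline`:

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

On the running system, `has_kernel_param(read_cmdline(), "pcie_aspm=off")` should return `True` once the change has taken effect.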

Thanks to generix for analyzing the error report above. This changed my focus from “Is there a problem with the NVIDIA driver version under Linux?” to “PCIe data transmission errors”, and I found the solution mentioned by other users on the Internet.

It looks like this was a Linux kernel issue and has been resolved. But I still don’t know why it crashes in Windows.

Should I mark this thread as “resolved”?

No.
After setting pcie_aspm=off, the system crashed again after about 22 hours of continuous operation. This crash not only stopped the AI program but also (about 1 minute later) caused a black screen and a dead Linux system.
After the NUC restarted this time:
KDE System Settings (TB3 page) does not recognize the Razer enclosure at all.
I can see the Intel JHL6340 controller chip in the lspci output, but nothing related to the Razer enclosure.
The boltctl command returns nothing.
The enclosure’s fan is spinning.
nvidia-bug-report.log.gz (172.4 KB)

I uploaded the error report again. If possible, could you please check whether the error messages are different this time? (Compared with the last report, where only the AI program crashed.)

The TB bridge chips are turned off, likely due to overheating. Did you already remove power from both the NUC and the enclosure?

Before the system failure last night, I did not unplug or kick out the power cord (that’s for sure).

After the problem occurred last night, I powered off the Intel NUC and restarted it. However, after restarting, the TB3 device could not be seen in Linux (KDE System Settings, TB3 page).

After that, I turned the power switch of the Razer enclosure off and back on. Then everything returned to normal…

I asked ChatGPT just now, and its suggestion was to clean the dust out of the port and replace the TB cable. I think this might be a feasible approach.

I think this might be a good suggestion.

My system crashed again about 31 hours later. I just cleaned the TB3 cable and discovered something.
The TB3 cable that came with the Razer Core X I bought from a second-hand store in Japan does not seem to be the original cable.
Pictures on some websites show that the original cable is 0.5 m long and looks like this:
https://9to5mac.com/2018/05/25/review-razer-core-x-egpu-mac-macbook-pro-best-external-graphics-video/

The cable I got is about 1.0 m long and comes from Cable Matters. It looks like this:

OK, this may be the cause of the PCIe transmission errors, so I plan to replace it.

I bought a new TB cable and swapped it in, and the system has now run for about half an hour.
I observed an interesting phenomenon.
Before replacing the cable, I was using the 2 meter (yes, 2 meter) Cable Matters TB3 cable that came with the Razer eGPU box from the second-hand store.

Now I use a Belkin 1 m TB4 cable.
The interesting things I observed are:
With the 2 m TB3 cable, when I ran stable diffusion webui on the eGPU (NVIDIA GT 1030), both GPU usage and VRAM usage were almost 100%.
Like this: the two lines representing usage overlap.

With the current cable (TB4, 1 m), GPU usage is 100% and VRAM usage is about 70%.

But the image generation speed is unchanged: at 512*768, one image is generated every 3 minutes.
……………………
I have never heard of a cable change reducing VRAM usage while an AI program is running.
Did I make a mistake somewhere?

I think I should follow up with an explanation.

Last time, I changed the cable from a 2 meter TB3 cable to a 1 meter TB4 cable.

However, after replacing the cable, the problem still occurred: running stable diffusion caused Linux to reboot after approximately 30 hours of operation.

Later, last Sunday, I bought an HP EliteBook 1030 G3 very cheaply at a second-hand market in Osaka. (The battery and screen were badly worn, but I need neither of them just to connect the eGPU and keep running AI programs.) And this laptop has two TB ports!

After replacing my Intel NUC (NUC7i7BNH) with this HP EliteBook 1030 G3, everything became OK.

Since the early morning of March 18, this HP laptop has been generating pictures at 3 minutes per picture under Windows 11 Pro, and has been running stock trading for 7*27 hours.

I haven’t tried running SD under Linux on the HP laptop yet.

It is certain that the problem lies with the Intel NUC, but I cannot tell whether it is a hardware (power supply, dust), firmware, or software problem.

Thanks.