I have what appears to be the exact same model as you. Going by the device ID you got in GPU-Z and the photo, it looks like an EVGA XC3 Ultra RTX 3090, which is exactly what I have. I got mine in person on launch day at 9AM at Micro Center, so I’ve been running the card on Linux about as long as any other consumer out there. I’ve had to RMA the card twice, but for completely unrelated issues (the first RMA was actually caused by my motherboard, and luckily EVGA’s warranty covered it; the second was because the third fan went bad).
I’ve never had these kinds of crashes that you’re experiencing, and while I’ve never run any sort of NN workloads, I’ve put it under full gaming workloads using the tensor cores and RT cores along with the CUDA cores, all at the same time.
As generix said, this absolutely looks like a motherboard issue. It’s not at all hard to imagine a 4-5 year old 1080 Ti never triggering a problem that the 3090 does. The difference in GPU power is huge: I believe that in plain rasterization alone the 3090 is more than double a 1080 Ti (plus it has gen 3 tensor cores and gen 2 RT cores, neither of which the 1080 Ti has).
If you can give me some workloads that trigger the issue for you and that I could easily run (I already have CUDA installed), I could run them and see what results I get.
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
This is also weird.
But yeah, if there are some benchmarks, workloads, or torture tests that trigger issues for you, let me know what they are and I can try them on my system. I have cuDNN already installed.
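
In case it helps us compare results, here is a rough sketch of how I’d check for that "Unable to determine the device handle" message while a workload runs. It just runs plain nvidia-smi and looks for the error line quoted above. The function names (find_handle_errors, poll_nvidia_smi) are made up for this post, and I’m assuming the message always has the same shape as the one you pasted, so treat it as a starting point rather than anything definitive.

```python
import re
import subprocess

# Matches the error line quoted above and captures the PCI bus ID and reason.
# Assumption: the message always follows the format seen in this thread.
HANDLE_ERROR = re.compile(
    r"Unable to determine the device handle for GPU\s*([0-9A-Fa-f:.]+):\s*(.+)"
)


def find_handle_errors(text):
    """Return (bus_id, reason) pairs for each device-handle error in text."""
    return [(m.group(1), m.group(2).strip()) for m in HANDLE_ERROR.finditer(text)]


def poll_nvidia_smi():
    """Run nvidia-smi once and return any device-handle errors it printed."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # nvidia-smi missing or hung; report that instead of crashing.
        return [("n/a", f"could not run nvidia-smi: {exc}")]
    return find_handle_errors(result.stdout + result.stderr)


if __name__ == "__main__":
    for bus_id, reason in poll_nvidia_smi():
        print(f"{bus_id}: {reason}")
```

Running it in a loop (or from cron) next to whatever workload crashes for you would at least tell us roughly when the card drops off the bus, and I can run the same thing here.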