Quadro RTX 6000 does not handle BF16? Please make an update

Hello,
I am using a Quadro RTX 6000, and I have a problem with generative AI.
I have run into at least two cases where my card could not handle things that other cards with less VRAM can do.
My card has 24 GB of VRAM, yet it cannot handle BF16 and cannot run some scripts.

I am specifically using the ComfyUI program.
Many of the models used are BF16 or need BF16, for example the “PuLID FLUX” model and the latest AI video-generation model, “Mochi”.

I talked with the creator of ComfyUI, and he suggested modifying the ComfyUI code to allow float16, so we commented out a line of code and allowed the program to use float16:

# Original line, commented out (bf16 or fp32 only):
#supported_inference_dtypes = [torch.bfloat16, torch.float32]
# Modified line, adding fp16 as an allowed inference dtype:
supported_inference_dtypes = [torch.float16, torch.bfloat16, torch.float32]

The result was that ComfyUI was able to run the BF16 model (with float16), but the video was corrupt at the end.

__
Maybe it is failing to handle BF16 and so falls back to FP32, which makes the generation ultra slow and inefficient; ultimately the card fails at the generation where a lower-VRAM card that can handle BF16 would not.
Which is a shame, isn’t it?
__

I need an upgrade, driver, or workaround that can make my card work with BF16. Can you do that, NVIDIA, please?
I mean, this is not a bad card, and it is still in use. It has 24 GB of VRAM after all. It deserves an update to handle this problem.
__

As for the other AI model, called “PuLID FLUX”, ComfyUI simply shows the error:

RuntimeError: expected scalar type Half but found BFloat16
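
(From what I understand, this is a generic PyTorch dtype-mismatch error rather than a driver message. A minimal sketch with made-up tensors that triggers the same kind of error; the exact wording varies by op and PyTorch version:)

import torch

# Mixing fp16 ("Half") and bf16 tensors in one operation raises a
# dtype-mismatch RuntimeError like the one above:
a = torch.randn(4, 4, dtype=torch.float16)   # "Half"
b = torch.randn(4, 4, dtype=torch.bfloat16)  # "BFloat16"
torch.mm(a, b)                               # raises RuntimeError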


Please help.

By the way, I don’t know if this is the right place or not.
I tried “pip install nvidia-cudnn-cu12”, and my card has the same restrictions with the Mochi model generation (the screenshot above).

What can be done, please? And tell me if there are other subforums where I should post this. Thanks.

BF16 support came in with the Ampere generation of GPUs. The Quadro RTX 6000 is a Turing GPU (the generation prior to Ampere). It does not natively provide support for BF16 calculations.
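
A quick way to check this from Python/PyTorch (a minimal sketch; the printed values are what I would expect on a Turing part):

import torch

# Turing reports compute capability (7, 5); native BF16 math requires
# compute capability (8, 0) (Ampere) or newer.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), (major, minor))  # e.g. Quadro RTX 6000 (7, 5)
print("native bf16:", (major, minor) >= (8, 0))       # False on Turing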

Yes, and I am wishing/asking for some workaround, like a patch specific to the Quadro RTX 6000 that could let it do BF16 somehow.
Please, NVIDIA.
There must be something that can be done, no?

Where can I send an official email to NVIDIA, begging them to make the Quadro RTX 6000 work with BF16 via some workaround, driver, or patch?

If you are requesting a change to CUDA behavior or documentation, the way to do that is by filing a bug.

The question is not fully hypothetical:
The Turing hardware seems to accept SASS instructions with BF16 and TF32 if they are provided as if assembled for Ampere. But if NVIDIA has not publicly supported them up till now, why should they do so now?

Perhaps there were hardware bugs or business reasons. At the very least, it would be a huge task for testing, updating libraries, etc. NVIDIA probably would not do it for a single GPU out there.

What I don’t understand is that this list of cards (How do I configure my NVIDIA GPU to support BF16? - Massed Compute)
shows only a tiny number of cards.
Yet all the bugs I ran into with this card, when using the software called “ComfyUI” to generate content with AI, occur on the Quadro RTX 6000 but not on the RTX 4090 (or even cards below the 4090).

The 4090 and other cards are NOT mentioned in that list of cards able to handle BF16; am I wrong? Or is that list wrong?

My question is: why are the RTX 4090 and lower cards able to run models that rely on BF16 and generate content correctly, whereas my card struggles?

The software I am using (ComfyUI) has the line of code shown in my first post.

The commented line was added as a suggestion.
__

So the theory I was given was that since my card cannot handle BF16, it falls back to FP32, and thus there is a huge spike in VRAM usage, and the program freezes/fails.
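
(Back-of-the-envelope numbers for that theory, assuming a hypothetical 10-billion-parameter model; the model size here is my own made-up figure:)

# Weights alone, ignoring activations and other overhead:
params = 10e9                # hypothetical 10B-parameter model
print(params * 2 / 2**30)    # fp16/bf16 weights: ~18.6 GiB, fits in 24 GB
print(params * 4 / 2**30)    # fp32 weights: ~37.3 GiB, exceeds 24 GB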

My concern is: why can other cards with LESS VRAM do things that my Quadro RTX 6000 cannot?

The RTX 4090 is based on the Ada Lovelace architecture, which is newer than the Ampere architecture, which in turn is newer than the Turing architecture that the Quadro RTX 6000 uses. The Quadro RTX 6000 was launched in August 2018; the RTX 4090 was launched in October 2022. Per the earlier comment, BF16 support was added with the Ampere architecture.

Make sure not to mix up the Quadro 6000 (Fermi), the Quadro K6000 (Kepler), the Quadro M6000 (Maxwell), the Quadro P6000 (Pascal), the Quadro RTX 6000 (Turing), the RTX A6000 (Ampere), and the RTX 6000 Ada Generation (Ada Lovelace). NVIDIA typically applies the “6000” name component to the fastest part in their professional workstation line of GPUs.

General comment on lists: They have a tendency to be incomplete or out-of-date, because most are manually maintained.

Because memory capacity is not the only relevant metric?


Here is what I see there:

System Requirements

Before configuring your NVIDIA GPU to support BF16, you’ll need to ensure that your system meets the following requirements:

  • NVIDIA Ampere or later GPU architecture

That looks like a very accurate description, to me. It lines up with the statement I already made in this thread.

Subsequent to that, there is this:

(e.g., A100, H100, L40, or A6000)

e.g. is “for example”, and it means, in this context:

“Here are some examples of cards that fit the previous definition (Ampere or later)”

It does not mean:

“Here is an exhaustive list of all cards that fit the previous definition”

As already stated, RTX40 series cards are Ada Lovelace generation GPUs, which is a GPU architectural generation that came after Ampere. Therefore it fits the definition provided (Ampere or later), even though it may not be in the non-exhaustive list. That list only provides a few examples, it does not provide every example. And “for example” does not in my experience mean “here is every example”.

Although simply switching the code from bfloat16 to float16 may work in some cases, it is not guaranteed to be identical or work in every case. The two formats are different, and in particular bf16 has more range (larger exponent space/storage) than f16, so it is less susceptible to overflow than f16. This could certainly lead to problems.
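
A small illustration of the range difference (a sketch; the rounded bf16 value is what I expect PyTorch to print):

import torch

x = torch.tensor(70000.0)     # above fp16's largest finite value, 65504
print(x.to(torch.float16))    # inf  (overflow in fp16)
print(x.to(torch.bfloat16))   # 70144. (in range for bf16, just less precise)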


Because memory capacity is not the only relevant metric?

I see.

The RTX 4090 is based on the Ada Lovelace architecture which is newer than the Ampere architecture, which in turn is newer than the Turing architecture that the Quadro RTX 6000 uses

I understand.

e.g. is “for example”, and it means, in this context:

“Here are some examples of cards that fit the previous definition (Ampere or later)”

Yeah, that makes sense. For some reason I thought that was the complete, exhaustive list.

Although simply switching the code from bfloat16 to float16 may work in some cases, it is not guaranteed to be identical or work in every case. The two formats are different, and in particular bf16 has more range (larger exponent space/storage) than f16, so it is less susceptible to overflow than f16. This could certainly lead to problems.

Yep, that’s why I obtained a noisy video (in that generation process).

But you know what’s crazy?
In the case of the repo “PuLID FLUX” (GitHub - ToTheBeginning/PuLID: [NeurIPS 2024] Official code for PuLID: Pure and Lightning ID Customization via Contrastive Alignment),
I am able to make it work with a “local” install if I follow the specific instructions (PuLID/docs/pulid_for_flux.md at main · ToTheBeginning/PuLID · GitHub).

BUT, if I try to run it INSIDE the ComfyUI program through a custom node, for example (such as: GitHub - cubiq/PuLID_ComfyUI: PuLID native implementation for ComfyUI),

I get problems related to BF16 (most likely); the error is:

RuntimeError: expected scalar type Half but found BFloat16

and it comes after this:

[screenshot of the preceding log output omitted]

Which is 100% due to my card.

__

So I am wondering.
How was my card (which is supposedly not able to handle BF16) able to run PuLID FLUX locally, while inside ComfyUI it would not work?

It’s so interesting. And I believe that if this were investigated, perhaps we could find a way to make the Quadro RTX 6000 work with some models that require BF16.

Perhaps it is the version of torch/CUDA used in the local install? (See the link above with the instructions for installing PuLID FLUX locally.)
Or perhaps PuLID inside ComfyUI was changed?
Or is it ComfyUI itself that has a problem?

I wonder.

That would be a question best addressed to the authors / vendors of those applications. I certainly had not even heard of this software until just now.

That’s why I inserted the URLs/links to these GitHub repos. I hope someone looks into it or investigates.
The author of ComfyUI simply suggested to “try to” upgrade “pip install nvidia-cudnn-cu12” and to modify the code to allow the program to use float16. The result, as seen in the screenshot from the first post, was a corrupt output. (That was for the video-generation model Mochi.) He himself does not know.

The authors of PuLID would not know, I think.

This is a question about the Quadro RTX 6000 because it is about how the card works in one program and not in another. Each author only knows about their own program and cannot explain the differential behavior of the card, whereas NVIDIA knows ALL/BEST about their card and could investigate!

The only question here that pertains directly to the Quadro RTX 6000 is whether it supports BF16 floating-point, and that has been answered: No, it does not.

Properties of GPU-accelerated software not produced by NVIDIA are generally off-topic in the NVIDIA developer forums, and certainly off-topic in this sub-forum which is dedicated to CUDA.

I understand you may be frustrated that you cannot access certain features of some software because you are running on an older-architecture GPU. The appropriate “workaround” is to acquire a newer GPU based on a more advanced architecture. Alternatively, learn to live with the limitations of your current hardware. As the 1960s song says: “You Can’t Always Get What You Want.”

Tell me if there is a better subsection of the forum and I will go to it.
I guess my post can be redirected to the issue that the card WAS actually able to run a program that needs BF16 (this one: PuLID/docs/pulid_for_flux.md at main · ToTheBeginning/PuLID · GitHub),
despite the card not handling BF16.
I would like to know/investigate:

  • Why?
  • How?
  • And whether we can find a workaround that works for future BF16 needs, for other people and similar setups.

Reminder: when this same project (PuLID) is incorporated into ComfyUI (GitHub - cubiq/PuLID_ComfyUI: PuLID native implementation for ComfyUI), it is no longer possible for the Quadro RTX 6000 to run it
(manual cast from bf16 to f16… etc.).

The possible differences are:
Different versions of torch/CUDA.

In the PuLID project (standalone) it is:
Python 3.10
PyTorch version: 2.5.0
CUDA version: 12.1

In ComfyUI:
Python 3.11
PyTorch version: 2.5.0+cu124
CUDA version: 12.4
(I will confirm again later, since I have not retried since the pip install.)
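
(To confirm those versions in each environment, something like this can be run in each Python interpreter:)

import sys, torch

print(sys.version.split()[0])   # Python version, e.g. 3.10 vs 3.11
print(torch.__version__)        # PyTorch version, e.g. 2.5.0 vs 2.5.0+cu124
print(torch.version.cuda)       # CUDA version PyTorch was built against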

The most likely reason is that PuLID has some workaround for cards that do not support BF16. Perhaps the workaround has to be activated with certain compile flags, which were not used in the ComfyUI version. That is why the authors of PuLID would know best how their program can work on a GPU without BF16 support.
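
For example, a common pattern for such a fallback looks roughly like this (a sketch only; I have not checked whether PuLID actually does this):

import torch

# Use bf16 only if the GPU natively supports it (compute capability >= 8.0);
# otherwise fall back to fp16 so pre-Ampere cards such as Turing still work.
if torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (8, 0):
    dtype = torch.bfloat16
else:
    dtype = torch.float16

weights = torch.randn(4, 4).to(dtype)  # placeholder for real model weights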


You might be right, yeah. I will try to investigate there more. I guess by posting here I hoped NVIDIA itself (anyone) could look into it.
I have seen them post “updates” mentioning generative AI by name, so I know there are people working on these products.

NVIDIA. Please make me some sort of driver that makes the Quadro RTX 6000 run BF16, anything. Even if it reduces its capability. I don’t know. You can do it.