Hello! I have the following problem: an A100 card is installed in a PC running Windows Server 2022. The card is detected and everything looks fine, but it refuses to do calculations. The error is like
Just an update. It turned out that the problem is not Display Mode at all; rather, cudaMallocAsync is not working on the A100. If I switch to cudaMalloc, everything runs fine. So now another question has come up: why is cudaMallocAsync not working on the A100?
A100 supports cudaMallocAsync. To determine if cudaMallocAsync is usable in a particular setting, please follow the instructions. The problem here may be that whatever that code is trying to do is not supported on Windows.
Many thanks for the reply! I just checked, and cudaDevAttrMemoryPoolsSupported is 0. That means cudaMallocAsync is simply not supported, correct?
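For anyone following along, the check discussed above can be sketched roughly as follows. This is a minimal sketch, not code from this thread: `memoryPoolsSupported`, `allocateOnStream`, and `freeOnStream` are hypothetical helper names. It queries `cudaDevAttrMemoryPoolsSupported` and falls back to plain `cudaMalloc`/`cudaFree` when the attribute is 0.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Returns true only if the query succeeds and the device reports support for
// stream-ordered memory pools, which cudaMallocAsync relies on.
bool memoryPoolsSupported(int device) {
    int supported = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &supported, cudaDevAttrMemoryPoolsSupported, device);
    return err == cudaSuccess && supported != 0;
}

// Hypothetical helper: stream-ordered allocation when pools are available,
// otherwise an ordinary (not stream-ordered) cudaMalloc.
cudaError_t allocateOnStream(void** ptr, std::size_t bytes,
                             cudaStream_t stream, bool usePool) {
    return usePool ? cudaMallocAsync(ptr, bytes, stream)
                   : cudaMalloc(ptr, bytes);
}

// Hypothetical helper: must be called with the same usePool value that was
// used for the matching allocation.
cudaError_t freeOnStream(void* ptr, cudaStream_t stream, bool usePool) {
    return usePool ? cudaFreeAsync(ptr, stream) : cudaFree(ptr);
}

// Typical use (device 0 is an assumption):
//   bool usePool = memoryPoolsSupported(0);
//   void* buf = nullptr;
//   allocateOnStream(&buf, 1 << 20, stream, usePool);
//   ... launch work on `stream` that uses buf ...
//   freeOnStream(buf, stream, usePool);
```

Note that the fallback path is not stream-ordered, and `cudaFree` may synchronize implicitly, so it is best kept out of hot loops.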
I’m very curious why it is not supported. Does it mean that the A100 is simply incapable of working asynchronously on Windows? I tried running TensorRT inference simultaneously in many CUDA streams. Performance improved, but only a little. The same test on, say, an RTX 4090 produced a much more substantial performance boost.
And if you don’t mind, another question: do you think an A100 can outperform an RTX 4090? For example, if used in MIG mode?
If cudaMallocAsync is not supported, does that mean other CUDA operations that take a CUDA stream as a parameter are affected too? For example, worse performance, or the stream parameter being accepted but ignored?
Is the A100 designed for Linux, so that using it on Windows is possible but not preferable and not optimal?
I’m not aware of any concerns like that. The big thing (IMO) to be aware of on Windows is being in TCC mode vs. WDDM mode (if you study your posting/output here, you will see the GPU is in TCC mode, so that is “good” - I don’t think an A100 could actually be in WDDM mode, but many other NVIDIA GPUs can be). Other than that, if cudaMallocAsync is not supported, then that should mainly have the obvious implication: stream-ordered memory allocation is not available.
Before cudaMallocAsync came along, whenever I taught CUDA I would always suggest getting certain kinds of operations out of what I call performance loops - the areas of code where work is being issued to the GPU. One of those things to avoid is cudaMalloc. If you can use cudaMallocAsync (and do it well/correctly) then this concern pretty much goes away. Therefore if cudaMallocAsync is not available, I would revert to my normal coding advice - if at all possible, do cudaMalloc operations up front, before getting into the “performance loops”, and re-use allocations as much as possible. It’s still good advice, in any CUDA programming setting, in my opinion.
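As an illustration of that advice, here is a minimal sketch of allocating once up front and re-using the buffer inside the loop. The `scale` kernel and the `processBatches` helper are hypothetical, made up for this example; it assumes every batch has the same size `n > 0`.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by `factor`.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Scales `batches` host arrays of `n` floats in place, re-using a single
// device buffer. Returns the first CUDA error encountered, if any.
cudaError_t processBatches(float* const* hostBatches, int batches,
                           int n, float factor) {
    float* dBuf = nullptr;
    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(float);

    // Allocate once, before the performance loop.
    cudaError_t err = cudaMalloc(&dBuf, bytes);
    if (err != cudaSuccess) return err;

    for (int b = 0; b < batches; ++b) {
        // No cudaMalloc/cudaFree in here: only copies and kernel launches.
        err = cudaMemcpy(dBuf, hostBatches[b], bytes, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) break;
        scale<<<(n + 255) / 256, 256>>>(dBuf, n, factor);
        err = cudaGetLastError();
        if (err != cudaSuccess) break;
        err = cudaMemcpy(hostBatches[b], dBuf, bytes, cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) break;
    }

    cudaFree(dBuf);  // Free once, after the loop.
    return err;
}
```

The synchronous `cudaMemcpy` calls keep the sketch short; the point is only that the allocation and the free sit outside the loop.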
Our GPUs are designed to work as well as possible in either Linux or Windows. A100 is not an exception. However, the OS is not something that NVIDIA has full control of, so limitations imposed by a particular OS are often things that cannot be worked around in CUDA. A big one is the one I mentioned already - WDDM vs. TCC. For anyone who is doing significant GPU computing work on Windows, I would always suggest TCC mode if possible, because WDDM creates a much more significant set of limitations. Those limitations are described in other places, and I don’t have a list to present here, but a big one is the limit on kernel execution duration that is present in WDDM and not in TCC.
I consider the differences between TCC operation and Linux operation to be pretty small, but they are obviously not zero - it seems we have a case right here, and I don’t know all the technical underpinnings of why cudaMallocAsync might be available in one setting and not in another.
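If you want to check the driver model from code rather than from tool output, a sketch along these lines may help. `DriverModeInfo` and `queryDriverMode` are hypothetical names for this example; the two attributes are queried through `cudaDeviceGetAttribute`. On Linux the TCC attribute is expected to read 0, since TCC is a Windows driver model.

```cuda
#include <cuda_runtime.h>

// Hypothetical result type for this sketch.
struct DriverModeInfo {
    bool ok;             // true if both attribute queries succeeded
    bool tcc;            // device is using the TCC driver model
    bool kernelTimeout;  // a run time limit on kernels is in effect
};

DriverModeInfo queryDriverMode(int device) {
    DriverModeInfo info = {false, false, false};
    int tcc = 0;
    int timeout = 0;
    if (cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, device)
            != cudaSuccess) {
        return info;
    }
    if (cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, device)
            != cudaSuccess) {
        return info;
    }
    info.ok = true;
    info.tcc = (tcc != 0);
    info.kernelTimeout = (timeout != 0);
    return info;
}

// Typical use (device 0 is an assumption):
//   DriverModeInfo m = queryDriverMode(0);
//   if (m.ok && m.kernelTimeout) { /* long-running kernels may be killed */ }
```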