The relationship between GPUDirect RDMA, GPUDirect P2P, NVidia IPC, NCCL, and NVSHMEM

Would it be correct to say that IPC, NCCL, and NVSHMEM are all APIs built on top of GPUDirect, and that GPUDirect RDMA is used for inter-node communications, while GPUDirect P2P is used for intra-node communication?

I’ve also seen some commentary that seems to suggest that Nvidia IPC (in the form of cuIpcOpenMemHandle) is used for single-GPU only, but between different host OS processes.


GPUDirect RDMA at the lowest level provides for data exchange “directly” between a GPU and a non-GPU device, such as a networking adapter or possibly an FPGA, in the same node. When we extend beyond the lowest level, especially with networking adapters, it can provide for inter-node communication.

GPUDirect P2P provides for “direct” GPU to GPU communication, within a node only.

IPC stands for inter-process communication. It’s not unique or specific to CUDA. CUDA IPC refers to the idea of making a device-side allocation (e.g. created with cudaMalloc) “visible” to multiple processes, generally within a node/OS instance.
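Roughly, the usage pattern looks like this (a minimal sketch, not a complete program; error checking is omitted, and the channel used to pass the handle between processes is up to you):

#include <cuda_runtime.h>

// Process A: create a device allocation and export an IPC handle for it
void exporter(void)
{
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 1 << 20);            // ordinary device allocation
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);    // handle is a small, copyable struct
    // send 'handle' to process B over any ordinary IPC channel
    // (socket, pipe, shared file, MPI message, ...)
}

// Process B: import the handle and use the same device memory
void importer(const cudaIpcMemHandle_t &handle /* received from A */)
{
    float *d_buf = nullptr;
    cudaIpcOpenMemHandle((void **)&d_buf, handle, cudaIpcMemLazyEnablePeerAccess);
    // d_buf now refers to process A's allocation; use it in kernels or copies
    cudaIpcCloseMemHandle(d_buf);
}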

NCCL is a communication library that is intended to support the types of collective communication you might see with MPI. For example, broadcast and all-reduce operations are supported by NCCL. NCCL attempts to find optimal collective communication patterns to make this as efficient as possible, having some knowledge both of MPI and of interconnects such as NVLink.
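As a rough illustration, a single-process all-reduce across the GPUs of one node might look like this (a minimal sketch assuming two GPUs; buffer initialization and error checking omitted):

#include <nccl.h>
#include <cuda_runtime.h>

int main(void)
{
    const int nDev = 2;                  // assumes 2 GPUs in this node
    int devs[nDev] = {0, 1};
    const size_t count = 1 << 20;

    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);  // one communicator per local GPU

    float *sendbuf[nDev], *recvbuf[nDev];
    cudaStream_t streams[nDev];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    ncclGroupStart();                    // group the per-GPU calls together
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}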

NVSHMEM is a collective operations library most similar to OpenSHMEM. It creates the idea of “shared memory” as a communication paradigm amongst multiple processes. Like MPI, it requires a bootstrap system, and has a library that is used to initialize communicators and to perform collective communication. The basic paradigm is that of shared memory which is addressable and accessible from multiple processes; this shared memory can be on the device (as if it were allocated via cudaMalloc), but otherwise has a similar use-intent to OpenSHMEM.
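A minimal sketch of the programming model (closely following the simple ring-shift example in the NVSHMEM documentation; it has to be launched through an NVSHMEM bootstrap such as nvshmrun or mpirun, one process per GPU):

#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// each PE writes its own rank into the symmetric buffer of the next PE
__global__ void simple_shift(int *destination)
{
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    nvshmem_int_p(destination, mype, (mype + 1) % npes);
}

int main(void)
{
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));  // one GPU per PE on this node

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *destination = (int *)nvshmem_malloc(sizeof(int));  // symmetric ("shared") allocation

    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);

    int msg = 0;
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("PE %d received %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}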

CUDA IPC does not depend per se on GPUDirect RDMA. GPUDirect RDMA can be used for multi-process communication whether those processes are on separate nodes or the same node.

CUDA IPC shares handles to memory or events that have an association with a particular GPU. CUDA IPC is not used within a single process; it has no use there and cannot be used that way.

CUDA IPC leverages some lower-level attributes from GPUDirect P2P. GPUDirect P2P (as used in e.g. cudaDeviceEnablePeerAccess or cudaMemcpyPeerAsync) provides for direct communication between GPUs, but not between processes.
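In runtime-API terms that looks roughly like this (a minimal sketch for a single process with two GPUs; error checking omitted):

#include <cuda_runtime.h>

int main(void)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 map GPU 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    size_t bytes = 1 << 20;
    float *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    if (can01 && can10) {
        cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    }

    // copy between the two GPUs; takes the direct P2P path when enabled,
    // otherwise the runtime stages the transfer through host memory
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaDeviceSynchronize();
    return 0;
}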


How can I check whether a GeForce-series GPU supports GPUDirect P2P? Or which GeForce series support GPUDirect P2P?

I don’t have any list.

If you have an actual setup, the final arbiter for P2P support is the result of cudaDeviceCanAccessPeer(). Other than that, I’m unaware of any list provided by NVIDIA.
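For instance, a quick check over all visible devices (a minimal sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b) {
            if (a == b) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, a, b);   // 1 if device 'a' can access device 'b'
            printf("GPU %d -> GPU %d : P2P %s\n", a, b, ok ? "supported" : "not supported");
        }
    return 0;
}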

But basically we can do what we did in the '90s on 386 and 486 machines, right?

Buy 128 GB or 256 GB of DDR5 RAM and turn it into a RAM disk, hahaha :-D

Use some special driver so it won't be visible to every process in Windows, and then put your fancy GPUDirect Storage to use. Instead of NVMe, we use a RAM disk.

What I hate is that you still don't have any direct GPU VRAM to RAM access without the CPU!!! WHY??? I want to be able to hot-swap AI/ML models to RAM and back. The world is so slow; in the end I need to implement everything myself, and I get older and older.

Since the dawn of CUDA, cudaMemcpyAsync() has been able to transfer data from the GPU to system memory using the GPU's DMA engine. Since 2015 or thereabouts, CUDA has supported unified memory: cudaMallocManaged(). If you insist on using the same physical memory for both CPU and GPU, you can do so with NVIDIA Jetson platforms.
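To make that concrete, here is a minimal sketch of "parking" a device buffer in pinned host memory and restoring it later (sizes and names are placeholders; error checking omitted):

#include <cuda_runtime.h>

int main(void)
{
    size_t modelBytes = 1ull << 30;          // e.g. 1 GiB of model weights (placeholder)
    float *d_model, *h_backup;
    cudaMalloc(&d_model, modelBytes);                     // weights live here while the model is active
    cudaMallocHost((void **)&h_backup, modelBytes);       // pinned host memory: DMA-able, not pageable

    cudaStream_t s;
    cudaStreamCreate(&s);

    // "hot-unload": copy the weights to host RAM via the GPU's DMA engine
    cudaMemcpyAsync(h_backup, d_model, modelBytes, cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
    cudaFree(d_model);                       // VRAM is now free for another model

    // ... later: "hot-reload" the model
    cudaMalloc(&d_model, modelBytes);
    cudaMemcpyAsync(d_model, h_backup, modelBytes, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);

    cudaFreeHost(h_backup);
    cudaStreamDestroy(s);
    return 0;
}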

What exactly are you trying to accomplish?

Thank you for the quick reply! I googled for hours and could not find anything for keywords like "offloading or caching GPU to RAM" or "DirectGPU to RAM", etc. I know there is support for NVMe and network devices.

So my current hardware is a Core Ultra 9 285K with AVX-512 VNNI (for Intel AI and DL Boost in INT8), plus the Intel AI NPU (also INT8), 128 GB of DDR5 RAM, and a GeForce RTX 4090 with 24 GB VRAM on PCI Express 4.0 (the motherboard can do PCIe 5.0).

Now even with quantized models there is never enough RTX VRAM. Also, there are so many opportunities, like ComfyUI workflows, that I want to explore. As a developer, I also want to host one or more AI coding agents locally. Also speech-to-text, text-to-speech, and so on.

I am developing a chat platform with AI assistants on a budget, so I focus mostly on my small country in Europe.

I read about the NVIDIA Triton server, and it focuses mostly on multi-GPU.

I want to develop an OpenAI-API-compatible gateway that lets you seamlessly communicate with commercial services like ChatGPT and Claude. But some projects are not fully compatible with the OpenAI API spec, and my gateway would act as a proxy and translate messages, so you could keep using the desktop or mobile app you are used to (they support configuring an OpenAI proxy) while choosing from many more models.

I now also have the Intel NPU, so I would like to explore OpenVINO.

I don't like vLLM (which NVIDIA Triton uses) because it is buggy and doesn't support the latest quantized models. I found out that DeepSeek v2 was running on SGLang (which was based on vLLM, but they have now switched to a custom implementation). SGLang fully supports the OpenAI API, including FIM (fill-in-the-middle) completion for DeepSeek v2, which is handy in VS Code with the Continue.dev extension.

I will also implement the same API as Ollama, but I will replace Ollama with SGLang. So model offloading will be better, it will use less VRAM, and it will be much faster.

The ultimate goal is to be able to quickly offload some memory from the GPU, like hot-swapping a server disk :D, and load a new model so you can start generating photos or videos. SGLang is in Python, so I would probably need to get the exact GPU memory address and send the data to RAM. It would stay in RAM; the size is okay. Then you could very quickly load old models from RAM back to the GPU, so RAM would serve as a short-lived model cache. The crucial thing is that this RAM should be isolated (I'm not sure pinning is enough), exclusively for the GPU and not shared with any OS process or the CPU.

SGLang has its own CUDA-based engine, so it does not support TensorRT, which I read is faster on NVIDIA hardware. I would like to explore TensorRT and maybe have a server hosting TensorRT as well - so I think in that case I could not use the archaic CUDA calls you propose.

I am a bit of a perfectionist, maybe even close to the autistic spectrum with my nitpicking :D, so I don't care how complex this will be. I don't care if I need to write custom UEFI firmware with Intel-provided binaries for the chipset and access RAM directly (I cannot access GPU VRAM directly from UEFI firmware because NVIDIA is not as open to DIY as Intel). Or I could write a custom Windows kernel driver that provides a RAM disk while hiding it from the OS, then use GPUDirect Storage and trick the NVIDIA driver into treating the RAM disk as NVMe. Then I would release it as open source. I already have the dashboard and the proxy routing the OpenAI stuff, so now is the right time to explore fast serving of multiple models (inference only, no training or machine learning) and optimize it for a single GPU. Then other AI artists could download my 1-click installer, replace their solution with my OpenAI gateway, and be more creative on a single GPU.

Before solving the software issues, first think about when and how much data is exchanged or copied, and over what interface that should happen. You cannot improve beyond the speed of the hardware interface.

PCIe? NVLink?

How much memory should the GPU have? How often would you exchange the data? How long would it take?

Hi, sorry for the delay, I was sick. But I did research a lot of material and documentation, and it came out very simple: I just need cuFile.cc / cuFile.cu. It is advertised as being in the CUDA Toolkit, but it is not there. It is not even on NVIDIA's GitHub.

Well, basically there is a community of researchers and AI artists - they complain on NVIDIA's GitHub as well as Microsoft's GitHub regarding some Windows version. I found out it is pretty simple; I even got an AIO POSIX-compatible library for Windows thanks to the Intel AI Toolkit. I don't strictly need GDS. There are many third-party frameworks already. Basically I have solved all issues for all dependencies, but the last missing piece is cuFile.cu / .cc, which is not in the CUDA Toolkit despite being advertised on the NVIDIA web site and across various channels.

I will build a Windows-specific multi-model serving tool, compatible with both the Ollama API and the OpenAI API, and give it to the Windows AI community; most of them have an RTX 4090, some even a 5090. So if I recompile everything and enable all features, the RTX will get a boost of around 20-30%, we will run some benchmarks, and NVIDIA will be far ahead of the competition.

But now it is up to you how the story will continue :-)

Have a nice day!

Could you link to or describe where those files are mentioned?
Here? Magnum IO GPUDirect Storage | NVIDIA Developer

Have you downloaded one of the mentioned versions?

Or is it strictly about the Windows version?

I am in the office now and have a meeting. So you don't want to provide this one file that is present in the Linux package? I think that would break a few laws here in the European Union.

At least CUDA 12.6 on Linux includes cuFile. There is $CUDA_HOME/include/cufile.h and the corresponding library files in $CUDA_HOME/lib64/libcufile*.so.
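For reference, the basic cuFile call sequence on Linux looks roughly like this (a minimal sketch assuming GDS is installed and the file lives on a supported filesystem; the path and size are placeholders, error checking is omitted, and you link with -lcufile):

#include <fcntl.h>
#include <unistd.h>
#include <cufile.h>
#include <cuda_runtime.h>

int main(void)
{
    cuFileDriverOpen();                            // initialize the cuFile driver

    int fd = open("/data/model.bin", O_RDONLY | O_DIRECT);

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;    // Linux fd-based handle
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    size_t bytes = 64 << 20;                       // 64 MiB (placeholder size)
    void *d_buf;
    cudaMalloc(&d_buf, bytes);
    cuFileBufRegister(d_buf, bytes, 0);            // optional: register the GPU buffer

    // read straight from storage into GPU memory (no host bounce buffer on a GDS path)
    cuFileRead(fh, d_buf, bytes, /*file_offset=*/0, /*devPtr_offset=*/0);

    cuFileBufDeregister(d_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cudaFree(d_buf);
    cuFileDriverClose();
    return 0;
}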

cuFile is part of GPUDirect Storage (GDS).
GDS requires Linux.

I apologize in advance that part of this post might not seem positive, but I will present a solution at the end. I believe this has become a long-standing issue, and I happen to have ended up in the role of a mediator by accident; the stance should now be stronger in communicating where the borderline is for customers (for me, mainly EU citizens). Even if it might not sound positive at the start, I will try to present some options at the end to prevent further escalation, show a willingness to solve this in a win-win manner, and de-escalate so we can all continue living a peaceful life:

@Robert_Crovella Hi Robert, no, you are not right. You are lying now. And I am very sensitive when someone lies (Asperger's).

Next week I'm launching an innovative STEM education AI portal with assistants, so I did not have time to find all the places where you mention that cuFile is not part of GDS and Magnum IO, but I found at least two relevant things:

There is a clear mention that cuFile is a standalone, independent user library, and even mention of a future plan to merge it into cuda.h. There is no mention of GDS, and the same information is spread across three web pages on your domain.


Here you can see libcufile*.deb and no GDS - that is separate. So there is no mention in the docs and no connection in the deployment packages.

Please, I know I came here angry, but now I'm trying to find a solution. I will propose one at the end of this post, but first we should analyze what is happening so we have a better understanding of the context and the potential risks of escalation. We should de-escalate. I understand you need to represent stakeholders and protect their privileges; this was just an unlucky example. It seems to be more of a business-oriented approach than a scientific analysis.

Magnum IO was the first implementation, but recently your company has evolved toward new proposals for standalone GPUDirect functionality, which targets a different market than Magnum IO.

In the current deployment, cuFile.so is located at /usr/local/cuda-12.5/targets/x86_64-linux/lib, i.e. under the cuda-12.5 toolkit. There is no mention of Magnum IO and no connection with GDS. Furthermore, the cuFile license and changelog nowhere mention Magnum IO or GDS; the only contact given is the cudatools team:
libcufile (1.10.1.7-1) stable; urgency=low

  • Automatic Debian package build

– cudatools cudatools@nvidia.com Thu, 06 Jun 2024 10:53:01 +0000


The documentation, license, and comments in the header files for GDS clearly state that GDS is only a wrapper around cuFile. From a software engineering point of view, if we use OOP and design patterns, GDS could have several implementations and wrappers and extend functionality with new implementations, rather than wrapping roughly four method calls around cuFile.

I would also like to point out something I read somewhere (I don't remember exactly where) stating that cuFile uses POSIX for reads and writes, and that this is actually a bottleneck with a performance impact. So, if I had the opportunity to implement Windows support for you, for free, I would not use this POSIX approach, and my implementation with further enhancements would in fact be faster than the Linux implementation.

Next, in cuFile.h you already implement some kind of Windows support:
CU_FILE_HANDLE_TYPE_OPAQUE_WIN32 = 2, /*!< Windows based handle */

union {
    int fd;        /* Linux   */
    void *handle;  /* Windows */
} handle;

I don't want to be offensive; I'm just doing analysis and pointing out facts that are already there. Now let's continue with the license information:

You claim that creating the SDK cost you quite significant financial expense, that it is commercial computer software with commercial documentation, and that you are giving it out for free.

Here I would like to point out that I now need to do the same, and instead of doing actual work or spending time with family, I need to solve a lot of issues only with you. Other PyTorch CUDA dependencies and corporations like Microsoft or Intel do not create so many technical issues. So I also do it for free, and I am willing to share it solely with you, and you can use your license. But this will cost me around $5000.

I would like to point out that your license contains a list of third parties, and it seems much of your code is taken from universities, so in the end you might not have such significant financial losses; in the naive case it could be just Ctrl+C and Ctrl+V and gluing everything together:

Licensee’s use of the GDB third party component is
subject to the terms and conditions of GNU GPL v3

Licensee’s use of the Thrust library is subject to the
terms and conditions of the Apache License Version 2.0
In addition, Licensee acknowledges the following notice:
Thrust includes source code from the Boost Iterator,
Tuple, System, and Random Number libraries.

Licensee’s use of the LLVM third party component is
subject to the following terms and conditions:
University of Illinois/NCSA
Open Source License

Licensee’s use (e.g. nvprof) of the PCRE third party
component is subject to the following terms and
conditions:
University of Cambridge Computing Service,
Cambridge, England.
Copyright (c) 1997-2012

STACK-LESS JUST-IN-TIME COMPILER
Copyright(c) 2009-2012 Zoltan Herczeg
All rights reserved. (Hungary)

THE C++ WRAPPER FUNCTIONS
-------------------------
Contributed by: Google Inc.
Copyright (c) 2007-2012, Google Inc.
All rights reserved.

Some of the cuBLAS library routines were written by or
derived from code written by Vasily Volkov and are subject
to the Modified Berkeley Software Distribution License as
follows:

Copyright (c) 2007-2009, Regents of the University of California

Some of the cuBLAS library routines were written by or
derived from code written by Davide Barbieri and are
subject to the Modified Berkeley Software Distribution
License as follows:

Copyright (c) 2008-2009 Davide Barbieri @ University of Rome Tor Vergata.

Some of the cuBLAS library routines were derived from
code developed by the University of Tennessee and are
subject to the Modified Berkeley Software Distribution
License as follows:

Copyright (c) 2010 The University of Tennessee.

Some of the cuBLAS library routines were written by or
derived from code written by Jonathan Hogg and are subject
to the Modified Berkeley Software Distribution License as
follows:

Copyright (c) 2012, The Science and Technology Facilities Council (STFC).

All rights reserved.

Some of the cuBLAS library routines were written by or
derived from code written by Ahmad M. Abdelfattah, David
Keyes, and Hatem Ltaief, and are subject to the Apache
License, Version 2.0, as follows:

 -- (C) Copyright 2013 King Abdullah University of Science and Technology

Some of the cuSPARSE library routines were written by or
derived from code written by Li-Wen Chang and are subject
to the NCSA Open Source License as follows:

Copyright (c) 2012, University of Illinois.

Some of the cuRAND library routines were written by or
derived from code written by Mutsuo Saito and Makoto
Matsumoto and are subject to the following license:

Copyright (c) 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima
University. All rights reserved.

Some of the cuRAND library routines were derived from
code developed by D. E. Shaw Research and are subject to
the following license:

Copyright 2010-2011, D. E. Shaw Research.

Some of the Math library routines were written by or
derived from code developed by Norbert Juffa and are
subject to the following license:

Copyright (c) 2015-2017, Norbert Juffa
All rights reserved.

Licensee’s use of the lz4 third party component is
subject to the following terms and conditions:

Copyright (C) 2011-2013, Yann Collet.
BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)

The NPP library uses code from the Boost Math Toolkit,
and is subject to the following license:

Boost Software License - Version 1.0 - August 17th, 2003

Portions of the Nsight Eclipse Edition is subject to the
following license:

The Eclipse Foundation makes available all content in this plug-in
("Content"). Unless otherwise indicated below, the Content is provided
to you under the terms and conditions of the Eclipse Public License
Version 1.0 ("EPL").

Some of the cuBLAS library routines uses code from
OpenAI, which is subject to the following license:
The MIT License

Licensee’s use of the Visual Studio Setup Configuration
Samples is subject to the following license:

The MIT License (MIT) 
Copyright (C) Microsoft Corporation. All rights reserved.

Licensee’s use of linmath.h header for CPU functions for
GL vector/matrix operations from lunarG is subject to the
Apache License Version 2.0.

The DX12-CUDA sample uses the d3dx12.h header, which is
subject to the MIT license .

I don't want to show disrespect, but it seems a lot of open-source licenses have been used in your SDK, and I think there are not that many files in contrast to this number of licenses. Maybe it would be helpful to also document which parts of the code are actually yours alone.

So, I'm done with nitpicking, but it was fair and necessary to provide some analysis for all parties here.

IMPORTANT FACT: Anyone who asks about cuFile gets no response for 3-5 years. I'm glad some discussion was opened here, but it seems there is a will to steer the conversation. We have not gotten to the point, and NVIDIA has never provided an explanation or a reason why cuFile has a special place in your SDK and is so super secret that a university, your open-source community, or even commercial entities cannot have access to it in the same manner as other CUDA Toolkit files. In cuFile there is already a hint of Windows support, so it would not take much effort. Nobody here knows the reason, but if it is not viable for you, or you need to focus on something else, I think that could be accepted. Perhaps someone from the community could help you finish it under an NDA, contract, or sublicense and share the result of the work with you, so you could benefit from reaching more markets and would not have tens of angry scientists, Ph.D. teachers, and developers.

Imagine: these people mostly invest in an RTX 4090, and now they are sharing photos of upgrading to an RTX 5090. So it is not cheap, and these people are not teenagers. But they feel it is not fair that, because of one cuFile, they cannot use your expensive hardware to its full potential. Nobody needs the GDS wrapper; you can keep it. Everyone wants only cuFile.

In the docs you mention that GDS and cuFile are aimed at data centers and clouds. We fall into that category: we have university or on-premise clouds, mostly for initial development and observations, and they run Windows Server. Then, if we need to scale up, we go to a cloud from a third-party provider, and at that point it is not a big problem to use Linux.

Why is there so much tension in the professional community regarding Windows + WSL2 for CUDA? Simply because when you develop, imagine you have 20 Python venv or conda virtual environments. You don't want PyTorch inside WSL2 because:

  • the shared filesystem is a problem
  • when some Python PyTorch app spawns a web UI on some port, it is extra work to make it accessible to the host OS - very inconvenient
  • on top of that, if PyTorch provides some GUI with dialogs (we are still in the development and testing phase), it is very cumbersome to render a Linux GUI from inside WSL2 on Windows. Mostly it means you cannot have a headless Ubuntu and you cannot even use WSL2. The standard approach, from what I know, is that you need to purchase X410 and use Hyper-V instead of WSL2; then you are able to redirect just one window from X11 to Windows using sockets.
  • during development on Windows you also help with or analyze third-party solutions, which mostly use an older Python or an older CUDA Toolkit. In Windows you can easily switch the priority of the CUDA version in the Environment Variables dialog. In Linux I don't know how difficult it would be, but if WSL2 uses direct DMA GPU access to the host OS, you now need to make these changes in two places, in Windows as well as in Linux

Therefore, if everyone could develop using VS Code and local PyTorch, and just route calls from the libraries that call DLLs into the WSL2 container's .so shared libraries, it would be much better. If the WSL2 CUDA setup had an implementation with all the proper dependencies, that would mean a performance gain of around 20-30%. So the WSL2 Ubuntu container should not mean you have to move your whole development environment inside it. The container you provide with WSL2 Ubuntu has cuFile, and because of WSL2 it was never really meant to be used in the cloud, right? This approach, design, and architecture is therefore wrong and misleading. WSL2 Ubuntu seems to be a special or edge case, and it should act as a supporting backend only - used from the outside, with its only purpose being the CUDA Toolkit.

Now I have a few options in mind, and you can choose the one that would be best for you with the least effort:

  • the people I talk with mostly work at universities across Europe and even have Patreon or YouTube channels to share knowledge and to spread and promote NVIDIA technologies and products; I personally am working on a STEM education platform - like ChatGPT but with many individual assistants, including a real-time avatar with microphone communication in many European languages. If we formed an association of universities, would it be possible to ask you to license cuFile for non-profit use?

  • Microsoft GPUDirect: would it be possible to have a 1:1 replacement of cuFile, and possibly GDS, with Microsoft GPUDirect? Windows 11 shows that our cards have it enabled, and it is supported out of the box

  • Until we find some solution like licensing cuFile, adding it to the CUDA Toolkit for Windows, or letting someone finish the Windows implementation - my idea for how not to make everyone angry and to stop them complaining here on the forums or on GitHub - I brainstormed various solutions with AI, and one of them could look like this:
    ChatGPT (GPT-4o) would read the API documentation for the header outline on your website and generate stubs. These stubs would then be cross-compiled for both Linux and Windows, but placed only on Windows, somewhere on the path of the PyTorch virtual environment. When someone calls any of these libraries from the Windows dev environment (VS Code + PyTorch), the stub would act as a proxy (in the design-pattern sense), and the function call with all its parameters would be routed to the WSL2 container (either over SSH or shared memory). The result would then be returned to the host OS (Windows) and to PyTorch. Developers would therefore focus on Windows only, and the CUDA Toolkit in WSL2 would act as a background service. This would provide a seamless workflow, and they shouldn't even have to know about it, because it would be implemented that way. So nobody would get angry anymore :-D
    The high-level outline of this solution is here: Project: Phantom CUDA Bridge · NANOTRIK-AI · Discussion #1 · GitHub

Thank you and have a nice day! And I apologize that I probably made you stand up from your chair a few times :-D
But now you know why it is not good to ask me where I read some claim: I have a good memory, but it is difficult to find some nested page, and I am also a perfectionist, so I get stuck in a loop and it takes me 2-3 hours to gather some data and write a response.
The purpose of the more aggressive stance was to express the various frustrations of several people. It is midnight and I need to work on some AI now.

Have you confirmed that cuFile does not use any Linux-specific (e.g. driver or kernel) features? You mention libmount and libnuma as direct library dependencies; even libcuda could contain some Linux-only features.

Even if it can be made to run on Windows, there is still all the testing effort and quality assurance stuff.

The Linux documentation mentions that you need to have the nvidia-fs.ko kernel module installed.

Libraries often have some extra ways to communicate with file systems and do not always need included libraries.
E.g. ioctl(2) - Linux manual page (not saying cuFile uses those).

If you still say that the module is not needed, then first try to get it to run on Linux without it. The documentation says cuFile will fall back to a compatibility mode that copies the data through the host. But you want RDMA.

Yes, only POSIX open/close for files and sockets. Two imports. I last wrote C++ a few years ago, so I had to look it up, and I have verified how it could be rewritten; it should be trivial.

Yesterday I read your WSL and CUDA docs, and they mention an attempt to provide some kind of shared memory between WSL and the host OS, but there was some bug or tech debt and it was postponed; that was 2-3 years ago. I think you might have had a similar idea, and that would make tighter integration possible. I think there was mention of sharing some interop between the host and guest OS.

RTX IO with Microsoft DirectStorage would also be a solution, even if some benchmarks say it is not fast enough. RTX IO is described as open source, but on GitHub there is only RTX AI, and I have not found any documentation or a concrete file in the Game Ready drivers that looks like RTX IO. I would be able to make a GDS-like wrapper over RTX IO, and it could be a seamless, plugin-like transition. In OOP design-pattern terms, this would probably be the Adapter pattern.


Oh, I remember some gamers reporting seeing rtxio.dll in a game directory even though it had not been enabled yet.

By the way, I have had my open-source helping moment, even if it is not always welcome, in PyTorch: Enable F16C instructions in CUTLASS? At least for the inductor backend? · Issue #132579 · pytorch/pytorch · GitHub

It was about your library: “Enable F16C instructions in CUTLASS?”

It has one crucial and, to put it politely, not very nice thing:

  if ((CMAKE_CXX_COMPILER_ID MATCHES "GNU") OR (CMAKE_CXX_COMPILER_ID MATCHES "Clang"))
    list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=-mf16c)
  elseif((CMAKE_CXX_COMPILER_ID MATCHES "MSVC"))
    list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/arch:AVX2)
  endif()

MSVC does not have a compile option for F16C; you get F16C if you enable AVX2.

F16C dates from around 2009: F16C - Wikiwand

F16C has been supported since Ivy Bridge (2012), and for the last 15 years I had:
"I have an i7-4930K; I have F16C support with AVX, but not AVX2."
(I'm also a retro gamer - Win98, 3dfx Voodoo 2 in SLI; just a few weeks ago I took out a loan and got a Core Ultra 9 285K and an RTX 4090)

So I could not run CUTLASS. I think MSVC shows a Bill Gates attitude, and as Elon Musk said: "Billie G is not my lover" :-D

So now it is recommended to use MinGW-w64 on Windows, and it is even more compatible with the Linux C++ toolchain, which is nice.

Crucially, you must not use the MinGW-w64 from SourceForge: MinGW-w64 - for 32 and 64 bit Windows download | SourceForge.net

You must use this newer fork:

It is more modern, compatible with C++11 (which I think I saw you also use), plus it has Ubuntu, Debian, and Fedora builds. For Windows it comes in several flavours, such as pure Win32/Win64 builds, but they also have POSIX-threads builds (which additionally include the Win32 API thread functions).

OK, that's it, bye.

You do not seem to have much insight into how industry in general, and NVIDIA in particular, creates libraries to build an ecosystem, such as the one that contributed much to the success of CUDA.

The cost of creating all that software is huge and consists mostly of paying large numbers (think thousands) of highly paid engineers, scientists, mathematicians, etc. More than half of all NVIDIA R&D staff work on software (according to public statements by the company's representatives), but obviously not all of them work on libraries.

NVIDIA does occasionally incorporate some third-party material into their libraries if it is under a suitable OSI-approved open-source license (a typical example would be a variant of the BSD license) and offers some significant benefit, but in general NVIDIA much prefers to “roll their own” code. NVIDIA diligently tracks all external material to remain compliant with the respective licenses, and the result is the list you inspected, which represents all third-party material incorporated over the past almost twenty years. Usually the contributions enumerated in the list are quite small.

I will give two examples. You can see my name on the list you reproduced above. This is for code of mine incorporated over the past decade which amounts to maybe ten or a dozen routines or so. This compares to several hundred of such routines that I wrote for NVIDIA during the nine years I worked on CUDA as their employee. You will also see a notice for code by Davide Barbieri. I remember integrating his code into CUBLAS in the very early days of CUDA, and if I recall correctly, it comprises two kernels, out of the many dozens of kernels that collectively make up the SGEMM functionality, where SGEMM represents one of a couple of hundred or so functions in BLAS (but a very important one, of course).

If you are somehow under the impression that NVIDIA largely cobbled their libraries together from publicly available pieces, such an assessment would be wide of the mark.


Thank you; that was not the main point of the post, just an observation based on the information I see in the code, in contrast with the missing cuFile.

Can we please move on and find some working solution together?

I think it is difficult to get an explanation of why cuFile is not present. I don't really care; I just care about a resolution. I proposed around three possible solutions.

Thank you!