Hi,
I have a application mode that uses UVA implementation of vLLM for CPU offloading on a GH200. When I run this, I see an unexplained process called UVM GPU1 BH, and also I do not see any profiled information about page faults using nsys profile --cpu-core-metrics=0,2,14 --gpu-metrics-device=all --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true --event-sample=system-wide
Nsys rep attachment
I’m not sure what the “UVM GPU1 BH” process is referring to since it shows high utilization. Has anyone encountered this?
Thanks!
The implementation doesn’t appear to be making use of managed memory (where page faults might occur):
I try to re-implement CPU offloading in a fully transparent way: we offload the tensor to CPU, and let GPU directly view it as GPU tensor. It depends on UVA technology (no clear documentation, but there’re some public discussions ), and per my discussion with nvidia experts, it works for systems with pinned memory.
I don’t have any info on the UVM GPU1 BH process, but it doesn’t appear to be unique to anything you’ve mentioned .
rajeshshashikumar:
I do see cudaHostAlloc()
cudaHostAlloc == pinned memory
I’m fairly certain pinned memory is not subject to demand-paged migration under any circumstance.
cudaMallocManaged is definitely the typical path to create an allocation that is demand-paged migratable (in many settings).
malloc() is typically not demand-paged migratable, but on GH200 it should be.
It looks to me like the vLLM implementation/PR did not have only GH in view, i.e. it was intended to be usable on “any” CUDA capable setup. That would be a probable motivation to use pinned memory for the objectives stated there. If they had used malloc() and then run it on a non-Grace system, it would just fail.
It may not be related to OP’s issue, but I think that HMM is enabled now on all platforms that have a compatible kernel (6.8+ iirc). We have an AMD/H100 system for which my basic HMM test works. On prior kernels it would fail.
// olmalloc_mode determines which allocator we use
// 0 is malloc (hmm only!) (default)
// 1 is cudaMallocManaged
static int olmalloc_mode = 0;
void* olmalloc(size_t bytes)
{
void* foo = nullptr;
if (!olmalloc_mode)
{
foo = malloc(bytes);
}
else if (olmalloc_mode == 1)
{
cudaMallocManaged(&foo, bytes);
}
else
{
fprintf(stderr, "invalid olmalloc mode %d\n", olmalloc_mode);
exit(1);
}
if (foo == nullptr)
{
fprintf(stderr, "olmalloc failed\n");
exit(1);
}
return foo;
}
__global__ void warmup()
{
return;
}
__global__ void inc(int* a, size_t n)
{
size_t i = blockDim.x * blockIdx.x + threadIdx.x;
for (; i < n; i += blockDim.x * gridDim.x)
{
a[i] += 1;
}
}
#define TPB 256
int main(int argc, char* argv[]) {
size_t N;
if (argc != 2 && argc != 3)
{
std::cerr << "Usage: " << argv[0] << " <size_t>\n";
return 1;
}
try
{
N = std::stoul(argv[1]);
if (argc == 3)
{
olmalloc_mode = std::stoi(argv[2]);
}
}
catch (const std::invalid_argument& e)
{
std::cerr << "Invalid argument: the input is not an unsigned integer.\n";
return 2;
}
catch (const std::out_of_range& e)
{
std::cerr << "Invalid argument: the input is out of range for a size_t.\n";
return 3;
}
const int blocks = calculateOptimalBlocks(inc, TPB);
int* a = (int*) olmalloc(sizeof(int) * N);
printf("Allocating %lu bytes\n", sizeof(int) * N);
printf("Allocating %lf gigabytes\n", sizeof(int) * N / 1e9);
printf("Kernel Config: %d, %d\n", blocks, TPB);
for (size_t i = 0; i < N; ++i)
{
a[i] = 3;
}
warmup<<<1, 1>>>();
cudaDeviceSynchronize();
auto start = std::chrono::high_resolution_clock::now();
inc<<<blocks, TPB>>>(a, N);
cudaDeviceSynchronize();
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
std::cerr << duration.count()/1e9 << std::endl;
std::cout << "runtime: " << duration.count()/1e9 << " seconds" << std::endl;
for (int i = 0; i < 10; ++i)
{
printf("a[%d] = %d\n", i, a[i]);
}
for (size_t i = 0; i < N; ++i)
{
assert(a[i] == 4);
}
}
tallen93@cci-hopper1:~/dev/hmm-eval/benchmarks/microbench/basic$ ./basic 50 1
Allocating 200 bytes
Allocating 0.000000 gigabytes
Kernel Config: 912, 256
0.000160561
runtime: 0.000160561 seconds
a[0] = 4
a[1] = 4
a[2] = 4
a[3] = 4
a[4] = 4
a[5] = 4
a[6] = 4
a[7] = 4
a[8] = 4
a[9] = 4
tallen93@cci-hopper1:~/dev/hmm-eval/benchmarks/microbench/basic$ ./basic 50 0
Allocating 200 bytes
Allocating 0.000000 gigabytes
Kernel Config: 912, 256
0.00033388
runtime: 0.00033388 seconds
a[0] = 4
a[1] = 4
a[2] = 4
a[3] = 4
a[4] = 4
a[5] = 4
a[6] = 4
a[7] = 4
a[8] = 4
a[9] = 4
yes, correct. With either HMM or ATS in effect, then host system ordinary allocator memory (malloc()) can be accessed from device code. ATS is the mechanism that enables Grace Hopper, and HMM may be enabled (in non-ATS cases) depending on how your system is set up .
Since this thread was about vLLM, it appears that PyTorch does not have Unified Memory support. So there is no UVM but only UVA and explicit memcopies.
It is still mysterious what the “UVM GPU BH1” process is carrying out.
There are multiple requests for UVM however there does not seem to be support yet.
opened 01:02AM - 24 Apr 24 UTC
feature
module: cuda
triaged
module: cuda graphs
module: CUDACachingAllocator
### 🚀 The feature, motivation and pitch
There is a growing need to mix-and-ma… tch cuda allocators in PyTorch for new NVIDIA architectures. For instance:
- Reductions on NVIDIA Switches ([NVLS Reductions](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html#nvlink-sharp-buffer-registration)), require buffers to have specific alignment (currently facilitated by ncclMemAlloc API). Buffers then need to be “registered” in the process group to finish the setup for the reduction.
- Extended GPU Memory (EGM) based all-gathers also require buffers to have specific alignment and optionally a NUMA location (can be specified by creating memory with cuMemCreate and CU_MEM_LOCATION_TYPE_HOST_NUMA).
Currently, a user cannot mark regions of pytorch code to use a different allocator, while using the default allocator for unmarked regions. The only two ways available (`BACKEND` environment variable, or `CUDAPluggableAllocator`) overrides the CUDACachingAllocator object globally, giving up the benefits of the CUDACachingAllocator.
We propose to expose Private Pools (in a more first-class manner) to the user land, along with the ability for the user to provide their own {allocator, deleter} functions to specify how the blocks in the pool should be allocated. In addition, users should be able to mark when to begin allocation to pool, when to end allocation to a pool, and finally when to destroy allocated pools.
## Proposed Approach
- A user can already create a private pool in PyTorch.
- `torch.cuda.graph_pool_handle()` provides a unique MemPool ID
- `torch._C._cuda_beginAllocateCurrentStreamToPool(...)` can use this ID and create a PrivatePool object in the CUDACachingAllocator managed by the `graph_pools` container.
- `torch._C._cuda_endAllocateCurrentStreamToPool(...)` can be used to mark when to stop allocating to a pool.
- `torch._C._cuda_releasePool` can be used to mark when the pool’s memory can be deleted or is safe to be reused by other consumers.
- We would like to encapsulate `torch.cuda.graph_pool_handle()` into a class:
```CPP
using CaptureId_t = unsigned long long;
using MempoolId_t = std::pair<CaptureId_t, CaptureId_t>;
using AllocFuncType = void*(size_t);
using DeleteFuncType = void(void*);
struct C10_CUDA_API MemPool {
CUDAMemPool(std::function<AllocFuncType> alloc_fn = {},
std::function<DeleteFuncType> delete_fn = {},
bool is_user_created=true);
// new members
std::mutex mutex_;
std::function<AllocFuncType> alloc_fn_;
std::function<DeleteFuncType> delete_fn_;
// mempool_id_ holding uid and uuid logic from graph_pool_handle()
static MempoolId_t mempool_id_;
}
```
- An “allocator” function pointer and “delete” function pointer is added (similar to CUDAPluggableAllocator). The atomic objects (`uid` and `uuid`) are changed to use a mutex so that we can have similar thread semantics for setting `alloc_fn_` and `delete_fn_`.
[`graph_pool_handle()`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAGraph.cpp#L19-L30) can then be modified to use the following to preserve functionality:
```CPP
auto new_pool = c10::cuda::MemPool();
return new_pool.mempool_id_;
```
- Additionally, [places](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAGraph.h#L31-L33) using `c10::cuda::MempoolId_t{0, 0}` can be changed to `c10::cuda::CUDAMemPool(false)`. `MempoolId_t{0, 0}` is used to track whether the pool is user created or not in CUDAGraphs [`capture_begin`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAGraph.cpp#L132-L143) implementation. Here, we can just change the logic to explicitly have a bool in the constructor of the MemPool, and never have a mempool_id that is {0,0}.
- Now that we have a MemPool object that can take an allocator and delete function pointers from the users, we can pass this info to the CUDACachingAllocator. We would like to add some function pointer members to BlockPool and PrivatePool, such that pools can use the function pointers when not null:
```CPP
BlockPool(bool small, PrivatePool* private_pool = nullptr,
AllocFnPtr allocator = nullptr,
DeleteFnPtr deleter = nullptr)
PrivatePool(AllocFnPtr allocator = nullptr, DeleteFnPtr deleter =
nullptr)
```
- We would create another container (`user_pools`) similar to [`graph_pools`](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L917-L919) in CUDACachingAllocator, to distinguish from CUDAGraph related pool usage.
- We would create another variable similar to [`captures_underway`](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L878-L883), such that we can pick between `graph_pools` or the `user_pools` in [`get_pool`](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2185-L2203).
- `beginAllocateToPool` in the CUDACachingAllocator can then create the PrivatePools with the allocator and deleter functions. [`alloc_block`](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2410-L2412) can be modified to use the function pointer:
```CPP
auto allocator = p.pool->allocator;
if (allocator) {
p.err = allocator(&ptr, size);
} else {
p.err = cudaMallocMaybeCapturing(&ptr, size);
}
```
- `release_block` can be modified similarly to use the deletor.
- Tensors can then utilize the user specified allocator and safely be deleted with the correct deletor.
- In a similar manner, we can modify [`endAllocateToPool`](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L1824C8-L1824C25) and [`releasePools`](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L1838).
- We can then provide a context manager similar to [`_use_cuda_memory_pool_manager`](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py#L497) or use the APIs directly in Python:
```PYTHON
import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import _get_default_group
import os
from cuda import cuda
import ctypes
def nccl_mem_alloc(size):
nccl = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libnccl.so")
assert nccl.ncclMemAlloc is not None
ptr = ctypes.c_void_p()
err = nccl.ncclMemAlloc(ctypes.byref(ptr), ctypes.c_size_t(size))
if err != 0:
raise RuntimeError(f"Failed to allocate memory with ncclMemAlloc with error code: {err}")
return ptr
def nccl_mem_free(ptr):
nccl = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libnccl.so")
assert nccl.ncclMemFree is not None
err = nccl.ncclMemFree(ctypes.c_void_p(ptr))
if err != 0:
raise RuntimeError(f"Failed to free memory with ncclMemFree with error code: {err}")
pool = torch.cuda.MemPool(nccl_mem_alloc, nccl_mem_free)
device = torch.device("cuda:0")
stream = torch.cuda.Stream()
dist.init_process_group(backend='nccl')
default_pg = _get_default_group()
backend = default_pg._get_backend(device)
with torch.cuda.mempool(pool, device, stream):
special_tensor = torch.randn(2**32, device="cuda")
# Use in distributed for NVLS reduction (pseudocode)
backend.register_user_buffers(pool)
# collective uses NVLS reduction
dist.all_reduce(special_tensor)
```
### Alternatives
- CUDAPluggableAllocator
- CUDAPluggableAllocator can be used to override CUDACachingAllocator and the use cases mentioned above can be successfully implemented. However, it takes over the allocator for the entire lifetime of the program and once it gets enabled there is no way to get back to CUDACachingAllocator. Even if we can go back to using CUDACachingAllocator, it won’t be safe as tensors wouldn’t know the correct deletor function to use.
- Subclassing a Tensor
- We could use a similar approach to FBGEMM_GPU where we [modify the Storage](https://github.com/pytorch/FBGEMM/blob/86ea895c6e680d03a59283b40ae614f16a1a10ae/fbgemm_gpu/src/memory_utils/memory_utils.cu#L87-L130) of the tensor and plug in the user specified allocator and deletor there. However, this approach doesn’t compose and generalize well, and can only be used for very specific use cases like a custom UVM based embedding kernel in TorchRec.
### Additional context
cc: @ptrblck @eqy @Aidyn-A @zdevito @kwen2501 @minsii
Relevant PRs:
- https://github.com/pytorch/pytorch/issues/51075
- https://github.com/pytorch/pytorch/pull/86786
- https://github.com/pytorch/pytorch/pull/112238
- https://github.com/pytorch/pytorch/pull/112850
cc @ptrblck @mcarilli @ezyang @eellison @peterbell10
opened 08:27PM - 12 Nov 24 UTC
feature request
### 🚀 The feature, motivation and pitch
Is there a plan to support unified memo… ry using VLLM?
### Alternatives
_No response_
### Additional context
_No response_
### Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
opened 02:23PM - 27 Dec 24 UTC
question
I have access to a GH200 gpu and I'm trying to do model pretraining but when ru… nning the pretrain command i get Cuda out of memory error because litgpt isn't using the available unified memory of the chip.
<img width="674" alt="image" src="https://github.com/user-attachments/assets/e24e48d1-7189-4372-93f2-b4788bcbc370" />
`torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 94.88 GiB of which 172.19 MiB is free. Including non-PyTorch memory, this process has 94.69 GiB memory in use. Of the allocated memory 91.40 GiB is allocated by PyTorch, and 2.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
Is there a way to use all the available memory?
opened 03:13AM - 29 Mar 20 UTC
module: internals
feature
low priority
module: memory usage
triaged
## 🚀 Feature
PyTorch Large Model Support (LMS) is a feature in the PyTorch pr… ovided by IBM here: [here (official IBM repo)](https://github.com/IBM/pytorch-large-model-support/) and [here (fork of the main maintener of LMS)](https://github.com/mtbrandy/pytorch) that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with "out-of-memory" errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.
With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results.
## Motivation
* When training recurrent models with back-propagation through time (BPTT) it is often useful to 'truncate' the sequence length as little as possible, especially when dealing with audio inputs or EEG data that have high temporal resolution. This results in a larger memory footprint, and this is where LMS can save the day.
* Also, the amount of compute needed to train state-of-the-art models doubles on average every 3.5 months (see https://openai.com/blog/ai-and-compute/). This comes both from the use of larger batch sizes and the use of larger models (like the now famous GPT-2 with 1.5B parameters). For instance the Transformer-XL can have a big memory footprint (https://openai.com/blog/sparse-transformer/). Using LMS is very useful when you want to test something out without using gradients checkpointing right away.
* LMS can be extremely beneficial to anyone who cannot afford access to high-end GPUs (within small startups or in academic research). Using cloud services or buying the Titan RTX ($2,499) to run models is often too expensive.
* GPU RAM is most of the time limited to about 8GB and is not extensible. Regular RAM on the other hand can easily be increased up to 128GB or more and is underused during trainings.
* Finally, LMS could be useful when smoke testing runs with small GPUs (either manually or within the context of a CI). This leaves the small (often older) GPUs still busy while the larger ones are used for real runs with or without LMS.
## Pitch (copy/paste from the doc of LMS)
One or more elements of a deep learning model can lead to GPU memory exhaustion.
These include:
* Model depth and complexity
* Base data size (for example, high-resolution images)
* Batch size
Traditionally, the solution to this problem has been to modify the model until it fits in GPU memory. This approach, however, can negatively impact accuracy – especially if concessions are made by reducing data fidelity or model complexity.
## Alternatives
Checkpointing can some sometimes helps. But that not always the cases...
## Additional context
This feature is maintained for a while (since at least PyTorch 1.1) by @mtbrandy and is proposed for contribution to PyTorch since at least August 2019 (I did not found any mention of it on this repo):
https://www.reddit.com/r/pytorch/comments/cgyppk/large_model_support_for_pytorch/
https://discuss.pytorch.org/t/very-large-model-on-single-gpu/28962
It is as well mentionned here:
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_pytorch.html
Official repos:
https://github.com/IBM/pytorch-large-model-support/
https://github.com/mtbrandy/pytorch
-----
I am basically creating this issue because I really like LMS. So far I have waited the support of LMS for each version of PyTorch. Each time I had to manually compile PyTorch (and create wheels) to have the support of it. (BTW, many thanks to @mtbrandy that still maintains this fork).
The thing that I am missing is why this feature has not been integrated in PyTorch even though the code is made by IBM (and maintained) :sweat_smile:.
I mean, it needs an "opt-in" from the user, so it is not enabled by default! If the reason is "it can reduce the speed performance". I agree with you, but it can also allows people to experiment more without the need of a super-expansive GPU. I really think that the community, small start-ups, students etc. would benefits from this even if they will surely not use that most of the time.