GPU out of memory when total RAM usage is only 2.8 GB

Hi everyone,
I hit an out-of-memory issue when I run a Caffe model (CMU OpenPose: GitHub - CMU-Perceptual-Computing-Lab/openpose, a real-time multi-person keypoint detection library for body, face, hand, and foot estimation) on a Jetson TX2.

Error:
syncedmem.cpp:71 Check failed: error == cudaSuccess (2 vs. 0) out of memory.

I switched the nvpmodel to mode 0.
tegrastats shows: RAM 2861/7854MB, with only one CPU at 100% and the GPU at most at 75%@1300.

When I run the same Caffe model on a laptop GPU, it uses at most 1.5 GB of GPU memory.

Thanks.

More Error information:
Memory need is 1931264
GPU is 0


Memory need is 860016
GPU is 0
Memory need is 55041024
GPU is 0
Memory need is 20952
GPU is 0
Memory need is 55041024
GPU is 0
syncedmem.cpp:73 Check failed: error == cudaSuccess (2 vs. 0) out of memory.

My modified syncedmem.cpp is:

inline void SyncedMemory::to_gpu() {
  check_device();
#ifndef CPU_ONLY
  switch (head_) {
  case UNINITIALIZED:
    std::cout << "Memory need is " << size_ << "\n";
    std::cout << "GPU is " << gpu_ptr_ << "\n";
    CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
    caffe_gpu_memset(size_, 0, gpu_ptr_);

More Error information:
Memory need is 1931264
GPU is 0
Total byte is 8235577344
Free byte is 4501069824


Memory need is 860016
GPU is 0
Total byte is 8235577344
Free byte is 4081430528
Memory need is 55041024
GPU is 0
Total byte is 8235577344
Free byte is 4079366144
Memory need is 20952
GPU is 0
Total byte is 8235577344
Free byte is 4024004608
Memory need is 55041024
GPU is 0
Total byte is 8235577344
Free byte is 4024004608
syncedmem.cpp:78 Check failed: error == cudaSuccess (2 vs. 0) out of memory.

My modified syncedmem.cpp is:

inline void SyncedMemory::to_gpu() {
  check_device();
#ifndef CPU_ONLY
  switch (head_) {
  case UNINITIALIZED:
    std::cout << "Memory need is " << size_ << "\n";
    std::cout << "GPU is " << gpu_ptr_ << "\n";
    size_t free_byte;
    size_t total_byte;
    cudaMemGetInfo(&free_byte, &total_byte);
    std::cout << "Total byte is " << total_byte << "\n";
    std::cout << "Free byte is " << free_byte << "\n";
    CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
    caffe_gpu_memset(size_, 0, gpu_ptr_);

Just a thought… the laptop likely uses dedicated memory on the video device, but Jetsons must use main system memory. Try enabling swap in the kernel if it isn't already enabled (check "/proc/config.gz" for "CONFIG_SWAP=y"), then add an SD card or SATA disk and create a swap file or format a partition as swap (the "swapon" command can point at either a loopback swap-formatted file or a partition formatted for swap… see "man mkswap").

There are other requirements for GPU memory, but adding swap might take some pressure off of physical RAM from other programs and make more available to the GPU.

Does the GPU need a contiguous allocation for each object? If so, fragmentation may also be a problem.

I think it does require contiguous memory… this is one of those "other requirements". Swapping out other uses of RAM may make a bit more available, but kernel command line options may be needed if larger allocations are failing because they are not contiguous. It is easy to try swap first, without any kernel command line options, to see whether that alone does the job… you'd probably still need swap anyway.
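Not an official test, just a sketch assuming the CUDA runtime API: you can binary-search the largest single cudaMalloc that currently succeeds. If that number is far below the "free" value reported by cudaMemGetInfo, fragmentation or a per-allocation limit is the more likely culprit than a plain lack of memory.

#include <stdio.h>
#include <cuda_runtime.h>

// Binary-search the largest single cudaMalloc that currently succeeds.
int main() {
    size_t free_byte = 0, total_byte = 0;
    cudaMemGetInfo(&free_byte, &total_byte);

    size_t lo = 0, hi = free_byte;
    while (hi - lo > (1 << 20)) {              // stop at 1 MB resolution
        size_t mid = lo + (hi - lo) / 2;
        void *p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) {
            cudaFree(p);
            lo = mid;                          // mid bytes fit in one piece
        } else {
            hi = mid;                          // mid bytes do not fit
        }
    }
    printf("free: %zu MB, largest single allocation: ~%zu MB\n",
           free_byte >> 20, lo >> 20);
    return 0;
}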

If you make sure to start your GPU process as early as possible in boot, and pre-allocate all the memory you will need, then you don't need any virtual memory and you avoid these problems.
This is something that's different about embedded compared to desktop PCs – you have full control, but you also have very fixed resources that you have to know how to manage.
This is very similar to targeting a game console, TBH.
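To illustrate the pre-allocation idea, here is a minimal sketch of a startup-time bump-allocator pool, assuming you know an upper bound on the device memory your process will need; the GpuPool name and the 2 GB figure are just examples, not part of Caffe or OpenPose.

#include <stdio.h>
#include <cuda_runtime.h>

// Minimal bump-allocator pool: one cudaMalloc at startup, sub-allocations
// are carved out of that block and never freed individually.
struct GpuPool {
    char  *base;      // start of the pre-allocated block
    size_t capacity;  // total bytes reserved at startup
    size_t offset;    // next free byte within the block

    bool init(size_t bytes) {
        void *raw = NULL;
        if (cudaMalloc(&raw, bytes) != cudaSuccess) return false;
        base = (char *)raw;
        capacity = bytes;
        offset = 0;
        return true;
    }

    void *alloc(size_t bytes) {
        size_t aligned = (bytes + 255) & ~(size_t)255;  // keep 256-byte alignment
        if (offset + aligned > capacity) return NULL;   // pool exhausted
        void *p = base + offset;
        offset += aligned;
        return p;
    }

    void destroy() { cudaFree(base); base = NULL; capacity = offset = 0; }
};

int main() {
    GpuPool pool;
    // Reserve 2 GB up front, as early in the process lifetime as possible.
    if (!pool.init(2ULL << 30)) {
        printf("could not pre-allocate the pool\n");
        return 1;
    }
    // Example sub-allocations, e.g. the blob sizes from the log above.
    void *a = pool.alloc(55041024);
    void *b = pool.alloc(1931264);
    printf("sub-allocations: %p %p, used %zu bytes\n", a, b, pool.offset);
    pool.destroy();
    return 0;
}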

Hi Superying,

Could you try to add some swap space?

# Create a swapfile for Ubuntu at the current directory location
fallocate -l 8G swapfile
# List out the file
ls -lh swapfile
# Change permissions so that only root can use it
chmod 600 swapfile
# List out the file
ls -lh swapfile
# Set up the Linux swap area
mkswap swapfile
# Now start using the swapfile
sudo swapon swapfile
# Show that it's now being used
swapon -s

Hi AastaLLL,

I tried your method, but it doesn't work. The error "F0616 03:17:41.486484 2017 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory" still occurs.

Hi,

I just checked my device; it can allocate 7131 MB at maximum.
Could you also check this in your environment?

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#define ONE_MBYTE (1024*1024)

void printMemInfo()
{
    size_t free_byte ;
    size_t total_byte ;
    cudaError_t cuda_status = cudaMemGetInfo( &free_byte, &total_byte ) ;

    if ( cudaSuccess != cuda_status ){
        printf("Error: cudaMemGetInfo fails, %s\n", cudaGetErrorString(cuda_status));
        exit(1);
    }

    double free_db = (double)free_byte ;
    double total_db = (double)total_byte ;
    double used_db = total_db - free_db ;

    printf("GPU memory usage: used = %.2f MB, free = %.2f MB, total = %.2f MB\n", used_db/ONE_MBYTE, free_db/ONE_MBYTE, total_db/ONE_MBYTE);
}

int main(){
    void *p[10000];

    int amount = 0;
    while(true){
        printMemInfo();
        cudaError_t rval = cudaMalloc( &p[amount], ONE_MBYTE);
        printf( "cudaAlloc( ..., %dMByte, ... ) returns %d\n", amount, rval );

        if( rval != cudaSuccess ) break;
        amount++;
    }
    for(int i=0; i<amount; i++)
        cudaFree(p[i]);
    return 0;
}

Hi AastaLLL,

I ran your code; it can allocate ~7 GB at maximum, but if I change ONE_MBYTE to ONE_GBYTE, it fails. I guess it can't allocate a large contiguous block of system memory.

Thanks.

Hi,

If you only need to allocate 2 GB of memory, it should be fine on TX2.
I just checked: the maximal chunk that can be allocated is 3800 MiB.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#define ONE_MBYTE (1024*1024)

void printMemInfo()
{
    size_t free_byte ;
    size_t total_byte ;
    cudaError_t cuda_status = cudaMemGetInfo( &free_byte, &total_byte ) ;

    if ( cudaSuccess != cuda_status ){
        printf("Error: cudaMemGetInfo fails, %s\n", cudaGetErrorString(cuda_status));
        exit(1);
    }

    double free_db = (double)free_byte ;
    double total_db = (double)total_byte ;
    double used_db = total_db - free_db ;

    printf("GPU memory usage: used = %.2f MB, free = %.2f MB, total = %.2f MB\n", used_db/ONE_MBYTE, free_db/ONE_MBYTE, total_db/ONE_MBYTE);
}

int main(){
    void *p;

    int amount = 0;
    while(true){
        cudaError_t rval = cudaMalloc( &p, long(amount)*ONE_MBYTE);
        printf( "cudaAlloc( ..., %dMByte, ... ) returns %d\n", amount, rval );
        printMemInfo();

        if( rval != cudaSuccess ) break;
        amount += 100;
        cudaFree(p);
    }
    return 0;
}

Hi everyone,

Thanks for your reply.

In my case, each time before the "out of memory" crash, the memory used is approaching 4 GB.
Based on AastaLLL's reply in this topic (https://devtalk.nvidia.com/default/topic/1004110/jetson-tx2/memory-for-gpu-so-small-/post/5169097/?offset=8#5172677),
NVIDIA currently limits these consecutive small allocations to a maximum of ~4 GB in total.
I think this is the reason in my case, and I will wait for their next release.

Hi,

To make it more clear:

For cudaMallocHost() and cudaMallocManaged():
The total allocatable amount is ~4 GB. More precisely, it is half the size of physical memory.

For cudaMalloc():
The total allocatable amount is ~8 GB, but each single allocation needs to be smaller than ~4 GB.
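As a rough illustration of those two limits (the sizes below are just examples, and the exact outcome depends on the release you are running): a single cudaMalloc above ~4 GB is expected to fail here, while the same total split into smaller pieces should succeed as long as enough memory is actually free.

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    // One 5 GB cudaMalloc: on the affected release this exceeds the
    // per-allocation limit and returns error 2 (out of memory).
    void *big = NULL;
    cudaError_t one = cudaMalloc(&big, 5ULL << 30);
    printf("single 5 GB allocation: %s\n", cudaGetErrorString(one));
    if (one == cudaSuccess) cudaFree(big);
    cudaGetLastError();  // reset the error state before continuing

    // The same 5 GB split into two pieces stays under the per-allocation
    // limit and should succeed as long as enough memory is really free.
    void *a = NULL, *b = NULL;
    cudaError_t ra = cudaMalloc(&a, 3ULL << 30);
    cudaError_t rb = cudaMalloc(&b, 2ULL << 30);
    printf("3 GB + 2 GB allocations: %s / %s\n",
           cudaGetErrorString(ra), cudaGetErrorString(rb));
    if (ra == cudaSuccess) cudaFree(a);
    if (rb == cudaSuccess) cudaFree(b);
    return 0;
}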

We already remove this limitation in our next release.
Thanks and sorry for the inconvenience.

Is there any way to steer the allocations so that the CPU allocates from the first 4 GB memory bank and the GPU allocates from the second 4 GB memory bank?

That way we could avoid the 4 GB split limitation.

For example, if cudaMalloc()/cudaFree() sees physical memory starting from the same address as the OS's malloc()/free(), then we cannot work around the limitation. But what if each device (GPU and CPU) could start allocating memory from a different starting point?

Is that possible to implement? I know that may require tweaking the OS where the memory management happens…

By the way, our application needs 2 GB allocated for the GPU and the rest for the CPU threads.

Hi,

This limitation comes from our CUDA driver. All memory used by the GPU needs to go through the CUDA driver.

By the way, on desktop GPUs it is possible to allocate memory via malloc() and then register it with the GPU using cudaHostRegister().
However, cudaHostRegister() is not supported on the Jetson platform,
since on ARM the caching attribute of an existing allocation can't be changed on the fly.
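For reference, this is roughly what that desktop-only pattern looks like (plain malloc() followed by cudaHostRegister()); as noted above, it is not expected to work on Jetson.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;   // 64 MB of ordinary host memory
    void *host = malloc(bytes);
    if (host == NULL) return 1;

    // Pin the existing allocation and map it into the GPU's address space.
    // On Jetson this is the unsupported call referenced above.
    cudaError_t err = cudaHostRegister(host, bytes, cudaHostRegisterMapped);
    printf("cudaHostRegister: %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess) {
        void *dev = NULL;
        cudaHostGetDevicePointer(&dev, host, 0);  // device-visible pointer to the same buffer
        printf("device pointer: %p\n", dev);
        cudaHostUnregister(host);
    }
    free(host);
    return 0;
}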

Please wait for our next release.
Sorry for the inconvenience.

hi,

Do you have any comment on when the next JetPack release will be available?

Thank you a lot for your quick response!!!

Hi,

Sorry, we can't disclose our schedule.
Please wait for our announcement and updates.

Thanks.

Hello. Which version is expected to fix the RAM allocation issue? We'll keep an eye on new releases so we can update immediately.

Thanks!

Hi,

We fixed this issue in the rel-28 branch and packaged it into JetPack 3.1.
Thanks.