Memory Error in TX2

Hi, everyone.
I create 4 threads in one process, and each thread uses the cudaMallocHost() API to allocate memory. The total allocated size is about 3.4 GB (less than 4 GB), and the process runs fine. But when I create 5 or 6 threads, so that the total allocation exceeds 4 GB, the allocation returns an 'out of memory' error.
I then tested with 2 processes, each creating 3 threads, for a total of about 5.6 GB (more than 4 GB). Both processes run fine.
My question is:
Why does allocating more than 4 GB in total return an 'out of memory' error in one process, but work fine across 2 processes? A rough sketch of my allocation pattern is included after the attachments below.
Thanks.
1.png
2.png
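
A minimal sketch of the allocation pattern (illustrative only; the thread count and per-thread size here are placeholders, not my exact test code):

#include <stdio.h>
#include <stdlib.h>
#include <thread>
#include <vector>
#include <cuda_runtime_api.h>

// Each worker pins 'bytes' of host memory with cudaMallocHost().
static void pinHost(void** ptr, size_t bytes)
{
    cudaError_t err = cudaMallocHost(ptr, bytes);
    printf("cudaMallocHost(%zu bytes) returned %s\n", bytes, cudaGetErrorString(err));
}

int main(int argc, char* argv[])
{
    int numThreads = (argc > 1) ? atoi(argv[1]) : 4;   // 4 threads ~= 3.4 GB total
    size_t perThread = 870ULL * 1024 * 1024;           // ~0.85 GB per thread

    std::vector<void*> buffers(numThreads, nullptr);
    std::vector<std::thread> workers;
    for (int i = 0; i < numThreads; i++)
        workers.emplace_back(pinHost, &buffers[i], perThread);
    for (auto& w : workers) w.join();

    for (void* p : buffers) cudaFreeHost(p);
    return 0;
}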

Are you running an old L4T?
This should be fixed in R28.1.
See this topic: https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/post/5170376/#5170376
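
If you are unsure which release is installed, the L4T version is recorded in /etc/nv_tegra_release on the device; the first line names the release, e.g.:

sudo head -n 1 /etc/nv_tegra_release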

Hi,

This is a known issue and is already fixed in rel-28.
Please reflash your device with JetPack 3.1.

Thanks.

Thank you for your reply.
I reflashed the device with JetPack 3.1 (R28.1), but the problem remains unsolved. I ran the same test demo; it still returns an 'out of memory' error with 1 process, while with 2 processes it is OK.

Hi,

Could you help us check the tegrastats output (in particular the reported RAM usage)?

sudo ~/tegrastats

Thanks.

I have uploaded the screenshots as attachments; please see above. Thank you.

Hi,

We tried to reproduce this issue with the following source code, but everything works correctly.

#include <stdio.h>
#include <stdlib.h>
#include <thread>
#include <vector>
#include <cuda_runtime_api.h>

#define ONE_MBYTE (1024*1024)

void printMemInfo()
{
    size_t free_byte;
    size_t total_byte;
    cudaError_t cuda_status = cudaMemGetInfo(&free_byte, &total_byte);

    if (cudaSuccess != cuda_status) {
        printf("Error: cudaMemGetInfo fails, %s\n", cudaGetErrorString(cuda_status));
        exit(1);
    }

    double free_db = (double)free_byte;
    double total_db = (double)total_byte;
    double used_db = total_db - free_db;

    printf("GPU memory usage: used = %.2f MB, free = %.2f MB, total = %.2f MB\n", used_db/ONE_MBYTE, free_db/ONE_MBYTE, total_db/ONE_MBYTE);
}

// Each thread pins 'amount' MB of host memory; the pointer is written back
// through 'ptr' so main() can free it after the threads join.
void allocate(void** ptr, int amount)
{
    cudaError_t rval = cudaMallocHost(ptr, (size_t)amount*ONE_MBYTE);
    printf("cudaMallocHost( ..., %dMByte, ... ) returns %d\n", amount, rval);
}

int main(int argc, char* argv[])
{
    if (argc < 2) {
        printf("Please enter the number of threads\n");
        exit(0);
    }

    int numThread = atoi(argv[1]);
    printf("Create %d threads\n", numThread);
    printMemInfo();

    std::vector<void*> p(numThread, nullptr);
    std::vector<std::thread> t(numThread);

    for (int i = 0; i < numThread; i++) t[i] = std::thread(allocate, &p[i], 800);  // 800 MB per thread
    for (int i = 0; i < numThread; i++) t[i].join();
    printMemInfo();

    for (int i = 0; i < numThread; i++) cudaFreeHost(p[i]);

    return 0;
}
nvcc topic_1023797.cpp -std=c++11 -o test && ./test 9

We create nine threads, and each thread allocates 0.8 GB (an approximation of your use case).
No error occurs, and we can allocate up to 7737.47 MB.

nvidia@tegra-ubuntu:~$ nvcc topic_1023797.cpp -std=c++11 -o test && ./test 9
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Create 9 threads
GPU memory usage: used = 1639.79 MB, free = 6210.92 MB, total = 7850.71 MB
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
cudaMallocHost( ..., 800MByte, ... ) returns 0
GPU memory usage: used = 7737.47 MB, free = 113.23 MB, total = 7850.71 MB

Could you help us check how to reproduce this issue with the above sample?
Thanks.