cuBLAS context host RAM footprint > 400 MB

Hey,

I’m using the massif tool (part of the valgrind toolchain) to profile host memory usage in my C++ code on Linux. I’m noticing that creating a cuBLAS context alone consumes more than 400 MB of host RAM, per the valgrind report. Hence a couple of questions: is this kind of memory footprint normal? And if so, is there a SIMPLE way to reduce it significantly?

Here’s my test code and run instructions (save as context-tst.cpp):

#include <cstdlib>
#include <iostream>

#include <unistd.h>

#include <cuda.h>
#include <cublas_v2.h>

int main() {
  cublasHandle_t cublasHandle;

  // Creating the handle also initializes the CUDA context if one does not exist yet.
  cublasStatus_t status = cublasCreate(&cublasHandle);

  if (CUBLAS_STATUS_SUCCESS == status)
    std::cerr << "CUDA BLAS context creation succeeded" << std::endl;
  else {
    std::cerr << "CUDA BLAS context creation failed, status: " << status
              << std::endl;
    exit(1);
  }

  // Pause so the footprint is easy to observe while the handle is alive ...
  sleep(5);

  cublasDestroy(cublasHandle);

  // ... and again after it has been destroyed.
  sleep(5);

  return 0;
}

To execute:

  1. Compile: nvcc -pg -lcublas context-tst.cpp -o context-tst
  2. Profile: valgrind --tool=massif ./context-tst
  3. Report: ms_print massif.out.<SUBSTITUTE_YOUR_PID_HERE> | less
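
As a quick cross-check that doesn’t depend on valgrind at all, the process’s resident set size can also be read from /proc/self/status around the cublasCreate call. A minimal sketch (Linux-only; readRssKb is a hypothetical helper I’m adding for illustration, and VmRSS measures resident memory rather than the heap total massif reports, so the numbers won’t match exactly):

#include <fstream>
#include <iostream>
#include <string>

#include <cublas_v2.h>

// Hypothetical helper: returns the VmRSS value (in kB) from /proc/self/status,
// or -1 if the field cannot be found. Linux-specific.
static long readRssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0)
      return std::stol(line.substr(6));
  }
  return -1;
}

int main() {
  std::cerr << "RSS before cublasCreate: " << readRssKb() << " kB\n";

  cublasHandle_t handle;
  if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;
  std::cerr << "RSS after cublasCreate:  " << readRssKb() << " kB\n";

  cublasDestroy(handle);
  std::cerr << "RSS after cublasDestroy: " << readRssKb() << " kB\n";
  return 0;
}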

The essence of the RAM allocation report is below:

[ms_print heap graph: total heap usage climbs steadily to a peak of 414.3 MB; x-axis spans 0 to 1.641 Gi instructions executed]

I’m also noting a bunch of warnings produced by valgrind:
>
> ==1303== Massif, a heap profiler
> ==1303== Copyright (C) 2003-2017, and GNU GPL’d, by Nicholas Nethercote
> ==1303== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
> ==1303== Command: ./context-tst
> ==1303==
> ==1303== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x27 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x25 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x17 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x19 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x49 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x21 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x1b with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x44 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> ==1303== Warning: noted but unhandled ioctl 0x48 with no size/direction hints.
> ==1303== This could cause spurious value errors to appear.
> ==1303== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
> CUDA BLAS context creation succeeded
> ==1303==
> ==1303== Process terminating with default action of signal 27 (SIGPROF)
> ==1303== at 0xD464F85: pthread_cond_timedwait@@GLIBC_2.3.2 (futex-internal.h:205)
> ==1303== by 0x1CB2B0EE: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.460.39)
> ==1303== by 0x1CC24C35: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.460.39)
> ==1303== by 0xD45E6DA: start_thread (pthread_create.c:463)
> ==1303==
> Profiling timer expired

Thank you in advance for your thoughts!

The discussion here is around host memory, not device memory. According to my simplistic testing with top, I observe approximately 0.0% host memory usage when a CUDA context is created, but approximately 0.3% of 192GB host memory used after the CUBLAS handle creation.
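
To reproduce that split programmatically rather than eyeballing top, one could sample the peak RSS at each step. A rough sketch (peakRssKb is an illustrative helper built on getrusage, not anything from CUDA; cudaFree(0) is used only as the usual idiom to force CUDA context creation before the cuBLAS handle exists):

#include <sys/resource.h>

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <iostream>

// Peak resident set size of the process so far; on Linux ru_maxrss is in kB.
static long peakRssKb() {
  struct rusage ru;
  getrusage(RUSAGE_SELF, &ru);
  return ru.ru_maxrss;
}

int main() {
  std::cerr << "peak RSS at start:           " << peakRssKb() << " kB\n";

  // Force lazy CUDA context creation without allocating anything.
  cudaFree(0);
  std::cerr << "peak RSS after CUDA context: " << peakRssKb() << " kB\n";

  cublasHandle_t handle;
  cublasCreate(&handle);
  std::cerr << "peak RSS after cublasCreate: " << peakRssKb() << " kB\n";

  cublasDestroy(handle);
  return 0;
}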

I don’t know why CUBLAS uses this memory on context creation, but I also doubt there are any simple ways to affect it. You might wish to file a bug.


Hey Robert,

> The discussion here is around host memory, not device memory.

This is correct. I should have mentioned it explicitly.

> I observe … approximately 0.3% of 192GB host memory used after the CUBLAS handle creation

Our numbers very roughly match, yours being a bit higher than mine: 0.3% × 192 GB ≈ 576 MB, versus the ~414 MB that massif reports for my run.

> You might wish to file a bug.

Per Robert’s suggestion, I filed: bug