Generic memory management & optimization

I am relatively new to CUDA, so please excuse me if I repeat questions that have been asked before. My work in CUDA requires processing huge files (~46 GB) that cannot fit into either host or device memory.

To implement the code, I was hoping to overlap the device-to-host copies using a separate stream and/or use texture memory. Am I on the right track?

I understand that CUDA is relatively low level, but is it possible to write generic code that allocates the largest available memory buffer?

Any article that discusses how to process huge data sets on the GPU would be helpful to me.

To answer your generic question with a generic answer: processing data that cannot fit into system or device memory, that is, out-of-core processing, is done by performing the work in chunks that do fit into memory. How you partition the data into chunks to facilitate reasonably efficient computation is typically domain specific; I would suggest consulting the literature relevant to your use case.
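As a minimal sketch of what such chunked processing can look like (this is my own illustration, not code from your use case; the chunk size, kernel, and file I/O placeholders are all hypothetical), you loop over chunks, copy each one to the device, launch a kernel, and copy the result back. A pinned host buffer and a stream let the copies run asynchronously:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: processes one chunk of floats in place.
__global__ void process(float* data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const size_t chunkElems = 1 << 20;            // chosen so one chunk fits in device memory
    const size_t bytes      = chunkElems * sizeof(float);

    float* h_buf;                                 // pinned host memory enables async copies
    float* d_buf;
    cudaMallocHost(&h_buf, bytes);
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int numChunks = 4;                      // in practice: fileSize / bytes
    for (int c = 0; c < numChunks; ++c) {
        // (read chunk c of the input file into h_buf here)
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        process<<<(unsigned)((chunkElems + 255) / 256), 256, 0, stream>>>(d_buf, chunkElems);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);            // wait before reusing h_buf
        // (write the processed chunk back to disk here)
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

With two host/device buffers and two streams you can double-buffer, so the copy of chunk c+1 overlaps with the kernel for chunk c.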

I am sorry, maybe my question was vague. Let me rephrase the question.
So, the GPU has limited memory resources: registers, local, shared, global, constant, texture, etc. I understand that work chunking is a domain-specific problem. However, there is a hardware limitation on the device. Let's assume I am working on a GPU that has 2045 MB of global memory. What I am trying to ask is,

  1. Will there be any implications if I allocate the entire global memory to the buffer (greedy approach), or should I keep some memory aside for internal use?
  2. Is there any way to identify the maximum available memory at runtime?

Thanks a lot for the quick response. :-)

Try cudaMemGetInfo( size_t* free, size_t* total ) to get the free and total size of device memory.
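Here is a short sketch of how cudaMemGetInfo can be combined with cudaMalloc, leaving some headroom for the driver/runtime rather than grabbing every free byte (the 90% fraction is an arbitrary choice on my part, not a documented rule):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("free: %zu MB, total: %zu MB\n",
           freeBytes >> 20, totalBytes >> 20);

    // Leave headroom for the driver/runtime (arbitrary 90% fraction).
    size_t bufBytes = (size_t)(freeBytes * 0.9);

    void* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bufBytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of %zu bytes failed\n", bufBytes);
        return 1;
    }
    cudaFree(d_buf);
    return 0;
}
```

Note that the value reported as free is a snapshot; other contexts and internal allocations can change it, so checking the return status of cudaMalloc is still essential.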

Thanks a lot, episteme, I will definitely use this. Will there be any performance implications if I use all the free memory?

You can use error checking to test whether a memory allocation succeeded or not.

Basically, wrap each API call with the gpuErrchk macro.
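For reference, the gpuErrchk macro usually looks something like the following. This is the common community pattern rather than code quoted from this thread; the names gpuErrchk and gpuAssert are conventional:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the error string, source file, and line on any failed call,
// then optionally abort the program.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main()
{
    void* d_p = nullptr;
    gpuErrchk(cudaMalloc(&d_p, 1024));  // aborts with a message on failure
    gpuErrchk(cudaFree(d_p));
    return 0;
}
```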

I recently started using a stealthy and aesthetic CUDA error checking macro.

A CUDA Runtime API function can be transformed into its error-checked equivalent by parenthesizing everything after the lower case “cuda”.


cuda(Malloc(&vin_d, bytes));

The C99/C++ macro:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

void cuda_assert(const cudaError_t code, const char* const file, const int line, const bool abort)
{
  if (code != cudaSuccess)
    {
      fprintf(stderr, "cuda_assert: %s %s %d\n", cudaGetErrorString(code), file, line);

      if (abort)
        exit(code);
    }
}

#define cuda(...) cuda_assert((cuda##__VA_ARGS__), __FILE__, __LINE__, true)