Generic memory management & optimization

I am relatively new to CUDA, so please excuse me if I repeat questions that have been asked before. My work in CUDA requires processing huge files (~46 GB) that cannot fit into either host or device memory.

To implement the code, I was hoping to overlap the device-to-host copies using a separate stream and/or use texture memory. Am I on the right track?

I understand that CUDA is relatively low level, but is it possible to write generic code that allocates the largest available memory buffer?

Any article that discusses how to process huge data sets on the GPU would be helpful to me.

To answer your generic question with a generic answer: processing data that cannot fit into system or device memory, that is, out-of-core processing, is done by performing the work in chunks that do fit into memory. How you partition the data into chunks to facilitate reasonably efficient computation is typically domain specific; I would suggest consulting the literature relevant to your use case.
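As a minimal sketch of what such chunked processing can look like (this is my own illustration, not code from your use case; the chunk size, kernel, and file I/O placeholders are all hypothetical), you loop over chunks, copy each one to the device, launch a kernel, and copy the result back. A pinned host buffer and a stream let the copies run asynchronously:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: processes one chunk of floats in place.
__global__ void process(float* data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const size_t chunkElems = 1 << 20;            // chosen so one chunk fits in device memory
    const size_t bytes      = chunkElems * sizeof(float);

    float* h_buf;                                 // pinned host memory enables async copies
    float* d_buf;
    cudaMallocHost(&h_buf, bytes);
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int numChunks = 4;                      // in practice: fileSize / bytes
    for (int c = 0; c < numChunks; ++c) {
        // (read chunk c of the input file into h_buf here)
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        process<<<(unsigned)((chunkElems + 255) / 256), 256, 0, stream>>>(d_buf, chunkElems);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);            // wait before reusing h_buf
        // (write the processed chunk back to disk here)
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

With two host/device buffers and two streams you can double-buffer, so the copy of chunk c+1 overlaps with the kernel for chunk c.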

I am sorry, maybe my question was vague. Let me rephrase the question.
So, the GPU has limited memory resources: registers, local, shared, global, constant, texture, etc. I understand that work chunking is a domain-specific problem. However, there is a hardware limitation on the device. Let's assume I am working on a GPU that has 2045 MB of global memory. What I am trying to ask is,

  1. Will there be any implications if I allocate the entire global memory to the buffer (greedy approach), or should I keep some memory aside for internal use?
  2. Is there any way to identify the maximum available memory at runtime?

Thanks a lot for the quick response. :-)

Try cudaMemGetInfo( size_t* free, size_t* total ) to get the free and total size of device memory.
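Here is a short sketch of how cudaMemGetInfo can be combined with cudaMalloc, leaving some headroom for the driver/runtime rather than grabbing every free byte (the 90% fraction is an arbitrary choice on my part, not a documented rule):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("free: %zu MB, total: %zu MB\n",
           freeBytes >> 20, totalBytes >> 20);

    // Leave headroom for the driver/runtime (arbitrary 90% fraction).
    size_t bufBytes = (size_t)(freeBytes * 0.9);

    void* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bufBytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of %zu bytes failed\n", bufBytes);
        return 1;
    }
    cudaFree(d_buf);
    return 0;
}
```

Note that the value reported as free is a snapshot; other contexts and internal allocations can change it, so checking the return status of cudaMalloc is still essential.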

Thanks a lot, episteme, I will definitely use this. Will there be any performance implications if I use all the free memory?

You can use error checking to test whether a memory allocation succeeded or not.

Basically, wrap each API call with the gpuErrchk macro.
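For reference, the gpuErrchk macro usually looks something like the following. This is the common community pattern rather than code quoted from this thread; the names gpuErrchk and gpuAssert are conventional:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the error string, source file, and line on any failed call,
// then optionally abort the program.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main()
{
    void* d_p = nullptr;
    gpuErrchk(cudaMalloc(&d_p, 1024));  // aborts with a message on failure
    gpuErrchk(cudaFree(d_p));
    return 0;
}
```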

I recently started using a stealthy and aesthetic CUDA error checking macro.

A CUDA Runtime API function can be transformed into its error-checked equivalent by parenthesizing everything after the lower case “cuda”.


cuda(Malloc(&vin_d, bytes));

The C99/C++ macro:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

void cuda_assert(const cudaError_t code, const char* const file, const int line, const bool abort)
{
  if (code != cudaSuccess)
    {
      fprintf(stderr, "cuda_assert: %s %s %d\n", cudaGetErrorString(code), file, line);

      if (abort)
        exit(code);
    }
}

#define cuda(...) cuda_assert((cuda##__VA_ARGS__), __FILE__, __LINE__, true)