I am relatively new to CUDA, so please excuse me if this question has been asked before. My work in CUDA requires processing huge files (~46 GB) that cannot fit into either host or device memory.
To implement this, I was hoping to memcpy between device and host in a separate stream and/or use texture memory. Am I on the right track?
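Roughly, the overlap I have in mind looks like this (the kernel, buffer names, and chunk size are just placeholders I made up for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;  // placeholder computation
}

// Copy chunk i+1 while chunk i is being processed, using two streams.
// h_src must be page-locked (cudaHostAlloc) for the async copies to overlap.
void process_two_chunks(const float *h_src, float *d_buf[2],
                        size_t chunk_elems, cudaStream_t stream[2])
{
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d_buf[i], h_src + i * chunk_elems,
                        chunk_elems * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        my_kernel<<<(unsigned)((chunk_elems + 255) / 256), 256, 0, stream[i]>>>(
            d_buf[i], chunk_elems);
    }
    cudaDeviceSynchronize();
}
```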
I understand that CUDA is relatively low level, but is it possible to write generic code that allocates the largest available memory buffer?
Any article that discusses how to process huge data sets on the GPU would be helpful to me.
Thanks!
To answer your generic question with a generic answer: processing data that cannot fit into system or device memory, that is, out-of-core processing, is done by performing the work in chunks that do fit into memory. How you partition the data into chunks to facilitate reasonably efficient computation is typically domain specific; I would suggest consulting the literature relevant to your use case.
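As a minimal sketch of the chunked pattern (the file name, chunk size, and kernel are placeholders, and error checking is mostly omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const size_t CHUNK_BYTES = 256 * 1024 * 1024;  // 256 MB per chunk (assumption)

__global__ void process_chunk(unsigned char *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;  // placeholder computation
}

int main(void)
{
    FILE *f = fopen("huge_input.bin", "rb");  // hypothetical input file
    if (!f) return 1;

    unsigned char *h_buf, *d_buf;
    cudaHostAlloc(&h_buf, CHUNK_BYTES, cudaHostAllocDefault);  // pinned staging buffer
    cudaMalloc(&d_buf, CHUNK_BYTES);

    size_t n;
    while ((n = fread(h_buf, 1, CHUNK_BYTES, f)) > 0) {
        cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);
        process_chunk<<<(unsigned)((n + 255) / 256), 256>>>(d_buf, n);
        cudaMemcpy(h_buf, d_buf, n, cudaMemcpyDeviceToHost);
        // ... write the processed chunk back out here ...
    }

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    fclose(f);
    return 0;
}
```

With pinned host memory and multiple streams you can additionally overlap the transfers with kernel execution, but the basic structure is the same.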
I am sorry, maybe my question was vague. Let me rephrase the question.
So, a GPU has limited memory resources: registers, local, shared, global, constant, texture, etc. I understand that work chunking is a domain-specific problem. However, there is a hardware limitation on the device. Let's assume I am working on a GPU that has 2045 MB of global memory. What I am trying to ask is:
Will there be any implications if I allocate the entire global memory to my buffer (greedy approach), or should I leave some memory for internal use?
Is there any way to identify the maximum available memory at runtime?
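For context, I came across cudaMemGetInfo — would something like this be the right way to query it?

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_bytes = 0, total_bytes = 0;
    // Reports free and total device memory for the current context
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Free: %zu MB, Total: %zu MB\n",
           free_bytes / (1024 * 1024), total_bytes / (1024 * 1024));
    return 0;
}
```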