Hi all,
While I’m not a complete newcomer to C++ programming, I’m not well versed in memory management. I was hoping to run a framework for some code by you all to make sure that I’m setting things up in an effective way. I’ve looked through the documentation, but I’m still not confident in my understanding.
Anyway, the host thread/GPU kernel I’m working on should do the following:
[*]Store some (large) 3-D matrices of floats on the GPU (corresponding to physical spatial position in the simulation)
[*]Store some ints and doubles as well (physical constants and grid counters)
[*]Run some operations on the matrices, updating them
[*]Return the matrices to the host
So I assume that I’m going to use cudaMalloc3D() to allocate the matrices on the GPU, then cudaMemcpy3D() to write them, store the ints and doubles in constant memory, have some threads in some blocks with each thread updating a particular element of the 3-D volume, and finally call cudaMemcpy3D() again to put the values back onto the host. So far as I can tell I don’t want to mess with textures or any other kind of memory, because I’ll be updating large areas, but I do want the constants in constant memory because every thread will be reading them, right? Do I want to mess with zero-copy memory, or would the need to update it on the CPU ruin any transfer-rate advantage?
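To make that concrete, here’s roughly what I have in mind, end to end. This is a rough, untested sketch: NX/NY/NZ, update_kernel, the constant names, and the dummy “+= dt” update are all placeholders I made up for this post, and the 3-D grid launch assumes the card supports one.

[code]
// Rough, untested sketch of the whole pipeline. NX/NY/NZ, update_kernel and the
// constant names are placeholders; the dummy "+= dt" stands in for the real update.
#include <cuda_runtime.h>
#include <cstdlib>

#define NX 512
#define NY 512
#define NZ 512

// Physical constants and grid counters that every thread reads -> constant memory
__constant__ double c_dt;
__constant__ int    c_nx, c_ny, c_nz;

// One thread per element of the pitched 3-D volume
__global__ void update_kernel(cudaPitchedPtr vol)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= c_nx || y >= c_ny || z >= c_nz) return;

    char*  slice = (char*)vol.ptr + (size_t)z * vol.pitch * vol.ysize;
    float* row   = (float*)(slice + (size_t)y * vol.pitch);
    row[x] += (float)c_dt;                          // placeholder update
}

int main()
{
    size_t nElem = (size_t)NX * NY * NZ;
    float* h_vol = (float*)calloc(nElem, sizeof(float));

    // Pitched 3-D allocation on the device (extent width is in bytes)
    cudaExtent extent = make_cudaExtent(NX * sizeof(float), NY, NZ);
    cudaPitchedPtr d_vol;
    cudaMalloc3D(&d_vol, extent);

    // Host -> device copy of the whole volume
    cudaMemcpy3DParms cp = {};
    cp.srcPtr = make_cudaPitchedPtr(h_vol, NX * sizeof(float), NX, NY);
    cp.dstPtr = d_vol;
    cp.extent = extent;
    cp.kind   = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&cp);

    // Scalars go into constant memory once, before any launches
    double dt = 1e-3;
    int nx = NX, ny = NY, nz = NZ;
    cudaMemcpyToSymbol(c_dt, &dt, sizeof(double));
    cudaMemcpyToSymbol(c_nx, &nx, sizeof(int));
    cudaMemcpyToSymbol(c_ny, &ny, sizeof(int));
    cudaMemcpyToSymbol(c_nz, &nz, sizeof(int));

    // 8x8x8 threads per block, one thread per element; assumes the card allows a 3-D grid
    dim3 block(8, 8, 8);
    dim3 grid((NX + 7) / 8, (NY + 7) / 8, (NZ + 7) / 8);
    update_kernel<<<grid, block>>>(d_vol);

    // Device -> host copy of the result: same descriptor, direction swapped
    cp.srcPtr = d_vol;
    cp.dstPtr = make_cudaPitchedPtr(h_vol, NX * sizeof(float), NX, NY);
    cp.kind   = cudaMemcpyDeviceToHost;
    cudaMemcpy3D(&cp);

    cudaFree(d_vol.ptr);
    free(h_vol);
    return 0;
}
[/code]

Does the pitched indexing inside the kernel (using pitch and ysize) look right, or am I misusing cudaPitchedPtr?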
Also, it’s possible that I’ll have upwards of 65,535 × 512 elements (512³, for example). I’ve seen conflicting info on how the maximum number of blocks per kernel call works: is it 65,535 blocks maximum in total, or 65,535 × 65,535 blocks maximum (that is, a maximum of 65,535 blocks in each dimension)? Also, even on Tesla cards, won’t I run out of memory with multiple floats per element? Is there no way to prevent errors here other than manually confirming that the GPU will have enough memory and not running anything that requires larger matrices than that?
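For the memory side, the only plan I’ve come up with is to ask the runtime how much is free before allocating anything, something like the untested sketch below. It assumes 512³ elements, 8×8×8 blocks, three floats per element, and that the 65,535 cap really is per grid dimension.

[code]
// Untested sanity-check sketch for the sizing questions. Assumes 512^3 elements,
// 8x8x8 blocks, and that the 65,535 limit applies per grid dimension.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t nx = 512, ny = 512, nz = 512;
    const size_t bytesPerField = nx * ny * nz * sizeof(float);   // 512 MB per float field
    const int    fieldsPerElement = 3;                           // e.g. three floats per element

    // Ask the runtime how much device memory is actually free before allocating
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("device memory: %zu MB free of %zu MB\n", freeB >> 20, totalB >> 20);

    if (fieldsPerElement * bytesPerField > freeB) {
        fprintf(stderr, "not enough device memory; need a smaller grid or a multi-GPU split\n");
        return 1;
    }

    // 512^3 elements at 512 threads/block -> 64 x 64 x 64 blocks; if the 65,535 cap
    // really is per dimension, that fits easily even though the total is 262,144 blocks
    dim3 block(8, 8, 8);
    dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
    printf("launch shape: %u x %u x %u blocks of %u threads\n",
           grid.x, grid.y, grid.z, block.x * block.y * block.z);
    return 0;
}
[/code]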
Finally, after a proof of concept I want to optimize performance for multiple GPUs in the same host, which means swapping only boundary conditions between GPUs working on adjacent grids of volume elements rather than swapping gigabytes of arrays over PCIe with every timestep. Is there a way to leave the allocated matrices on the CUDA device between kernel calls and update only certain elements as the program runs? (I suspect that I could simply not call cudaFree() and retain the pointer, but I was raised on new, not malloc.)
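In other words, I’m picturing something like the following: allocate once, launch the timestep kernel repeatedly against the same device pointer, and move only a boundary slice per step. Again untested, and step_kernel / copy_boundary_slice_to_host are just names I invented for the sketch.

[code]
// Untested sketch of "leave the volume on the device, move only a boundary slice".
// step_kernel and copy_boundary_slice_to_host are hypothetical names for this example.
#include <cuda_runtime.h>

__global__ void step_kernel(cudaPitchedPtr vol)
{
    // placeholder for the real per-element timestep update
}

// Copy a single z-slice (the halo) of the pitched volume back to the host,
// leaving the rest of the allocation untouched on the device.
void copy_boundary_slice_to_host(cudaPitchedPtr d_vol, float* h_slice,
                                 size_t nx, size_t ny, size_t zIndex)
{
    cudaMemcpy3DParms cp = {};
    cp.srcPtr = d_vol;
    cp.srcPos = make_cudaPos(0, 0, zIndex);                    // start at the boundary slice
    cp.dstPtr = make_cudaPitchedPtr(h_slice, nx * sizeof(float), nx, ny);
    cp.extent = make_cudaExtent(nx * sizeof(float), ny, 1);    // just one slice deep
    cp.kind   = cudaMemcpyDeviceToHost;
    cudaMemcpy3D(&cp);
}

void run_timesteps(cudaPitchedPtr d_vol, float* h_halo,
                   size_t nx, size_t ny, size_t nz, int nSteps)
{
    dim3 block(8, 8, 8);
    dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);

    for (int step = 0; step < nSteps; ++step) {
        step_kernel<<<grid, block>>>(d_vol);   // d_vol was allocated once and stays resident
        copy_boundary_slice_to_host(d_vol, h_halo, nx, ny, nz - 1);
        // ...exchange h_halo with the GPU working on the adjacent grid and copy its
        // boundary back in the other direction, instead of moving the whole volume...
    }
    // note: no cudaFree() here; the volume persists until the program decides to free it
}
[/code]

Is cudaMemcpy3D() with srcPos and a one-slice extent the right tool for pulling a single slice out of a pitched allocation like that?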
Thanks very much for reviewing this plan (or blunderbuss-blast of questions) for me, and any expert advice is greatly appreciated!