Hi all,
While I’m not a complete newcomer to C++ programming, I’m not well versed in memory management. I was hoping to run a framework for some code by you all to make sure that I’m setting things up in an effective way. I’ve looked through the documentation, but I’m still not confident in my understanding.
Anyway, the host thread/GPU kernel I’m working on should do the following:
[*]Store some (large) 3-D matrices of floats on the GPU (corresponding to physical spatial position in the simulation)
[*]Store some ints and doubles as well (physical constants and grid counters)
[*]Run some operations on the matrices, updating them
[*]Return the matrices to the host
So I assume that I’m going to use cudaMalloc3D() to allocate the matrices on the GPU, then cudaMemcpy3D() to write them, store the ints and doubles in constant memory, have some threads in some blocks with each thread updating a particular element of the 3-D volume, and finally call cudaMemcpy3D() again to put the values back onto the host. So far as I can tell I don’t want to mess with textures or any other kind of memory, because I’ll be updating large areas, but I do want the constants in constant memory because every thread will be reading them, right? Do I want to mess with zero-copy memory, or would the need to update it on the CPU ruin any transfer-rate advantage?
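To make that concrete, here’s roughly what I have in mind, end to end. This is a rough, untested sketch: NX/NY/NZ, update_kernel, the constant names, and the dummy “+= dt” update are all placeholders I made up for this post, and the 3-D grid launch assumes the card supports one.

[code]
// Rough, untested sketch of the whole pipeline. NX/NY/NZ, update_kernel and the
// constant names are placeholders; the dummy "+= dt" stands in for the real update.
#include <cuda_runtime.h>
#include <cstdlib>

#define NX 512
#define NY 512
#define NZ 512

// Physical constants and grid counters that every thread reads -> constant memory
__constant__ double c_dt;
__constant__ int    c_nx, c_ny, c_nz;

// One thread per element of the pitched 3-D volume
__global__ void update_kernel(cudaPitchedPtr vol)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= c_nx || y >= c_ny || z >= c_nz) return;

    char*  slice = (char*)vol.ptr + (size_t)z * vol.pitch * vol.ysize;
    float* row   = (float*)(slice + (size_t)y * vol.pitch);
    row[x] += (float)c_dt;                          // placeholder update
}

int main()
{
    size_t nElem = (size_t)NX * NY * NZ;
    float* h_vol = (float*)calloc(nElem, sizeof(float));

    // Pitched 3-D allocation on the device (extent width is in bytes)
    cudaExtent extent = make_cudaExtent(NX * sizeof(float), NY, NZ);
    cudaPitchedPtr d_vol;
    cudaMalloc3D(&d_vol, extent);

    // Host -> device copy of the whole volume
    cudaMemcpy3DParms cp = {};
    cp.srcPtr = make_cudaPitchedPtr(h_vol, NX * sizeof(float), NX, NY);
    cp.dstPtr = d_vol;
    cp.extent = extent;
    cp.kind   = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&cp);

    // Scalars go into constant memory once, before any launches
    double dt = 1e-3;
    int nx = NX, ny = NY, nz = NZ;
    cudaMemcpyToSymbol(c_dt, &dt, sizeof(double));
    cudaMemcpyToSymbol(c_nx, &nx, sizeof(int));
    cudaMemcpyToSymbol(c_ny, &ny, sizeof(int));
    cudaMemcpyToSymbol(c_nz, &nz, sizeof(int));

    // 8x8x8 threads per block, one thread per element; assumes the card allows a 3-D grid
    dim3 block(8, 8, 8);
    dim3 grid((NX + 7) / 8, (NY + 7) / 8, (NZ + 7) / 8);
    update_kernel<<<grid, block>>>(d_vol);

    // Device -> host copy of the result: same descriptor, direction swapped
    cp.srcPtr = d_vol;
    cp.dstPtr = make_cudaPitchedPtr(h_vol, NX * sizeof(float), NX, NY);
    cp.kind   = cudaMemcpyDeviceToHost;
    cudaMemcpy3D(&cp);

    cudaFree(d_vol.ptr);
    free(h_vol);
    return 0;
}
[/code]

Does the pitched indexing inside the kernel (using pitch and ysize) look right, or am I misusing cudaPitchedPtr?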
Also, it’s possible that I’ll have upwards of 65,535 × 512 elements (512³, for example). I’ve seen conflicting info on how the maximum number of blocks per kernel call works: is it 65,535 blocks maximum in total, or 65,535 × 65,535 blocks maximum (that is, a maximum of 65,535 blocks in each dimension)? Also, even on Tesla cards, won’t I run out of memory with multiple floats per element? Is there no way to prevent errors here other than manually confirming that the GPU will have enough memory and not running anything that requires larger matrices than that?
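For the memory side, the only plan I’ve come up with is to ask the runtime how much is free before allocating anything, something like the untested sketch below. It assumes 512³ elements, 8×8×8 blocks, three floats per element, and that the 65,535 cap really is per grid dimension.

[code]
// Untested sanity-check sketch for the sizing questions. Assumes 512^3 elements,
// 8x8x8 blocks, and that the 65,535 limit applies per grid dimension.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t nx = 512, ny = 512, nz = 512;
    const size_t bytesPerField = nx * ny * nz * sizeof(float);   // 512 MB per float field
    const int    fieldsPerElement = 3;                           // e.g. three floats per element

    // Ask the runtime how much device memory is actually free before allocating
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("device memory: %zu MB free of %zu MB\n", freeB >> 20, totalB >> 20);

    if (fieldsPerElement * bytesPerField > freeB) {
        fprintf(stderr, "not enough device memory; need a smaller grid or a multi-GPU split\n");
        return 1;
    }

    // 512^3 elements at 512 threads/block -> 64 x 64 x 64 blocks; if the 65,535 cap
    // really is per dimension, that fits easily even though the total is 262,144 blocks
    dim3 block(8, 8, 8);
    dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
    printf("launch shape: %u x %u x %u blocks of %u threads\n",
           grid.x, grid.y, grid.z, block.x * block.y * block.z);
    return 0;
}
[/code]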
Finally, after a proof of concept I want to optimize performance for multiple GPUs in the same host, which means swapping only boundary conditions between GPUs working on adjacent grids of volume elements rather than swapping gigabytes of arrays over PCIe with every timestep. Is there a way to leave the allocated matrices on the CUDA device between kernel calls and update only certain elements as the program runs? (I suspect that I could simply not call cudaFree() and retain the pointer, but I was raised on new, not malloc.)
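In other words, I’m picturing something like the following: allocate once, launch the timestep kernel repeatedly against the same device pointer, and move only a boundary slice per step. Again untested, and step_kernel / copy_boundary_slice_to_host are just names I invented for the sketch.

[code]
// Untested sketch of "leave the volume on the device, move only a boundary slice".
// step_kernel and copy_boundary_slice_to_host are hypothetical names for this example.
#include <cuda_runtime.h>

__global__ void step_kernel(cudaPitchedPtr vol)
{
    // placeholder for the real per-element timestep update
}

// Copy a single z-slice (the halo) of the pitched volume back to the host,
// leaving the rest of the allocation untouched on the device.
void copy_boundary_slice_to_host(cudaPitchedPtr d_vol, float* h_slice,
                                 size_t nx, size_t ny, size_t zIndex)
{
    cudaMemcpy3DParms cp = {};
    cp.srcPtr = d_vol;
    cp.srcPos = make_cudaPos(0, 0, zIndex);                    // start at the boundary slice
    cp.dstPtr = make_cudaPitchedPtr(h_slice, nx * sizeof(float), nx, ny);
    cp.extent = make_cudaExtent(nx * sizeof(float), ny, 1);    // just one slice deep
    cp.kind   = cudaMemcpyDeviceToHost;
    cudaMemcpy3D(&cp);
}

void run_timesteps(cudaPitchedPtr d_vol, float* h_halo,
                   size_t nx, size_t ny, size_t nz, int nSteps)
{
    dim3 block(8, 8, 8);
    dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);

    for (int step = 0; step < nSteps; ++step) {
        step_kernel<<<grid, block>>>(d_vol);   // d_vol was allocated once and stays resident
        copy_boundary_slice_to_host(d_vol, h_halo, nx, ny, nz - 1);
        // ...exchange h_halo with the GPU working on the adjacent grid and copy its
        // boundary back in the other direction, instead of moving the whole volume...
    }
    // note: no cudaFree() here; the volume persists until the program decides to free it
}
[/code]

Is cudaMemcpy3D() with srcPos and a one-slice extent the right tool for pulling a single slice out of a pitched allocation like that?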
Thanks very much for reviewing this plan (or blunderbuss-blast of questions) for me, and any expert advice is greatly appreciated!