GPU: 1 Tesla M2090
I am trying to solve a problem in which the amount of data that needs to be sent back to the CPU is not known before the GPU kernel runs. It is essentially orbit propagation / ODE integration, with result data produced at every time step. The kernel is compute intensive.
I have never dealt with this kind of data-management problem before and would like to know a good way to handle it.
I have been doing some digging on the internet (Stack Overflow, these forums, etc.) and it looks like I have the following options:
Allocate ALL 6 GB of device memory up front, write results into it, and use another kernel to figure out how much memory needs to be transferred back. The problem is: what happens if my kernel produces more than 6 GB of data? Plus it can be slow.
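Here is roughly what I have in mind for option 1 — a pre-allocated buffer plus a global atomic counter, so the host only copies back what was actually written. All names here (Result, propagate, the placeholder step loop) are made up for illustration:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

struct Result { double t; double state[6]; };    // one ODE state per time step

__device__ unsigned int d_count = 0;             // how many Results were written

__global__ void propagate(Result *out, unsigned int capacity)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // ... integrate the orbit for this thread; each step emits one Result ...
    for (int step = 0; step < 100; ++step) {          // placeholder step loop
        unsigned int slot = atomicAdd(&d_count, 1u);  // grab a unique output slot
        if (slot >= capacity) return;                 // buffer full: bail out
        out[slot].t = (double)step;                   // placeholder payload
        for (int i = 0; i < 6; ++i) out[slot].state[i] = tid + 0.1 * i;
    }
}

int main()
{
    const unsigned int capacity = 1u << 20;           // whatever fits in 6 GB
    Result *d_out;
    cudaMalloc(&d_out, capacity * sizeof(Result));

    propagate<<<64, 128>>>(d_out, capacity);
    cudaDeviceSynchronize();

    unsigned int count;                               // read the counter back
    cudaMemcpyFromSymbol(&count, d_count, sizeof(count));
    if (count > capacity) count = capacity;           // clamp if we overflowed

    Result *h_out = (Result *)malloc(count * sizeof(Result));
    cudaMemcpy(h_out, d_out, count * sizeof(Result), cudaMemcpyDeviceToHost);
    printf("copied %u results\n", count);

    free(h_out); cudaFree(d_out);
    return 0;
}
```

The counter also tells me when the buffer overflowed, so in principle I could re-launch for the leftover data — but that still doesn't solve the "more than 6 GB" case cleanly.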
Use zero-copy: my kernel does a lot of computation per thread and the results are written only once. Can I use this? I have no experience with zero-copy, but it seems like a good candidate for writing data directly to host memory. (I have 24 GB of host RAM, so memory size on that side is not a huge issue.)
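For option 2, this is my understanding of how mapped pinned memory would look — each thread does its heavy integration in registers/local memory and then writes its result once, straight into host RAM over PCIe (the kernel and buffer names are again made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void propagate(double *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    // ... heavy per-thread integration here ...
    out[tid] = (double)tid * 2.0;    // single write, goes directly to host memory
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);    // must be set before the context is created

    const int n = 1 << 20;
    double *h_out, *d_out;
    // Pinned, mapped host allocation: visible to both CPU and GPU.
    cudaHostAlloc(&h_out, n * sizeof(double), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    propagate<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();                  // after this, results are already in h_out

    printf("h_out[12345] = %f\n", h_out[12345]);
    cudaFreeHost(h_out);
    return 0;
}
```

My worry is whether the uncoalesced/PCIe write traffic would hurt, or whether the compute-heavy kernel hides it.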
Dynamically allocate memory on the device (in-kernel malloc()) and then copy it to the host. I found claims that this is no longer supported since CUDA 4.1 and that NVIDIA is working on a fix? Maybe I am wrong; the CUDA 4.1 Programming Guide on pg. 108 says it can be done:
“Memory allocated via malloc() can be copied using the runtime (i.e. by calling any of the copy memory functions from Sections 3.2.2).”
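In case option 3 does not work directly (I have seen reports that cudaMemcpy cannot touch the in-kernel malloc() heap, despite that sentence in the guide), the workaround I've seen suggested is a staging copy: a second kernel moves the data from the device heap into a cudaMalloc()'d buffer, which the runtime can definitely copy. Rough sketch, all names invented:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__device__ double *d_heap_buf = NULL;   // pointer into the in-kernel malloc heap
__device__ int d_heap_len = 0;

__global__ void produce(int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        d_heap_buf = (double *)malloc(n * sizeof(double));  // device-heap alloc
        d_heap_len = (d_heap_buf != NULL) ? n : 0;
        for (int i = 0; i < d_heap_len; ++i) d_heap_buf[i] = 0.5 * i;
    }
}

// Staging kernel: copy device-heap data into runtime-visible memory.
__global__ void stage(double *dst)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_heap_len) dst[i] = d_heap_buf[i];
}

int main()
{
    const int n = 1024;
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u << 20); // enlarge device heap

    produce<<<1, 1>>>(n);
    cudaDeviceSynchronize();

    int len;
    cudaMemcpyFromSymbol(&len, d_heap_len, sizeof(len));    // how much was produced

    double *d_stage, *h_out = (double *)malloc(len * sizeof(double));
    cudaMalloc(&d_stage, len * sizeof(double));
    stage<<<(len + 255) / 256, 256>>>(d_stage);
    cudaMemcpy(h_out, d_stage, len * sizeof(double), cudaMemcpyDeviceToHost);

    printf("h_out[10] = %f\n", h_out[10]);
    cudaFree(d_stage); free(h_out);
    return 0;
}
```

But that staging buffer has to come out of the same 6 GB, so this does not really buy me anything over option 1 unless I misunderstand something.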
So I am confused as to which is the best way forward, or whether there is a better way of solving this problem.
All my computations are in double precision.
Any help or input is welcome… thanks for the help.