I’m quite new to GPGPU computing. I’ve read through NVIDIA’s OpenCL and CUDA guides and still have a question about how best to leverage the capabilities of CUDA in my particular case, which typically looks like this:
• the system to simulate is composed of an artificial neural network coupled to a physics engine
• one wants to run a few hundred such systems in parallel, for a specified number of iterations, and then collect some statistics about the results.
â€¢ a single run can be described as follows:
[codebox]while time < maxTime:
    take one neural network step
    feed output into physics engine
    take one physics step
    feed output into neural network
return some statistics[/codebox]
In an ideal world, the physics engine would also run on the GPU, so as to minimise memory transfers between host and device. However, I don’t know of any GPU-only physics engine out there, and I don’t think I can write one myself ;-) Moreover, there are things that MUST run on the CPU: for example, one might replace the physics engine with data from actual sensors, and the same problem would arise. How do we efficiently stream data to and from the GPU in such a situation?
So… would you
(a) invoke the neural network kernel once for each timestep (a typical run is 500-5,000 timesteps): straightforward, but possibly very inefficient, since data structures in shared memory would be lost between invocations and would need to be reallocated every time?
(b) invoke the neural network kernel once for the entire simulation, and have the host write and read data to/from the device’s global memory repeatedly during the kernel’s lifetime: potentially more efficient, but (correct me if I’m wrong) non-deterministic because it is not possible to synchronise at this level?