I have this code running on the host. Is there any way to accelerate the kernel launches between the loop iterations? `data[0].t` is mapped into memory with the `cudaDeviceMapHost` flag, so I don't use `cudaMemcpy` from device to host.
Also, which is better? If I have an array stored in global memory (I'm referring to the `d_reaction` variable), I can access the data without losing it between kernels. Or is it better to load it into shared memory on every kernel launch? Which way is faster?
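To be concrete, here is a simplified sketch of the pattern I mean (`step_kernel`, `d_reaction`, and the timestep logic are placeholders standing in for my real code):

```cuda
// Host side: mapped memory for data[0].t so no cudaMemcpy is needed.
cudaSetDeviceFlags(cudaDeviceMapHost);             // must be set before allocation

float *t_mapped;                                   // host pointer
cudaHostAlloc(&t_mapped, sizeof(float), cudaHostAllocMapped);
float *d_t;
cudaHostGetDevicePointer(&d_t, t_mapped, 0);       // device alias of the same memory

float t = 0.0f;
while (t < t_final) {
    step_kernel<<<blocks, threads>>>(d_reaction, d_t);
    cudaDeviceSynchronize();                       // forced sync on every iteration
    t += *t_mapped;                                // kernel wrote the timestep here
}
```

The per-iteration `cudaDeviceSynchronize()` is what I would like to avoid.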
If you do the `t += data[0].t` inside the kernel and abort once it is larger than `t_final`, you can speculatively launch the kernel several times at once without needing to synchronize in between.
And you might be able to do without the mapped memory then (you might even have to, depending on how you organize it - remember there are no atomic operations on mapped memory).
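A rough sketch of what that could look like (assuming a device-resident time accumulator and a placeholder `compute_dt()` for however the timestep is produced; a second tiny kernel does the accumulation so there is no race within a launch, since launches on the same stream serialize):

```cuda
__device__ float d_t = 0.0f;          // accumulated time lives in device global memory

__global__ void step_kernel(Reaction *d_reaction, float t_final) {
    if (d_t >= t_final) return;       // extra speculative launches become no-ops
    // ... one simulation step, writing results into d_reaction ...
}

__global__ void advance_time(Reaction *d_reaction) {
    d_t += compute_dt(d_reaction);    // single thread, runs after the step completes
}

// Host side: queue a batch of iterations without synchronizing in between.
for (int i = 0; i < BATCH; ++i) {
    step_kernel<<<blocks, threads>>>(d_reaction, t_final);
    advance_time<<<1, 1>>>(d_reaction);
}
cudaDeviceSynchronize();              // one sync per BATCH launches instead of per launch
```

If `BATCH` overshoots `t_final`, the surplus launches return immediately, so the overshoot only costs launch overhead, not computation.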
For example, is it better to copy 8 KB from device to host with `cudaMemcpy` and run the kernel once, or to copy 4 KB from device to host with `cudaMemcpy` and run the kernel twice?