Question about accelerating a kernel in a loop

I have this code running on the host. Is there any way to accelerate the kernel launches between loop iterations? The data[0].t is mapped in memory with the cudaDeviceMapHost flag, so I don't use cudaMemcpy from device to host.

Also, which is better? If I have an array stored in global memory (I mean the d_reaction variable), I can access the data without losing it between kernels. Or is it better to load it into shared memory on every kernel launch? Which way is faster?

while (t <= t_final)
{
    frm_kernel<<<1, num_reactions, shared_size>>>(state, d_reaction, d_species, d_data);

    cudaThreadSynchronize();

    t += data[0].t;

    fprintf(pfile2, "%f,%d\n", data[0].t, data[0].pos);
}

If you do the t += data[0].t inside the kernel and abort once it is larger than t_final, you can speculatively launch the kernel several times at once without needing to synchronize in between.
And you might be able to do without the mapped memory then (you might even have to, depending on how you organize it - remember there are no atomic operations on mapped memory).
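A rough sketch of that idea, with hypothetical names (d_t, d_done, BATCH are not from the original code, and the kernel body is elided): the running time and a stop flag live in device global memory, later speculative launches see the flag and return immediately, and the host only synchronizes once per batch.

__global__ void frm_kernel(state_t *state, reaction_t *d_reaction,
                           species_t *d_species, data_t *d_data,
                           float *d_t, int *d_done, float t_final)
{
    if (*d_done)        // an earlier launch already passed t_final,
        return;         // so this speculative launch becomes a no-op

    ...                 // original kernel body

    if (threadIdx.x == 0) {
        *d_t += d_data[0].t;   // accumulate t on the device instead of the host
        if (*d_t > t_final)
            *d_done = 1;       // tell the remaining queued launches to abort
    }
}

// Host side: queue a batch of launches back-to-back, then check once.
int done = 0;
while (!done) {
    for (int i = 0; i < BATCH; i++)
        frm_kernel<<<1, num_reactions, shared_size>>>(state, d_reaction,
            d_species, d_data, d_t, d_done, t_final);
    // The blocking copy also acts as the synchronization point.
    cudaMemcpy(&done, d_done, sizeof(int), cudaMemcpyDeviceToHost);
}

This trades a little wasted work (the aborted launches) for removing one host/device round-trip per iteration.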

If I don't use the mapped memory, I'll use global or shared memory, and after the kernel I'll use cudaMemcpy.

Does anyone know the best transfer size for cudaMemcpy?

For example, is it better to copy 8 KB from device to host with cudaMemcpy and run the kernel once, or to copy 4 KB and run the kernel twice?

The larger the better.
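The reason is that every cudaMemcpy call pays a fixed per-call overhead on top of the actual transfer time, so batching results and copying them in one call amortizes that overhead. A small illustrative sketch (host, dev, N and CHUNK are hypothetical names):

// Instead of N small copies, each paying the per-call overhead:
for (int i = 0; i < N; i++)
    cudaMemcpy(host + i * CHUNK, dev + i * CHUNK,
               CHUNK * sizeof(data_t), cudaMemcpyDeviceToHost);

// ...prefer one larger copy of the same total data:
cudaMemcpy(host, dev, N * CHUNK * sizeof(data_t), cudaMemcpyDeviceToHost);

In your example, this means letting the kernel produce 8 KB of results and copying once, rather than running it twice and copying 4 KB each time.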

I can't reduce the time, and I'm thinking of using streams, but I don't understand exactly how to use them.
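The usual streams pattern for a producer loop like yours is double buffering: two streams, each with its own buffer, so the device-to-host copy of one batch overlaps with the kernel producing the next. A minimal sketch, assuming page-locked host memory (required for cudaMemcpyAsync to overlap) and hypothetical buffer names:

cudaStream_t stream[2];
data_t *h_data[2], *d_buf[2];
for (int s = 0; s < 2; s++) {
    cudaStreamCreate(&stream[s]);
    cudaMallocHost(&h_data[s], BUFSIZE * sizeof(data_t));  // pinned host memory
    cudaMalloc(&d_buf[s], BUFSIZE * sizeof(data_t));
}

int s = 0;
while (t < t_final) {
    // Queue kernel + async copy for the current batch in stream[s].
    mykernel<<<1, num, 0, stream[s]>>>(d_buf[s]);
    cudaMemcpyAsync(h_data[s], d_buf[s], BUFSIZE * sizeof(data_t),
                    cudaMemcpyDeviceToHost, stream[s]);

    // While stream[s] works, process the batch the other stream finished
    // (skip this on the very first iteration, when no batch is ready yet).
    cudaStreamSynchronize(stream[1 - s]);
    // ... fprintf results from h_data[1 - s] and update t here ...

    s = 1 - s;   // swap buffers
}

Note that this only helps if the host has useful work (like the fprintf loop) to do while a batch is in flight; with a single tiny kernel per iteration the launch overhead may still dominate.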

__global__ void mykernel(..., data_t *data, ...)
{
  ...
  for (i = 0; i < BUFSIZE; i++)
  {
    ...
    if (idx == 0)
    {
      data[i].t = ...;
      data[i].pos = ...;
    }
  }
}

main()
{
  ...
  while (t < t_final)
  {
    mykernel<<<1, num>>>(d_data);

    cudaMemcpy(data, d_data, BUFSIZE * sizeof(data_t), cudaMemcpyDeviceToHost);

    for (i = 0; i < BUFSIZE && data[i].t != -1.0f; i++)
      fprintf(pfile2, "%f,%d\n", data[i].t, data[i].pos);
  }
  ...
}

The data variable is only the output array from the kernel. I tried shared memory, but the time was reduced only slightly, which is why I'm asking about streams.