Running Finite-Difference Time-Domain loop on device

Is it possible to run the time-stepping loop of a Finite-Difference Time-Domain (FDTD) scheme completely within a kernel running on the device? I understand that if a display is connected to the same graphics card that is used for computation, kernel execution time is limited. However, is there a way to synchronize all of the threads, thereby effectively bypassing this time limit?

I need to run calculations within the time loop, and then swap 2D textures. For example, given pitch-linear texture memory variables pnsub1, pn, and pnplus1, I would like to do the following:

pnsub1 = pn;

pn = pnplus1;

Is there a way to swap 2D texture memory on the device? Would I use cudaMemcpy or a similar function?

The following code snippet shows what I would like to achieve with the kernel running on the device.

__global__ void run_kernel()
{
    for (int t = 0; t < tN; t++)
    {
        // perform calculations in here

        // swap 2D textures here

        // synchronize threads here?
    }
}

In principle it is possible to achieve something similar using persistent threads.
However, what advantage are you hoping for compared to multiple kernel launches?
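For reference, newer CUDA toolkits offer grid-wide synchronization through cooperative groups, which is one way to get the "synchronize all threads" behaviour asked about. The following is only a minimal sketch under several assumptions: the kernel name `fdtd_all_steps`, the helper `run_persistent`, the simple wave-equation stencil, and the use of plain global memory instead of textures are all hypothetical and not taken from the original code. It needs a device that supports cooperative launch and, depending on the toolkit version, may need flags such as `-arch=sm_60` or `-rdc=true`.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical kernel: all tN steps of a simple 2D wave stencil in one launch.
__global__ void fdtd_all_steps(float* pnsub1, float* pn, float* pnplus1,
                               int nx, int ny, int tN, float c2)
{
    cg::grid_group grid = cg::this_grid();
    const unsigned long long n = (unsigned long long)nx * ny;
    for (int t = 0; t < tN; ++t) {
        // Grid-stride loop, so a grid smaller than the field still covers every cell.
        for (unsigned long long i = grid.thread_rank(); i < n; i += grid.size()) {
            int x = (int)(i % nx), y = (int)(i / nx);
            if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) continue;  // boundary stays 0
            float lap = pn[i - 1] + pn[i + 1] + pn[i - nx] + pn[i + nx] - 4.0f * pn[i];
            pnplus1[i] = 2.0f * pn[i] - pnsub1[i] + c2 * lap;
        }
        grid.sync();  // every thread in the grid waits here before the next step
        // Each thread rotates its own local copies of the three pointers.
        float* old = pnsub1; pnsub1 = pn; pn = pnplus1; pnplus1 = old;
    }
}

// Returns the final field, or an empty vector if cooperative launch is unavailable.
std::vector<float> run_persistent(int nx, int ny, int tN, float c2)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.cooperativeLaunch)
        return {};

    size_t bytes = size_t(nx) * ny * sizeof(float);
    float* buf[3];  // buf[0] = pnsub1, buf[1] = pn, buf[2] = pnplus1
    for (float*& b : buf) { cudaMalloc(&b, bytes); cudaMemset(b, 0, bytes); }
    float one = 1.0f;  // unit impulse at the centre of pn
    cudaMemcpy(buf[1] + (ny / 2) * nx + nx / 2, &one, sizeof(float), cudaMemcpyHostToDevice);

    // All blocks must be resident at once, so size the grid from the occupancy query.
    int block = 256, perSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSm, fdtd_all_steps, block, 0);
    int gridSize = perSm * prop.multiProcessorCount;

    void* args[] = {&buf[0], &buf[1], &buf[2], &nx, &ny, &tN, &c2};
    cudaError_t err = cudaLaunchCooperativeKernel((void*)fdtd_all_steps, dim3(gridSize),
                                                  dim3(block), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::vector<float> out;
    if (err == cudaSuccess) {
        out.resize(size_t(nx) * ny);
        // After tN rotations the current field lives in buf[(1 + tN) % 3].
        cudaMemcpy(out.data(), buf[(1 + tN) % 3], bytes, cudaMemcpyDeviceToHost);
    }
    for (float* b : buf) cudaFree(b);
    return out;
}
```

Note that this does not get around the display watchdog: the limit applies per kernel launch, so a single long-running kernel is more likely to hit it than many short ones.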

Thank you for your response, tera! Since the timestep of my simulation is very small, I am hoping to have the GPU run the simulation at a quicker rate than what might be achieved by having the host processor trigger each kernel.

If the time loop is executed on the host, and the host triggers each kernel, then must the host also swap the matrices? If the host starts each kernel and then swaps the matrices, then wouldn’t this affect performance? Will there be a performance penalty by having the host start each kernel?

What is the best way to do what I am wanting to do?

Do you really want to swap the matrices in memory, or just swap pointers? The latter should be no problem at all.

I haven’t checked this myself, since my kernels run far longer, but kernel launch overhead should be something like 10 µs or better. My personal advice would be to use one kernel invocation per timestep first and get that working. Then, if kernel launch overhead turns out to be significant, you can always optimize later.
Since persistent threads are tricky, you will want to have a working CUDA reference implementation anyway so that you can check whether any problem is due to persistent threads or something else in your code.
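To make the "one kernel per timestep, swap only pointers" suggestion concrete, here is a minimal sketch. It assumes plain (non-pitched, non-texture) global memory and a simple wave-equation stencil; the names `fdtd_step` and `run_fdtd` and the parameter `c2` are hypothetical and not from the original code. The point is only that the swap is done on the host pointer variables, with no device-side copy.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical update: 5-point stencil on interior cells, boundary cells stay 0.
__global__ void fdtd_step(const float* pnsub1, const float* pn, float* pnplus1,
                          int nx, int ny, float c2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;
    int i = y * nx + x;
    float lap = pn[i - 1] + pn[i + 1] + pn[i - nx] + pn[i + nx] - 4.0f * pn[i];
    pnplus1[i] = 2.0f * pn[i] - pnsub1[i] + c2 * lap;
}

// Runs tN timesteps from a unit impulse and returns the final field.
std::vector<float> run_fdtd(int nx, int ny, int tN, float c2)
{
    size_t bytes = size_t(nx) * ny * sizeof(float);
    float *pnsub1, *pn, *pnplus1;
    cudaMalloc(&pnsub1, bytes);  cudaMemset(pnsub1, 0, bytes);
    cudaMalloc(&pn, bytes);      cudaMemset(pn, 0, bytes);
    cudaMalloc(&pnplus1, bytes); cudaMemset(pnplus1, 0, bytes);
    float one = 1.0f;  // unit impulse at the centre of pn
    cudaMemcpy(pn + (ny / 2) * nx + nx / 2, &one, sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    for (int t = 0; t < tN; ++t) {
        fdtd_step<<<grid, block>>>(pnsub1, pn, pnplus1, nx, ny, c2);
        // Rotate the host-side pointers; no data moves on the device.
        float* old = pnsub1;
        pnsub1 = pn;
        pn = pnplus1;
        pnplus1 = old;  // the oldest buffer is overwritten in the next step
    }

    std::vector<float> out(size_t(nx) * ny);
    cudaMemcpy(out.data(), pn, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernels
    cudaFree(pnsub1); cudaFree(pn); cudaFree(pnplus1);
    return out;
}
```

Kernels launched into the same stream run in order, so no explicit synchronization is needed between timesteps.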

Thanks for pointing me in the right direction, tera. I will finish writing code to use one kernel invocation per timestep, and the timestep loop will run on the host. Once I’m satisfied with this code, I will try to use persistent threads.

How would I swap pointers to pitch linear texture? If the pointers are for pitch linear memory allocated on the GPU device and then bound to 2D texture memory, how do I swap the pointers within the timestep running on the host?

Is there a good persistent threads example that I can use for reference when implementing this type of threaded programming?

How would I swap the pointers to memory? Within the time loop on the host, I’ve done the following:

// swap the matrices
// pnsub1 = pn
// pn = pnplus1
CUDA_CHECK( cudaMemcpy (pnsub1, pn, Nx * Ny * sizeof(float), cudaMemcpyDeviceToDevice) );
CUDA_CHECK( cudaMemcpy (pn, pnplus1, Nx * Ny * sizeof(float), cudaMemcpyDeviceToDevice) );

However, once this code is run I receive the following error:

unspecified launch failure

What do I have to do to be able to swap the pointers? Since pn, pnsub1, and pnplus1 are pointers to pitch linear memory allocated on the GPU device, how would I swap the pointers using code running on the host?

Could anyone comment on the possibility of swapping the pointers to texture memory? Does the swap have to be done on the device (by starting another kernel to do the swap), or can the swap be done on the host by copying memory?

Essentially what I have are pointers to pitch linear memory. The pitch linear memory has been allocated on the device. Moreover, the pitch linear memory has been bound to a 2D texture.

You need to rebind the texture references to the arrays after swapping. Also, for pitch-linear memory you need to use something like cudaMemcpy2D rather than cudaMemcpy, because each row is padded out to the pitch.
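A sketch of how this could look follows. It uses texture objects rather than the texture references discussed in this thread, since a texture object can be carried around and swapped together with its pointer (and the texture reference API is no longer available in recent toolkits). The struct `Field`, the helpers `make_field` and `run_tex`, the kernel `step_tex`, and the stencil itself are all hypothetical. `cudaMemcpy2D` handles the row padding when data is actually copied. Separately, an "unspecified launch failure" returned by a memcpy call may come from an earlier kernel, since asynchronous errors are reported by later runtime calls.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Field {  // hypothetical: a pitched buffer plus the texture object that reads it
    float* ptr = nullptr;
    size_t pitch = 0;
    cudaTextureObject_t tex = 0;
};

Field make_field(int nx, int ny)
{
    Field f;
    cudaMallocPitch((void**)&f.ptr, &f.pitch, nx * sizeof(float), ny);
    cudaMemset2D(f.ptr, f.pitch, 0, nx * sizeof(float), ny);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr = f.ptr;
    res.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width = nx;
    res.res.pitch2D.height = ny;
    res.res.pitch2D.pitchInBytes = f.pitch;
    cudaTextureDesc td = {};
    td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;     // fetch exact texels, no interpolation
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaCreateTextureObject(&f.tex, &res, &td, nullptr);
    return f;
}

__global__ void step_tex(cudaTextureObject_t prev, cudaTextureObject_t cur,
                         float* next, size_t nextPitch, int nx, int ny, float c2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;
    float fx = x + 0.5f, fy = y + 0.5f;      // texel centres
    float c = tex2D<float>(cur, fx, fy);
    float lap = tex2D<float>(cur, fx - 1, fy) + tex2D<float>(cur, fx + 1, fy)
              + tex2D<float>(cur, fx, fy - 1) + tex2D<float>(cur, fx, fy + 1) - 4.0f * c;
    float* row = (float*)((char*)next + y * nextPitch);  // pitched row address
    row[x] = 2.0f * c - tex2D<float>(prev, fx, fy) + c2 * lap;
}

std::vector<float> run_tex(int nx, int ny, int tN, float c2)
{
    Field pnsub1 = make_field(nx, ny), pn = make_field(nx, ny), pnplus1 = make_field(nx, ny);
    std::vector<float> host(size_t(nx) * ny, 0.0f);
    host[(ny / 2) * nx + nx / 2] = 1.0f;     // unit impulse at the centre
    size_t rowBytes = nx * sizeof(float);
    cudaMemcpy2D(pn.ptr, pn.pitch, host.data(), rowBytes, rowBytes, ny, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    for (int t = 0; t < tN; ++t) {
        step_tex<<<grid, block>>>(pnsub1.tex, pn.tex, pnplus1.ptr, pnplus1.pitch, nx, ny, c2);
        // Rotate pointer and texture object together; nothing is copied or rebound.
        Field old = pnsub1; pnsub1 = pn; pn = pnplus1; pnplus1 = old;
    }
    cudaMemcpy2D(host.data(), rowBytes, pn.ptr, pn.pitch, rowBytes, ny, cudaMemcpyDeviceToHost);
    for (Field f : {pnsub1, pn, pnplus1}) { cudaDestroyTextureObject(f.tex); cudaFree(f.ptr); }
    return host;
}
```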

Thanks, Lev; I will give this a try and then post back my findings here. Many thanks!