# Parallelization of Numerical Integration Loop? How is loop parallelization accomplished in CUDA?

I have a C/C++ code that computes antenna radiation patterns using a computational electromagnetics technique called Physical Optics. I would like to port it to the Tesla architecture. My question is: can a typical C language “for loop” that performs a numerical integration of complex-valued double-precision (or single-precision) numbers be parallelized in CUDA to run on the Tesla architecture? If so, I am very interested in using CUDA and the Tesla architecture. Thanks!

I am not sure, but if the numerical integration can be performed as a set of independent “additions”, it could scale superbly on a GPU.

I.e., if Integral A to B of f(x) == Integral A to C of f(x) + Integral C to B of f(x), then you could compute the A-to-C and C-to-B pieces in separate threads, and later launch another kernel that adds them up.

I am not sure whether this property holds for all types of functions. You may have to ponder it.

When the intervals are made very small, you get high precision and good GPU scaling as well.
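To make the idea of splitting the interval concrete, here is a minimal sketch, not a tuned implementation. It assumes a placeholder integrand `f(x) = x*x` and the midpoint rule; the names `integrate_pieces` and `integrate_on_gpu` are made up for illustration. Each thread integrates one sub-interval, and for brevity the partial results are added on the host rather than in a second kernel. Error checking of the CUDA calls is left out.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Placeholder integrand (an assumption for this sketch); swap in the real one.
__device__ double f(double x) { return x * x; }

// Each thread integrates its own sub-interval of [a, b] with the midpoint rule.
__global__ void integrate_pieces(double a, double b, int pieces,
                                 int steps_per_piece, double *partial)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= pieces) return;                 // surplus threads do nothing

    double width = (b - a) / pieces;         // width of one sub-interval
    double lo    = a + p * width;            // left edge of this thread's piece
    double h     = width / steps_per_piece;  // step size inside the piece

    double sum = 0.0;
    for (int i = 0; i < steps_per_piece; ++i)
        sum += f(lo + (i + 0.5) * h);        // sample at the midpoint of each step
    partial[p] = sum * h;
}

// Host wrapper: launch the kernel, copy the partial integrals back, add them up.
double integrate_on_gpu(double a, double b, int pieces, int steps_per_piece)
{
    assert(pieces > 0 && steps_per_piece > 0);

    double *d_partial = nullptr;
    cudaMalloc(&d_partial, pieces * sizeof(double));

    int threads = 256;
    int blocks  = (pieces + threads - 1) / threads;
    integrate_pieces<<<blocks, threads>>>(a, b, pieces, steps_per_piece, d_partial);

    std::vector<double> h_partial(pieces);
    cudaMemcpy(h_partial.data(), d_partial, pieces * sizeof(double),
               cudaMemcpyDeviceToHost);      // also waits for the kernel to finish
    cudaFree(d_partial);

    double total = 0.0;
    for (double v : h_partial) total += v;   // final "addition" step, done on the CPU
    return total;
}
```

Calling `integrate_on_gpu(0.0, 1.0, 4096, 64)` from `main` should give a value very close to 1/3 for this placeholder integrand. Note that double precision needs a GPU that supports it; on older hardware you would switch to `float`.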

Best Regards,
Sarnath

Could you paste some pseudocode perhaps?

I presume you are doing something like computing a big surface integral in the frequency domain, where you first transform some arbitrary surface geometry onto a regular, rectilinear domain, and then perform quadrature of the transformed function on that domain?

That sort of cell/element processing followed by summation can parallelize rather well in CUDA. There are a number of papers floating around describing finite volume and discontinuous Galerkin finite element implementations in CUDA which are broadly similar (although mostly restricted to real-valued rather than complex-valued problems). Without knowing all that much about PO, I don’t see why it shouldn’t be feasible. You would probably want to use a kernel with something like a 1 thread = 1 cell mapping for the preliminary element processing (so unrolling the i,j,k loops into an i,j,k thread space, where one thread does all of the calculations on a single cell), and then a second kernel to perform a parallel summation of the individual cell contributions to yield the final surface integral.
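A rough sketch of that two-kernel layout might look like the following. Everything specific here is an assumption for illustration: the unit-square domain, the placeholder integrand exp(i*k*(x+y)), and the names `cell_contributions`, `block_sum` and `surface_integral`. It is not a Physical Optics kernel. The first kernel maps one thread to one cell; the second does a shared-memory tree reduction of the complex cell values, leaving one partial sum per block, and those few partial sums are added on the host. It uses `cuDoubleComplex` from `cuComplex.h`, and error checking is omitted.

```cuda
#include <cmath>
#include <cuda_runtime.h>
#include <cuComplex.h>

const int kThreads = 256;   // threads per block in the reduction kernel
const int kBlocks  = 64;    // number of reduction blocks (one partial sum each)

// Kernel 1: one thread = one cell. Evaluates the placeholder integrand
// exp(i*k*(x+y)) at the cell centre and multiplies by the cell area.
__global__ void cell_contributions(int nx, int ny, double k, cuDoubleComplex *cell)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;

    double hx = 1.0 / nx, hy = 1.0 / ny;             // cell size on the unit square
    double x = (i + 0.5) * hx, y = (j + 0.5) * hy;   // cell centre
    double phase = k * (x + y);
    double area  = hx * hy;
    cell[j * nx + i] = make_cuDoubleComplex(cos(phase) * area, sin(phase) * area);
}

// Kernel 2: each block sums a strided share of the cells, then reduces its
// threads' results in shared memory. Must be launched with kThreads threads.
__global__ void block_sum(const cuDoubleComplex *in, int n, cuDoubleComplex *block_out)
{
    __shared__ cuDoubleComplex s[kThreads];
    int tid = threadIdx.x;

    cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
    for (int idx = blockIdx.x * blockDim.x + tid; idx < n; idx += gridDim.x * blockDim.x)
        acc = cuCadd(acc, in[idx]);
    s[tid] = acc;
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] = cuCadd(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0) block_out[blockIdx.x] = s[0];
}

// Host wrapper: run both kernels, then add the per-block sums on the CPU.
cuDoubleComplex surface_integral(int nx, int ny, double k)
{
    int n = nx * ny;
    cuDoubleComplex *d_cell = nullptr, *d_block = nullptr;
    cudaMalloc(&d_cell, n * sizeof(cuDoubleComplex));
    cudaMalloc(&d_block, kBlocks * sizeof(cuDoubleComplex));

    dim3 tile(16, 16);
    dim3 grid((nx + tile.x - 1) / tile.x, (ny + tile.y - 1) / tile.y);
    cell_contributions<<<grid, tile>>>(nx, ny, k, d_cell);
    block_sum<<<kBlocks, kThreads>>>(d_cell, n, d_block);

    cuDoubleComplex h_block[kBlocks];
    cudaMemcpy(h_block, d_block, sizeof(h_block), cudaMemcpyDeviceToHost);
    cudaFree(d_cell);
    cudaFree(d_block);

    cuDoubleComplex total = make_cuDoubleComplex(0.0, 0.0);
    for (int b = 0; b < kBlocks; ++b) total = cuCadd(total, h_block[b]);
    return total;
}
```

With `k = 0` the integrand is 1 everywhere, so `surface_integral(256, 256, 0.0)` should come out close to 1 + 0i, which makes a handy sanity check before plugging in a real integrand. Storing every cell value and reducing afterwards keeps the two stages separate, as described above; fusing them into one kernel would save memory traffic but is harder to read.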