anti-diagonal matrix parallelism

How do I traverse the anti-diagonals of a matrix and compute the elements of each anti-diagonal at the same time (in parallel)?

That is, how can I loop over the elements of an anti-diagonal in CUDA?
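One common approach is a "wavefront". The anti-diagonals are processed one after another, and all elements of a single anti-diagonal are computed in parallel with one thread per element. On anti-diagonal `d` every cell satisfies `i + j == d`. The valid row indices therefore run from `max(0, d - cols + 1)` to `min(d, rows - 1)`, and the column is `j = d - i`. Below is a minimal sketch under some assumptions. The recurrence (counting monotone lattice paths, `A[i][j] = A[i-1][j] + A[i][j-1]`) is a hypothetical stand-in for whatever you compute, chosen because each cell depends only on the previous anti-diagonal. The names `antiDiagonalKernel` and `latticePathsWavefront` are made up for illustration.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example recurrence (substitute your own):
//   A[i][j] = 1                         if i == 0 or j == 0
//   A[i][j] = A[i-1][j] + A[i][j-1]     otherwise
// A[i][j] only reads cells on anti-diagonal i + j - 1, so every cell
// with the same i + j can be computed independently, in parallel.

// One launch handles anti-diagonal d; thread k handles its k-th cell.
__global__ void antiDiagonalKernel(long long* A, int rows, int cols, int d)
{
    int iStart = max(0, d - cols + 1);   // first row index on diagonal d
    int iEnd   = min(d, rows - 1);       // last row index on diagonal d
    int i = iStart + blockIdx.x * blockDim.x + threadIdx.x;
    if (i > iEnd) return;                // thread lies past the diagonal
    int j = d - i;                       // column follows from i + j == d

    if (i == 0 || j == 0)
        A[i * cols + j] = 1;
    else
        A[i * cols + j] = A[(i - 1) * cols + j] + A[i * cols + (j - 1)];
}

// Host loop over the rows + cols - 1 anti-diagonals (d = 0 .. rows+cols-2).
// Error checking of CUDA calls is omitted to keep the sketch short.
std::vector<long long> latticePathsWavefront(int rows, int cols)
{
    size_t bytes = sizeof(long long) * rows * cols;
    long long* dA = nullptr;
    cudaMalloc(&dA, bytes);

    const int threads = 256;
    for (int d = 0; d <= rows + cols - 2; ++d) {
        int len = std::min(d, rows - 1) - std::max(0, d - cols + 1) + 1;
        int blocks = (len + threads - 1) / threads;
        // Launches on the same stream run in order, so diagonal d
        // sees the finished values of diagonal d - 1.
        antiDiagonalKernel<<<blocks, threads>>>(dA, rows, cols, d);
    }

    std::vector<long long> hA(static_cast<size_t>(rows) * cols);
    cudaMemcpy(hA.data(), dA, bytes, cudaMemcpyDeviceToHost);  // waits for kernels
    cudaFree(dA);
    return hA;
}

int main()
{
    const int rows = 4, cols = 5;
    std::vector<long long> A = latticePathsWavefront(rows, cols);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) printf("%5lld", A[i * cols + j]);
        printf("\n");
    }
    // Paths to the bottom-right corner: C(rows+cols-2, rows-1) = C(7,3) = 35.
    assert(A[rows * cols - 1] == 35);
    return 0;
}
```

This launches one kernel per anti-diagonal, so it pays launch overhead `rows + cols - 1` times. For small matrices, a single thread block can loop over `d` itself and call `__syncthreads()` between diagonals. That variant only works while everything fits in one block (or each thread strides over several cells), so it is a trade-off rather than a strict improvement.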