The OP’s physics isn’t the kindergarten physics of so-called “physics engines” but real physics.
C[m, n, o, p] = C[i, j ,k ,l]*A[i, m]*A[n, j]*A[o, k]*A[p, l]
Ahh, the beauty of tensor mathematics and the Einstein summation convention.
I’m not sure of any way that you can parallelize this on the GPU, however. GPU performance doesn’t get decent until you have more than 10,000 threads, and there is no way to break this tiny computation down to that level. Do you by chance have thousands of these C tensors to calculate? You could just handle one per thread with some tricky manipulation to get all the memory reads coalesced.
If you have thousands of C tensors to calculate, you might consider taking the simplest approach first and just having each thread calculate a single tensor. It doesn’t expose all of the parallelism present, but with CUDA often times the “Keep it Simple” is the best way to go.
If this problem occurs at equilibrium state,I suggest the formula be changed as below:
C1[m,n,o,p]=f(C0[i,j,k,l])
and a SCF module be added to converge it.
It will be easy to be parallelized.
It works only for equilibrium state. That is to say , finally your C matrix are stable.
But it seems that the problem you described is not balanceable. For such path-depended and many-body sensitive system, parallelization is very difficult.