Nested loops

for(int am=1; am<4; am++){
   for(int an=1; an<4; an++){
      for(int ao=1; ao<4; ao++){
         for(int ap=1; ap<4; ap++){
            int t=0;
            for(int ai=1; ai<4; ai++){
               for(int aj=1; aj<4; aj++){
                  for(int ak=1; ak<4; ak++){
                     for(int al=1; al<4; al++){
                        t=t+C[ai,aj,ak,al];
                     }}}}
            C[am,an,ao,ap]=t;
         }}}}

That gives 3x3x3x3 = 81 iterations of the outer loops.

The first part of the code could be written out as 81 threads.

I would rather not write out all the products by hand.

How do I unroll the nested loops?

Where can I find methods for unrolling nested loops?
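One common way to turn the four outer loops into 81 independent work items is to give each item a flat number from 0 to 80 and decode it into four indices, like the digits of a base-3 number. Below is a minimal sketch of that decoding. The names Index4, unflatten81 and flatten81 are made up for illustration, and the indices are 0-based (0..2), unlike the 1..3 used in the loops above.

```c
#include <assert.h>

/* Hypothetical helper type: four 0-based indices, each in [0, 3). */
typedef struct { int m, n, o, p; } Index4;

/* Decode a flat work-item number t in [0, 81) into (m, n, o, p).
   This is the same as reading t as a 4-digit base-3 number. */
Index4 unflatten81(int t)
{
    Index4 ix;
    assert(t >= 0 && t < 81);
    ix.p = t % 3;  t /= 3;   /* fastest-varying index */
    ix.o = t % 3;  t /= 3;
    ix.n = t % 3;  t /= 3;
    ix.m = t;                /* slowest-varying index */
    return ix;
}

/* Inverse mapping: (m, n, o, p) back to the flat number. */
int flatten81(Index4 ix)
{
    return ((ix.m * 3 + ix.n) * 3 + ix.o) * 3 + ix.p;
}
```

On a GPU, t would typically come from the thread index, so nothing has to be written out by hand.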

Parallel programming

I’m not sure if this is a good way to program your applications :s

there must be a better way…

What are you trying to achieve here?

Solid-state physics:
the elasticity matrix

C[m, n, o, p] = C[i, j, k, l]*A[i, m]*A[n, j]*A[o, k]*A[p, l]

i, j, k, l = 1..3

1. Where is A in the code?

2. Since C[m,n,o,p] is updated in place, C[i,j,k,l] changes on every iteration. Is that what you wanted? It seems very complex.
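To illustrate both points, here is a rough sketch that includes A and writes the result to a separate output array, so the input is never overwritten while it is still being read. Several things are my assumptions and not from the original code: the name rotate_tensor4, 0-based indices, flat row-major storage (9 values for A, 81 for the tensor), double precision, and the index order A[m][i]*A[n][j]*A[o][k]*A[p][l] (the formula as posted writes the first factor as A[i, m]).

```c
/* Flat index of element (a, b, c, d) of a 3x3x3x3 tensor, 0-based. */
static int idx4(int a, int b, int c, int d)
{
    return ((a * 3 + b) * 3 + c) * 3 + d;
}

/* out[m][n][o][p] = sum over i,j,k,l of
       A[m][i] * A[n][j] * A[o][k] * A[p][l] * in[i][j][k][l]
   A holds 9 values (row-major 3x3); in and out hold 81 values each.
   in and out must be different arrays: out is written while in is
   still being read. */
void rotate_tensor4(const double *A, const double *in, double *out)
{
    for (int m = 0; m < 3; m++)
    for (int n = 0; n < 3; n++)
    for (int o = 0; o < 3; o++)
    for (int p = 0; p < 3; p++) {
        double t = 0.0;
        for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
        for (int k = 0; k < 3; k++)
        for (int l = 0; l < 3; l++)
            t += A[m * 3 + i] * A[n * 3 + j] * A[o * 3 + k] * A[p * 3 + l]
                 * in[idx4(i, j, k, l)];
        out[idx4(m, n, o, p)] = t;
    }
}
```

With separate arrays, the 81 output elements no longer depend on each other, so the order in which they are computed stops mattering.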

I’m not into the physics of solid or rigid bodies anymore, but back when I had to use physics we used those open source engines.

The OP’s physics isn’t the kindergarten physics of so-called “physics engines” but real physics.

C[m, n, o, p] = C[i, j, k, l]*A[i, m]*A[n, j]*A[o, k]*A[p, l]

Ahh, the beauty of tensor mathematics and the Einstein summation convention.

I’m not sure of any way that you can parallelize this on the GPU, however. GPU performance doesn’t get decent until you have more than 10,000 threads, and there is no way to break this tiny computation down to that level. Do you by chance have thousands of these C tensors to calculate? You could just handle one per thread with some tricky manipulation to get all the memory reads coalesced.

A =

[  cos(a)  sin(a)  0 ]
[ -sin(a)  cos(a)  0 ]
[  0       0       1 ]

It looks like this; I have not put it into the code yet.

It is in fact only 9 values.

The first part (the 4 outer loops) I can write out by hand; that gives 81 threads.

But if I try to unroll all the loops completely, that gives 6561.

3^8 = 6561. Or am I wrong?

Expand everything on paper and then type in the code :blink:

yes
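On the 3^8 = 6561 count: that is the number of terms if all eight loops are expanded at once. A different technique (not what the code above does) is to contract one index at a time. Each pass then needs only 81 x 3 = 243 multiply-adds, so four passes come to 972 in total. The sketch below assumes the same conventions as a plain row-major layout: 0-based indices, flat storage, and the factor order A[m][i]*A[n][j]*A[o][k]*A[p][l]. The names contract_first and rotate_tensor4_staged are hypothetical.

```c
/* One pass: dst[b][c][d][x] = sum over a of A[x][a] * src[a][b][c][d].
   The contracted index moves to the last position, so after four
   passes the indices are back in their original order. */
static void contract_first(const double *A, const double *src, double *dst)
{
    for (int r = 0; r < 27; r++) {          /* r encodes (b, c, d) */
        for (int x = 0; x < 3; x++) {
            double t = 0.0;
            for (int a = 0; a < 3; a++)
                t += A[x * 3 + a] * src[a * 27 + r];
            dst[r * 3 + x] = t;
        }
    }
}

/* Four passes of 243 multiply-adds each instead of one sum of 6561 terms.
   A holds 9 values (row-major 3x3); in and out hold 81 values each. */
void rotate_tensor4_staged(const double *A, const double *in, double *out)
{
    double t1[81], t2[81];                  /* scratch tensors */
    contract_first(A, in, t1);              /* contracts i, produces m */
    contract_first(A, t1, t2);              /* contracts j, produces n */
    contract_first(A, t2, t1);              /* contracts k, produces o */
    contract_first(A, t1, out);             /* contracts l, produces p */
}
```

The trade-off is that the passes have to run one after another, although the 81 outputs inside each pass are still independent.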

If you have thousands of C tensors to calculate, you might consider taking the simplest approach first and just having each thread calculate a single tensor. It doesn’t expose all of the parallelism present, but with CUDA often times the “Keep it Simple” is the best way to go.
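A rough sketch of the one-tensor-per-thread idea, written as plain C so that the loop over t stands in for the thread index. Everything here is an assumption for illustration: the name rotate_batch_interleaved, 0-based indices, a single shared A, and an interleaved layout in which element e of tensor t sits at position e*count + t. With that layout, neighbouring values of t touch neighbouring addresses, which is the usual condition for coalesced reads.

```c
/* Rotate `count` tensors that share one 3x3 matrix A (9 values, row-major).
   Layout assumption: element e (0..80) of tensor t is stored at
   in[e * count + t], and likewise for out. in and out must not overlap. */
void rotate_batch_interleaved(const double *A, const double *in,
                              double *out, int count)
{
    for (int t = 0; t < count; t++) {       /* on a GPU: one thread per t */
        for (int e = 0; e < 81; e++) {      /* output element (m, n, o, p) */
            int m = e / 27, n = (e / 9) % 3, o = (e / 3) % 3, p = e % 3;
            double sum = 0.0;
            for (int s = 0; s < 81; s++) {  /* input element (i, j, k, l) */
                int i = s / 27, j = (s / 9) % 3, k = (s / 3) % 3, l = s % 3;
                sum += A[m * 3 + i] * A[n * 3 + j] * A[o * 3 + k]
                       * A[p * 3 + l] * in[s * count + t];
            }
            out[e * count + t] = sum;
        }
    }
}
```

In a real kernel the outer loop would disappear and t would be computed from the block and thread indices.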

Thank you.
I will try it.

If this problem describes an equilibrium state, I suggest changing the formula as below:
C1[m,n,o,p] = f(C0[i,j,k,l])
and adding an SCF module to converge it.
That would be easy to parallelize.

Could you explain in more detail? I did not understand the idea.

Simply as a one-dimensional function?

What does the abbreviation SCF stand for?

In my problem that is 6561*4 = 26244 bytes,
but only 16 kilobytes can be used per block
:(

SCF = self-consistent field

It works only for an equilibrium state, that is, one where your C matrix eventually becomes stable.

But it seems that the problem you described does not reach equilibrium. For such a path-dependent and many-body-sensitive system, parallelization is very difficult.
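For what it is worth, one reading of the C1 = f(C0) suggestion is a plain fixed-point loop: compute all new values from the old ones, then repeat until they stop changing. The sketch below is only that generic loop. The names update_fn, iterate_to_convergence and relax_toward_one are hypothetical, and relax_toward_one is a stand-in with a known fixed point, not the physical f.

```c
/* Hypothetical update: reads only c_old, writes only c_new. */
typedef void (*update_fn)(const double *c_old, double *c_new, int n);

/* Placeholder f for demonstration only: every value relaxes toward 1. */
void relax_toward_one(const double *c_old, double *c_new, int n)
{
    for (int e = 0; e < n; e++)
        c_new[e] = 0.5 * c_old[e] + 0.5;
}

/* Repeat c = f(c) until the largest change drops below tol.
   Returns the number of iterations used, or -1 if max_iter was hit. */
int iterate_to_convergence(update_fn f, double *c, double *scratch,
                           int n, double tol, int max_iter)
{
    for (int it = 1; it <= max_iter; it++) {
        f(c, scratch, n);                   /* all new values from old ones */
        double change = 0.0;
        for (int e = 0; e < n; e++) {
            double d = scratch[e] - c[e];
            if (d < 0.0) d = -d;
            if (d > change) change = d;
            c[e] = scratch[e];
        }
        if (change < tol) return it;
    }
    return -1;
}
```

Because f only reads the old array, every element of the new array can be computed independently within one iteration.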

If the code is changed as below (am counting down instead of up), will it fail for your system?

for(int am=3; am>0; am--){
   for(int an=1; an<4; an++){
      for(int ao=1; ao<4; ao++){
         for(int ap=1; ap<4; ap++){
            int t=0;
            for(int ai=1; ai<4; ai++){
               for(int aj=1; aj<4; aj++){
                  for(int ak=1; ak<4; ak++){
                     for(int al=1; al<4; al++){
                        t=t+C[ai,aj,ak,al];
                     }}}}
            C[am,an,ao,ap]=t;
         }}}}