Hello!

How can I parallelize this function with CUDA?

```
__global__ void mul(int *res, int *NUM1, int *NUM2, int w, int e)
{
    int i, j, carry, temp;
    for (i = 0; i < e; i++)
    {
        carry = 0;
        for (j = 0; j < w; j++)
        {
            temp = NUM2[i] * NUM1[j] + res[i + j] + carry;
            carry = temp / 10;
            res[i + j] = temp - (carry * 10);
        }
        res[i + j] = carry;
    }
}
```

NUM1 and NUM2 are 1D arrays.

Hello,

First, I have to point out that you read res[i+j] before it is set, so res has to be initialized (presumably zeroed) before the kernel runs. Second, you have the line res[i+j]=carry; after the inner loop, where j is already equal to w, so it writes to res[i+w]. If the code you provided is correct, it is not going to be very efficient on the GPU, because carry at step j depends on the value computed at step j-1. There is also a problem with the i+j index. On the CPU there is an implicit order in which i and j are executed. For example, i=5 and j=12 access res[17]. So do i=6 and j=11, i=7 and j=10, and so on. On the CPU we know that i=5 will be executed before i=6. On the GPU, threads that handle different values of i have no such ordering guarantee.

So each i depends on i-1, and for a given i each j depends on j-1, so I think it is not possible to make it parallel (assuming I understood the algorithm correctly).
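To make these dependencies concrete, here is a minimal host-side sketch in plain C (no CUDA) of the same loop. The names `mul_reference` and `writers_of` are hypothetical helpers introduced only for illustration. The sketch assumes that each array element holds one decimal digit, least significant digit first, and that `res` has room for `w + e` elements; neither is stated in the question.

```c
#include <assert.h>
#include <string.h>

/* Sequential reference for the loop in the question (plain C, no CUDA).
   Assumption: one decimal digit per int, least significant digit first.
   Assumption: res has room for w + e ints; it is zeroed here because the
   loop reads res[i + j] before writing it. */
void mul_reference(int *res, const int *num1, const int *num2, int w, int e)
{
    assert(res && num1 && num2 && w > 0 && e > 0);
    memset(res, 0, (size_t)(w + e) * sizeof(int));

    for (int i = 0; i < e; i++) {
        int carry = 0;
        int j;
        for (j = 0; j < w; j++) {
            /* Depends on carry from step j - 1 and on res[i + j],
               which earlier values of i may already have written. */
            int temp = num2[i] * num1[j] + res[i + j] + carry;
            carry = temp / 10;
            res[i + j] = temp - carry * 10;
        }
        res[i + j] = carry; /* here j == w, so this writes res[i + w] */
    }
}

/* Counts the inner-loop (i, j) pairs that touch res[k]. A count above 1
   means several iterations read and write the same element. */
int writers_of(int k, int w, int e)
{
    int count = 0;
    for (int i = 0; i < e; i++)
        for (int j = 0; j < w; j++)
            if (i + j == k)
                count++;
    return count;
}
```

For example, with NUM1 = {3, 2, 1} (the number 123) and NUM2 = {5, 4} (the number 45), `mul_reference` fills res with {5, 3, 5, 5, 0}, which reads as 5535. For the same sizes, `writers_of(1, 3, 2)` returns 2, because both (i=0, j=1) and (i=1, j=0) touch res[1]. If each (i, j) pair ran in its own GPU thread, those two updates would race.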