How can I parallelize this function with CUDA?
__global__ void mul(int *res,int *NUM1, int *NUM2,int w,int e)
NUM1 and NUM2 are 1D arrays.
First, I have to point out that you are using res[i+j] before it is set. Second, you have the line res[i+j]=carry; after the second loop. If the code you provided is correct, it is not going to be very efficient on the GPU, because the carry variable at iteration j appears to depend on its value from the previous iteration. You also appear to have a problem with the (i+j) index. On the CPU there is an implicit order in which i and j are executed. For example, you can have i=5 and j=12, which accesses the result at i+j=17. At the same time you can have i=6 and j=11, i=7 and j=10, and so on, which all access the same element. On the CPU we know that i=5 will be executed before i=6. On the GPU there is no such guaranteed order between threads, so these updates to the same element would race with each other.
So each i depends on i-1, and for a given i each j depends on j-1, so I think it is not possible to make it parallel (assuming that I understood the algorithm correctly).
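To make the two dependencies concrete, here is a minimal sequential sketch in plain C. The loop body from the question is not shown here, so this is only my assumption of what it looks like: a schoolbook multiplication of two digit arrays. The name mul_seq, the base-10 digits, and the least-significant-digit-first layout are all hypothetical choices for illustration.

```c
#include <assert.h>

/* Hypothetical sequential reference, NOT the code from the question.
   Assumes base-10 digits stored least significant first, and that
   res has room for w + e digits. */
void mul_seq(int *res, const int *NUM1, const int *NUM2, int w, int e)
{
    assert(w > 0 && e > 0);

    /* res[i + j] is read below, so it has to be set first. */
    for (int k = 0; k < w + e; ++k)
        res[k] = 0;

    for (int i = 0; i < w; ++i) {
        int carry = 0;
        for (int j = 0; j < e; ++j) {
            /* Reads res[i + j], which earlier rows (i-1, i-2, ...) also wrote. */
            int t = res[i + j] + NUM1[i] * NUM2[j] + carry;
            res[i + j] = t % 10;
            carry = t / 10;   /* consumed by iteration j + 1 */
        }
        /* After the inner loop j == e, so this is the res[i+j]=carry line. */
        res[i + e] = carry;
    }
}
```

With NUM1 = {2, 1} and NUM2 = {4, 3} (that is, 12 and 34), res comes out as {8, 0, 4, 0}, which is 408. Both the carry chain along j and the accumulation into res[i + j] across i rely on the loop order, and that is what a one-thread-per-(i, j) kernel would lose.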