Dependency of a matrix in a device

I’ve NxN Matrix.
I should write a CUDA kernel that perform operations on all diagonals of the matrix, in such a way that the calculation on the diagonal n + 1 occurs after that of the n diagonal.


compute on m[1][1]
compute on m[2][1],m[1][2]
compute on m[3][1],m[2][2],m[1][3]
compute on m[4][1],m[3][2],m[2][3],m[1][4]

…and so on.

I confused about how to write the code. Who can help me?

If I understand correctly your question, you should write a global function having the diagonal order n as input, making the computations on

m[n][1], m[n-1][2], m[n-2][3], ..., m[1][n]

and then calling such a kernel function iteratively from the CPU like

for (int n=0; n<N; n++) kernel<<<...,...>>>(n);

Yes, but for each diagonal the number of elements to be treated change and then, consequently for each call kernel number of elements to be treated change.

How do I bind a thread to each element?


-Call Num. 1: Thread 1 -> m[1][1]

-Call Num. 2: Thread 1 -> m[2][1] Thread 2 -> m[1][2]

-Call Num. 3: Thread 1 -> m[3][1] Thread 2 -> m[2][2] Thread 3 -> m[1][3]

…and so on.

You are right. I do not know why, but I had in mind that you had to make the same calculations on a large number of matrices in parallel, but, from your answer, it seems that you are operating on only one.

Anyway, could you provide some more details about your problem? Are the calculations “elementwise” or do they involve combinations of the diagonal elements? Of course, lanching a kernel with a single thread for the [1][1] diagonal would be highly inefficient.

I believe he’s talking about this:

In the post at, asterix87 is asking:

I have a NxN matrix. I have to update each element using a formula that needs the previous elements.

For example, to compute a value of a[i][j] I need a date value of a[i-1][j] and a[j-1].

In this case, a[i-i][j] and a[i][j-1] are on the same column and row of a[i][j], respectively, and not on the same diagonal. Is it the same problem?

Yes, the case is to. I ve to translate it into code.

Is the same problem. I ve to translate it into code.

If the problem is the one described in

then I would say (correct me if I’m wrong) that diagonal elements have nothing to do with what you have to code.

I would say that you need a 2d grid of threads, where each thread is associated to a different value of [i][j] and will change a[i][j] based on the values of a[i-1][j] and a[i][j-1].

Depending on the GPU architecture you are working with, you may wish to consider using texture memory for storing the matrix a.

Is the problem related to finite differences?

In this forum my problem is better explained:

in my case, I can use more kernel calls

The post on StackOverflow seems to have already a pertinent answer. Is it unsatisfactory to you or is it satisfactory and you want to be suggested with implementation details?

Do you have then a wavefront which propagates across the domain and a[i][j] is updated according to a[i-1][j] and a[i][j-1]? Do I understand correctly?

I would appreciate details of implementation

do you think it could work an approach of this kind?

int main{


for(i=1; i<=sqrt(dimMatrix); i++)



__global__ void kFunct(int n, float *mat){

  float *elem=malloc(n*sizeof(float));


 for(i=0; i<n; i++){

 //compute elem[] elements in parallel


P.S.: Can I use malloc function in kernel function?