Thread theory? Splitting up a lower triangle matrix into threads

Hi everyone,

Just a quick one really… I’m thinking about threads and the best way to split up my problem. I’m working on the lower triangle of a matrix. Ideally I want to use a 2-dimensional thread grid to map the matrix elements onto threads; however, as only the lower triangle of the matrix is being used, this approach will leave threads idle. Obviously I don’t want idle cores during execution, but I’m not sure how the scheduler works or how tightly threads and cores are linked. Does anyone know whether having idle threads means I will have idle cores? If so, does anyone have any suggestions for splitting the lower triangle of a matrix into threads?

Cheers,
crip_crop

Hey Dittoaway,

you seem to have edited away your post… was that intentional?

Yes. I realized I had written something incredibly stupid. :(

Probably the simplest way to deal with something like a triangular matrix is to forget about trying to map a grid onto the matrix and just launch a thread for every non-zero element, if you are doing element-wise operations. For dot products and the like, it would probably be more effective to have one warp or block per row or column, and just live with the fact that some warps or blocks will do less net work than others. That would certainly still be preferable to having half a grid’s worth of threads doing no work.

Depends on what you’re doing to the triangular matrix…

If the computation on each matrix element can be broken up, you could have thread (i, j) do part of the work and thread (j, i) do the rest. Of course the diagonal threads would have to do both.

I’ve written about my approach to this. It was for an upper triangular matrix, but I’m sure you can adjust it to your needs:

http://forums.nvidia.com/index.php?showtopic=179181&st=0&p=1111039&#entry1111039

It involves launching a 1D group of threads, with a quick algorithmic mapping to find each thread’s 2D coordinate.

Thanks for the responses, they’re really useful.

The code that I’m working on calculates the forces for individual atom pairs. As the calculations for each atom pair are independent, I’m going to try putting the atom-pair combinations into a 1D array and then decomposing that array into threads, so each thread works on one atom pair. This way I won’t suffer from load imbalance. I’m not too sure what the best strategy for computing the atom-pair array is, though. I have two options: compute the atom-pair array once on the host and then transfer it across to the device, where it will be decomposed into threads; or compute the atom-pair array on the device before decomposing it into threads… this, however, would mean every thread computes the array.

I’m not too sure which would be the better strategy, performance-wise. Any suggestions or comments would be much appreciated.

Crip_crop

Hi Antagonistic,

I’ve been playing around with your algorithm, trying to adjust it for the lower triangle (including the diagonal), but I’m really struggling. I guess my maths isn’t up to scratch. Could you give me any tips on how to do this?

Cheers,

Crip_crop