Mapping blocks to only half of a matrix: any previous experience?

Hello,

I have an algorithm that only needs to process the upper triangular part of a matrix.

What I'm doing now is mapping the whole N x N matrix to a 2D grid and then using an if-else to filter which blocks actually get processed. I'm using shared memory, by the way.
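For reference, the filtering approach described above might look roughly like the sketch below. All names (`block_in_upper_triangle`, `idle_block_count`, `full_grid_kernel`) are made up for illustration, square thread blocks are assumed, and the doubling is only a placeholder for the real work. The helpers compile as plain C++; the kernel is only seen by nvcc.

```cpp
#ifndef HD
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif
#endif

// True if the tile at (block_row, block_col) touches the upper triangle
// (diagonal included).
HD inline bool block_in_upper_triangle(int block_row, int block_col) {
    return block_col >= block_row;
}

// Blocks of an n_tiles x n_tiles grid that lie strictly below the diagonal
// and therefore return without doing any work.
HD inline long long idle_block_count(int n_tiles) {
    return static_cast<long long>(n_tiles) * (n_tiles - 1) / 2;
}

#ifdef __CUDACC__
// Hypothetical kernel launched on the full 2D grid (assumes blockDim.x == blockDim.y).
__global__ void full_grid_kernel(float* a, int n) {
    // Whole-block filter: depends only on blockIdx, so all threads agree.
    if (!block_in_upper_triangle(blockIdx.y, blockIdx.x)) return;

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Diagonal tiles straddle the diagonal, so a per-element check remains.
    if (row < n && col < n && col >= row) {
        a[row * n + col] *= 2.0f;  // placeholder for the real computation
    }
}
#endif
```

With a grid of n_tiles x n_tiles blocks, n_tiles*(n_tiles-1)/2 of them exit straight away, so close to half of the launched blocks do nothing but the index test. Because that test depends only on blockIdx, every thread of a block takes the same branch, so the early return should not interfere with later __syncthreads() calls.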

I was wondering: would I get a good speedup if I somehow managed to launch blocks only for the upper triangular part?

I did something similar for a covariance matrix computation.

I launch with N*(N+1)/2 blocks instead of N*N and then, in the kernel, compute the row and col from the block index (which is a tad messy). I don’t recall how much it improved performance over the “if(col >= row) compute_something; else return” version, since I was refactoring a bunch of things at the same time, but it seemed like a more logical implementation.

AFAIK the block scheduler schedules blocks in batches and wouldn’t backfill new blocks if individual ones returned before others, but I could be completely wrong.

I also think there might be a small speedup.

So you say you managed to get the column and row from the block index alone. Do you have any pointers on the logic?
I'm also trying to figure out a good method; it seems a little messy.
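
One possible way to get the row and column from a single block index is sketched below. This is not necessarily what the earlier poster did. It assumes the upper-triangular tiles are numbered column by column, so column c holds c+1 tiles (rows 0..c) and starts at index c*(c+1)/2; inverting that gives c = floor((sqrt(8k+1) - 1)/2). The names `BlockCoord`, `upper_tri_block_count`, `upper_tri_coord` and `upper_tri_kernel` are hypothetical, and the doubling is only a placeholder for the real work.

```cpp
#include <assert.h>
#include <math.h>

#ifndef HD
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif
#endif

struct BlockCoord { int row; int col; };

// Number of tiles in the upper triangle (diagonal included) of an
// n_tiles x n_tiles tile grid.
HD inline long long upper_tri_block_count(int n_tiles) {
    return static_cast<long long>(n_tiles) * (n_tiles + 1) / 2;
}

// Map a linear index k >= 0 to tile coordinates with row <= col.
HD inline BlockCoord upper_tri_coord(long long k) {
    assert(k >= 0);
    long long c = static_cast<long long>(
        (sqrt(8.0 * static_cast<double>(k) + 1.0) - 1.0) * 0.5);
    // Correct for possible floating-point rounding in the square root.
    while (c * (c + 1) / 2 > k) --c;
    while ((c + 1) * (c + 2) / 2 <= k) ++c;
    BlockCoord bc;
    bc.row = static_cast<int>(k - c * (c + 1) / 2);
    bc.col = static_cast<int>(c);
    return bc;
}

#ifdef __CUDACC__
// Hypothetical kernel launched on a 1D grid of upper_tri_block_count(n_tiles)
// blocks (assumes blockDim.x == blockDim.y).
__global__ void upper_tri_kernel(float* a, int n) {
    // bc.row / bc.col play the role of blockIdx.y / blockIdx.x in a 2D grid.
    BlockCoord bc = upper_tri_coord(blockIdx.x);
    int row = bc.row * blockDim.y + threadIdx.y;
    int col = bc.col * blockDim.x + threadIdx.x;
    // Diagonal tiles straddle the diagonal, so a per-element check remains.
    if (row < n && col < n && col >= row) {
        a[row * n + col] *= 2.0f;  // placeholder for the real computation
    }
}
#endif
```

A launch would then look something like `upper_tri_kernel<<<upper_tri_block_count(n_tiles), dim3(TILE, TILE)>>>(a, n);` with `n_tiles = (n + TILE - 1) / TILE`, where `TILE` is whatever tile width you already use. The square root costs a few instructions per block, which is probably negligible next to the real work, but I have not measured it.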