Triangular grid/block

Hi,

Sometimes there is no need to use the whole square grid or block, only its upper (or lower) triangular part, as with symmetric matrices. What is the fastest way to do this in CUDA?

Thanks in advance for any suggestions.

I remember this old thread on a similar topic - maybe you can take some inspiration from it.
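
I can't say this is what that thread settled on, but one common approach is to launch only n(n+1)/2 work items and convert each linear index back to a (row, col) pair with row <= col. Below is a minimal sketch under that assumption. The names TriIndex, tri_count, tri_unrank and symmetrize_upper are made up for illustration and do not come from any library. The pairs are enumerated column by column, so k = col*(col+1)/2 + row, and the two small loops only correct a possible off-by-one from floating-point rounding in sqrt.

```cpp
#include <math.h>

// Let the index helpers compile as plain C++ as well as under nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper type: one entry of the upper triangle (row <= col).
struct TriIndex { int row; int col; };

// Number of entries in the upper triangle (diagonal included) of an n x n matrix.
__host__ __device__ inline long long tri_count(int n) {
    return static_cast<long long>(n) * (n + 1) / 2;
}

// Map a linear index k >= 0 to (row, col) with row <= col,
// using the enumeration k = col*(col+1)/2 + row.
__host__ __device__ inline TriIndex tri_unrank(long long k) {
    // Closed-form estimate of the column from the quadratic formula.
    long long col = static_cast<long long>(
        (sqrt(8.0 * static_cast<double>(k) + 1.0) - 1.0) / 2.0);
    // Integer fix-up in case sqrt rounded the estimate up or down by one.
    while (col * (col + 1) / 2 > k) --col;
    while ((col + 1) * (col + 2) / 2 <= k) ++col;
    TriIndex t;
    t.row = static_cast<int>(k - col * (col + 1) / 2);
    t.col = static_cast<int>(col);
    return t;
}

#ifdef __CUDACC__
// Illustrative kernel: one thread per upper-triangle entry of a row-major
// n x n matrix; each thread copies its entry to the mirrored lower position.
__global__ void symmetrize_upper(float* a, int n) {
    long long k = static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (k >= tri_count(n)) return;  // surplus threads in the last block
    TriIndex t = tri_unrank(k);
    a[static_cast<size_t>(t.col) * n + t.row] =
        a[static_cast<size_t>(t.row) * n + t.col];
}
#endif
```

With 256 threads per block, the launch would use (tri_count(n) + 255) / 256 blocks. The same mapping can be applied to blockIdx.x if each block handles a tile rather than a single entry. A simpler alternative is to launch the full square grid and return immediately from blocks with blockIdx.x < blockIdx.y; whether that is slower in practice is worth measuring, since blocks that exit at once are usually cheap.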

Thank you, tera. I will consider these options.