Hi, I have an application that works on a matrix with M columns, each with F rows. This M*F matrix can be 20+ GB, which is beyond the memory capacity of any single GPU we have seen.
In the kernel, each thread irregularly and sparsely accesses many of the M columns. Any suggestions on how to speed up this type of application?
Options considered so far:
1. Put the matrix in host memory and use unified memory so that the kernel can access it. I expect this to be slow.
2. Stage the matrix into device memory in batches. Since the access is irregular, a thread cannot complete until the entire matrix has been staged in.
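
To make the unified-memory approach concrete, here is a minimal sketch of what I mean. Everything specific in it is an assumption for illustration only: the kernel name sparse_sum, the index array cols, the per-thread access count K, the column-major layout, and the toy sizes (the real matrix would be 20+ GB, which as far as I know needs a GPU and OS that support managed-memory oversubscription).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical access pattern: thread t reads one row of K columns whose
// indices are stored in cols[t*K .. t*K+K-1], and sums the values.
__global__ void sparse_sum(const float* matrix, const int* cols,
                           float* out, size_t F, int K, int nThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nThreads) return;
    size_t row = (size_t)t % F;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        size_t c = (size_t)cols[(size_t)t * K + k];
        acc += matrix[c * F + row];  // assumed column-major: column c is contiguous
    }
    out[t] = acc;
}

// Runs the sketch on toy sizes and returns the number of wrong results.
int run_sparse_sum_demo()
{
    const size_t M = 1024, F = 4096;      // toy sizes, not the real 20+ GB
    const int K = 8, nThreads = 1 << 16;  // assumed accesses per thread / thread count

    float* matrix = nullptr;
    int* cols = nullptr;
    float* out = nullptr;
    // Managed allocations are reachable from both host and device.
    CUDA_CHECK(cudaMallocManaged(&matrix, M * F * sizeof(float)));
    CUDA_CHECK(cudaMallocManaged(&cols, (size_t)nThreads * K * sizeof(int)));
    CUDA_CHECK(cudaMallocManaged(&out, (size_t)nThreads * sizeof(float)));

    // Fill on the host: all ones, so every per-thread sum should equal K.
    for (size_t i = 0; i < M * F; ++i) matrix[i] = 1.0f;
    srand(0);
    for (size_t i = 0; i < (size_t)nThreads * K; ++i) cols[i] = rand() % (int)M;

    const int block = 256;
    const int grid = (nThreads + block - 1) / block;
    sparse_sum<<<grid, block>>>(matrix, cols, out, F, K, nThreads);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());  // required before the host reads out[]

    int mismatches = 0;
    for (int t = 0; t < nThreads; ++t)
        if (out[t] != (float)K) ++mismatches;

    CUDA_CHECK(cudaFree(matrix));
    CUDA_CHECK(cudaFree(cols));
    CUDA_CHECK(cudaFree(out));
    return mismatches;
}
```

Calling run_sparse_sum_demo() from main on a machine with a CUDA-capable GPU should return 0 mismatches. The sketch only shows the mechanics: the kernel dereferences the managed pointer directly and the driver moves pages on demand. With scattered column accesses I would expect the cost of that on-demand movement to dominate, which is why I suspect this approach is slow at full size.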