I don’t know of any examples that do a similar thing, but you should be able to port this to the GPU fairly easily using OpenACC. With the “kernels” directive, the compiler will be able to parallelize your array syntax, though you’ll gain greater control over loop scheduling if you make these explicit loops. Matmul is a problem since we don’t support it on the device, but it’s straightforward to implement as loops instead, or you can call the device version of cuBLAS DGEMM/SGEMM.
Though, I’d suggest you familiarize yourself with OpenACC before beginning. You can find lots of resources on openacc.org, including an online course. See: Resources | OpenACC