I’m implementing an fp32 GEMM kernel in C++, with the store operations (to global memory) written in embedded PTX. It turns out that the stores are the bottleneck: commenting them out leads to a 13x speedup.
I have tried both “st.global.cg.f32” and “st.global.wt.f32”; the results are similar.
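For illustration, here is a minimal sketch of the kind of embedded-PTX store described above. It is not the actual GEMM kernel: the names store_cg and write_kernel, the problem size, and the launch configuration are all hypothetical, and it assumes the pointer comes from cudaMalloc so that it can be used directly as a global address.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: store one float to global memory using the .cg cache operator.
// Assumes ptr points into global memory (e.g. obtained from cudaMalloc).
__device__ __forceinline__ void store_cg(float* ptr, float value) {
    asm volatile("st.global.cg.f32 [%0], %1;" :: "l"(ptr), "f"(value) : "memory");
}

// Hypothetical kernel: each thread writes one element to a unique address.
__global__ void write_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) store_cg(out + i, 2.0f * i);
}

int main() {
    const int n = 1024;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return 1;

    write_kernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    float h_out[n];
    if (cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d_out);

    // Check that every store landed where expected.
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2.0f * i) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("all stores verified\n");
    return 0;
}
```

Swapping “cg” for “wt” in the asm string gives the second variant mentioned above.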
Each store goes to a unique address, and subsequent operations do not depend on the output memory, so I’m wondering whether there is an asynchronous (non-blocking) store instruction to global memory that could speed up my code.
Running on a GTX 1080 Ti, CUDA 10.2, Windows 7.
Thank you for your help!