Is there an asynchronous store (to global) instruction in PTX?

Hi,
I’m implementing an fp32 GEMM kernel in C++, with the store operations (to global memory) written in inline PTX. The stores appear to be the bottleneck: commenting them out leads to a 13x speedup.
I have tried both “st.global.cg.f32” and “st.global.wt.f32”, and the results are similar.
Since each store goes to a unique address and no subsequent operations depend on the stored data, is there an asynchronous (non-blocking) store instruction to global memory that could speed up my code?
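For reference, the stores I’m emitting look roughly like the minimal sketch below (an inline-PTX wrapper with illustrative names, not my exact kernel code); the .cg qualifier can be swapped for .wt:

// Minimal sketch of an inline-PTX global store with the .cg cache operator.
// `gptr` must point to global memory; the names here are illustrative only.
__device__ __forceinline__ void store_cg(float *gptr, float val)
{
    asm volatile("st.global.cg.f32 [%0], %1;"
                 :: "l"(gptr), "f"(val)
                 : "memory");
}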

Running on a GTX 1080 Ti, CUDA 10.2, Windows 7.

Thank you for your help!

Remember that ptxas is an optimizing compiler, not an assembler. If you remove the stores, all computation needed to produce the data for those stores becomes dead code and is eliminated by the compiler. Since this is GEMM, that means pretty much the entire kernel (possibly minus some fixed setup code due to the ABI) will disappear. You are getting a 13x speedup because you are now measuring what is essentially a null kernel.
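To make the effect concrete, here is a toy kernel (hypothetical, not your GEMM) that shows the same behavior: with the final store in place the loop survives, but comment it out and ptxas removes the loop entirely, leaving an essentially empty kernel.

// Toy example: the loop only exists in the generated SASS because its
// result is stored to global memory at the end.
__global__ void toy(const float *__restrict__ a, float *__restrict__ out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += a[i] * a[i];

    // Comment out this store and the loop above becomes dead code:
    // ptxas deletes it and the kernel compiles to (almost) nothing.
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}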

Machine code (SASS) is what ptxas produces and what actually runs on the GPU. You can look at it by applying cuobjdump --sass to your executable. I would suggest using the CUDA profiler to identify the performance bottlenecks in your code; I would be highly surprised if it flagged the stores as a bottleneck.

Thanks! You are probably right. I tuned the loads and shared-memory coalescing, and the speed is now 9x what it was before.