I’ve tried implementing a matrix multiply that’s as fast as I could make it. On my 8600GT OC, it gets 38 Glfops vs 23 with CUBLAS. Could somebody who has an 8800GTX get some numbers?
I tried very hard to optimize it, and experimented with many techniques. I’m afraid, however, that further improvement is unlikely without direct access to cubin. Minor changes give rather chaotic swings in performance (up-down by 30-200%), even with ptxas run at -O0. Turning on ptxas optimizations usually hurts performance.