Optimization tips for ARMv8 code on K1 Denver?

I got a Nexus 9 for doing ARMv8 work and have begun optimizing my code for 64-bit and ARMv8 assembly language. So far, the performance isn’t terribly impressive: my imaging codecs are running only about 10-15% faster than on the 2.3 GHz Qualcomm MSM8974 in my Nexus 5 (single-threaded, optimized C code). Is there a guide somewhere to help with optimizing for the K1? Instruction scheduling, cache prefetch tips, etc.?

Thanks,
Larry B.