Writing an application for Pascal

I am writing an application for the GPU.
On Fermi, Kepler, and Maxwell there is no problem, everything works fine.
But on Pascal (GeForce 1070) the application runs too slowly.

I use cuda.lib from CUDA 8 for the Pascal series.
These are my settings in the PTX:

.version 5.0
.target sm_60
.address_size 64

Why does everything run quickly on the old cards with CUDA 7.5, while the same code runs slowly on the new card with CUDA 8? After all, PTX 5.0 does not add any new memory instructions or anything like that.

Note that GTX 1070 has compute capability 6.1, not 6.0. Your observations are consistent with a build configuration that does not generate machine code (SASS) for sm_61, but requires JIT-compilation from PTX at runtime, which can create significant overhead. Double check your build settings.
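
As a quick check of which target to build for, here is a minimal sketch that prints the compute capability of each GPU through the driver API (the one cuda.lib exposes). The helper smTarget is a hypothetical name introduced only for this illustration, and the error handling is deliberately minimal.

```cuda
// Sketch: report each GPU's compute capability via the CUDA driver API.
// Build (assumption): nvcc query_cc.cu -o query_cc -lcuda
#include <cstdio>
#include <string>
#include <cuda.h>

// Hypothetical helper: formats a compute capability as an "sm_XY" target name.
std::string smTarget(int major, int minor) {
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

int main() {
    // The driver API must be initialized before any other call.
    CUresult res = cuInit(0);
    if (res != CUDA_SUCCESS) {
        std::fprintf(stderr, "cuInit failed with error code %d\n", (int)res);
        return 1;
    }

    int count = 0;
    res = cuDeviceGetCount(&count);
    if (res != CUDA_SUCCESS) {
        std::fprintf(stderr, "cuDeviceGetCount failed with error code %d\n", (int)res);
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256] = {0};
        int major = 0, minor = 0;

        // Stop at the first failing call for this device.
        if (cuDeviceGet(&dev, i) != CUDA_SUCCESS ||
            cuDeviceGetName(name, (int)sizeof(name), dev) != CUDA_SUCCESS ||
            cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev) != CUDA_SUCCESS ||
            cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev) != CUDA_SUCCESS) {
            std::fprintf(stderr, "Query failed for device %d\n", i);
            return 1;
        }

        std::printf("Device %d: %s, compute capability %d.%d, target %s\n",
                    i, name, major, minor, smTarget(major, minor).c_str());
    }
    return 0;
}
```

If the GTX 1070 is reported as 6.1, one option is to assemble the PTX ahead of time for that exact target, for example with `ptxas -arch=sm_61 kernel.ptx -o kernel.cubin` (kernel.ptx is a placeholder file name), and load the resulting cubin instead of the PTX so that no JIT step runs at load time.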