[Solved] Same cuBLAS functions run slower on the GTX 1080 than on the GTX 960M

I apologize in advance that I cannot write English properly.

I wrote a neural network program that uses cuDNN and cuBLAS.
The program computes correct results on both the GTX 1080 (I have two cards) and the GTX 960M.

My first GTX 1080 is in the PCIe x16 slot and the second GTX 1080 is in the PCIe x1 slot.
I disabled the second card in Device Manager, and the problem was solved.
Now I get correct results.

Both devices give me the correct results every time.
But the program runs slower on the GTX 1080 and faster on the GTX 960M.

When I run the program with Nsight, I can see which cuBLAS functions are slow.
I tried to explain this in the picture at the link below.

The same BLAS functions run slower on the 1080.
The functions receive the same arguments on both devices and are called the same number of times.
Only some cuBLAS functions, and one of my own kernels, run slowly.

I marked it in the picture.

How can I solve this problem, and what could be causing it?


Which CUDA version?

I get the same result when I compile the program as RELEASE and as DEBUG, on two different computers.

First computer, with the GTX 960M:
CUDA 9.1, cuBLAS 9.1, cuDNN 7,
Intel 6700HQ, Visual Studio 2013, driver 388.59

Second computer, with the GTX 1080 (I have two cards):
CUDA 9.0, cuBLAS 9.0, cuDNN 7,
AMD Ryzen 1700, Visual Studio 2013, driver 391.24

Third computer, with the GTX 1070:
CUDA 9.1, cuBLAS 9.1, cuDNN 7,
Intel 3770K (I didn't build on this computer)

There is no timing difference between the two compiled programs on the two different computers.

When I ran on the third computer, the GTX 1070 gave better results than the GTX 960M.

Total compute time: GTX 960M: 11 s, GTX 1070: 6.6 s, GTX 1080: 27 s.
Could the CPU be the reason?

I found out which kernels the problem originates from.

I measured kernel times on both machines with Nsight.
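As a cross-check on the Nsight numbers, per-call GPU time can also be measured directly in the program with CUDA events. A minimal sketch; the timed call is a placeholder for any of the kernels discussed below:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

void timeOneCall()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... the kernel launch or cuBLAS call to be timed goes here ...
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);              // wait until the timed work is done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Events time what actually ran on the GPU, so they are not distorted by host-side launch latency the way CPU timers around an asynchronous launch are.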

The problem is in the following for loop.
The code inside the for loop runs much slower on the 1080; the code outside the for loop runs much faster on the 1080.

// Note: the cuBLAS call headers were lost when pasting; they are reconstructed
// below from the kernel names in the timing comments (handle, transpose ops,
// and alpha/beta are assumptions).
for (int idx = 0; idx < C_size; idx++){
	int plusPtr1 = idx*A_size*B_size;
	int plusPtr2 = idx*A_size*B_size;
	int plusPtr3 = idx*A_size;
	int plusPtr4 = idx*B_size;

	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,   // reconstructed: k == 1 GEMM
		A_size, B_size, 1, &alpha,
		matrix4+plusPtr3, A_size,
		matrix10+plusPtr4, B_size, &beta,
		matrix8, A_size);
	// gemmk1_kernel        GTX 960M: 2.1 ms   GTX 1080: 46.6 ms

	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,   // reconstructed: n == 1 GEMM
		A_size, 1, B_size, &alpha,
		matrix3+plusPtr2, A_size,
		matrix10+plusPtr4, B_size, &beta,
		matrix9, A_size);
	// gemv2N_kernel        GTX 960M: 3.4 ms   GTX 1080: 47.7 ms

	myKernelElementWise<<<grid, block>>>(/* arguments lost in paste */
		A_size*B_size, 0.0f /* ... */);
	// myKernelElementWise  GTX 960M: 5.6 ms   GTX 1080: 47.6 ms

	cublasSaxpy(handle, A_size*B_size, &alpha,      // reconstructed axpy
		matrix11, 1,
		matrix6, 1);
	// axpy_kernel_val      GTX 960M: 1.6 ms   GTX 1080: 41.3 ms
}

// These kernels are called outside the for loop above:
// cudnn::activation_bw_4dkernel  GTX 960M: 361.5 ms   GTX 1080: 217.8 ms
// gemv2N_kernel                  GTX 960M: 20.9 ms    GTX 1080: 5.6 ms
// sgemm_128x128x8_TN_vec         GTX 960M: 365.1 ms   GTX 1080: 102.7 ms
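When m or n is 1, each GEMM in the loop is tiny, so per-call launch overhead can dominate the measured time rather than compute. If the iterations are independent across idx (i.e. each idx can write its own output slice; the posted loop writes every iteration into the same matrix9, so this only applies if that is not an accumulation), the whole loop can be fused into one cublasSgemmStridedBatched call. A sketch, assuming the matrix names and sizes from the loop above and assumed alpha/beta values:

```cuda
// One batched call instead of C_size separate n == 1 GEMMs.
// Batch strides reproduce the loop's offsets: plusPtr2 = idx*A_size*B_size,
// plusPtr4 = idx*B_size; the output stride A_size assumes one slice per idx.
const float alpha = 1.0f, beta = 0.0f;              // assumed values
cublasSgemmStridedBatched(
    handle, CUBLAS_OP_N, CUBLAS_OP_N,
    A_size, 1, B_size,
    &alpha,
    matrix3,  A_size, (long long)A_size * B_size,   // A, lda, strideA
    matrix10, B_size, (long long)B_size,            // B, ldb, strideB
    &beta,
    matrix9,  A_size, (long long)A_size,            // C, ldc, strideC
    C_size);                                        // batch count
```

One launch replaces C_size launches, so the overhead-per-iteration disappears; the same idea applies to the other per-idx cuBLAS calls in the loop.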

Could the problem be caused by this usage?

int plusPtr2 = idx*A_size*B_size;
int plusPtr4 = idx*B_size;

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,   // reconstructed: n == 1 GEMM
	A_size, 1, B_size, &alpha,
	matrix3+plusPtr2, A_size,
	matrix10+plusPtr4, B_size, &beta,
	matrix9, A_size);
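Since n == 1, this GEMM is just a matrix-vector product, and it can be expressed directly as cublasSgemv rather than relying on the GEMM heuristics to pick a kernel that happens to be fast on a given architecture. A sketch under the same assumptions (handle, transpose op, alpha, beta are not from the original post):

```cuda
// Matrix-vector form of the n == 1 GEMM above (assumes no transpose):
// matrix9 = alpha * matrix3[A_size x B_size] * matrix10[B_size] + beta * matrix9
const float alpha = 1.0f, beta = 0.0f;  // assumed values
cublasSgemv(handle, CUBLAS_OP_N,
            A_size, B_size, &alpha,
            matrix3 + plusPtr2, A_size,   // A_size x B_size matrix, lda = A_size
            matrix10 + plusPtr4, 1,       // input vector x, stride 1
            &beta,
            matrix9, 1);                  // output vector y, stride 1
```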