[Solved] Same cuBLAS functions run slower on the GTX 1080 than on the GTX 960M

MDAhmetKemal · June 4, 2018, 9:30pm

I apologize in advance for my English.

I wrote a neural network program that uses cuDNN and cuBLAS.
The program computes correct results on the GTX 1080 (I have two cards) and on the GTX 960M.

SOLUTION:
My first 1080 card is in the PCIe x16 slot and my second 1080 card is in the PCIe x1 slot.
I disabled the second card in Device Manager, and the problem was solved.
Now I get the expected results.

The program gives me the correct results every time on both devices.
However, it runs slower on the GTX 1080 and faster on the GTX 960M.

When I run the program under Nsight, I see that the slow functions are cuBLAS calls.
I tried to show this in the picture linked below.

The same BLAS functions run slower on the 1080.
The functions receive the same values on both devices and are called the same number of times.
Only some cuBLAS functions and my own kernel run slowly.

I marked it in the picture.

How can I solve this problem? Can it cause other problems?

[url]https://serving.photos.photobox.com/613421404da7eaca5771874d9216c821b10a7ce3526f1f7cccfd921eba0d5aef27ced42a.jpg[/url]

Robert_Crovella · June 4, 2018, 9:58pm

Which CUDA version?

MDAhmetKemal · June 5, 2018, 6:16am

I get the same timings whether I compile the program in Release or Debug mode, on two different computers.

First computer, with GTX 960M:
CUDA 9.1, cuBLAS 9.1, cuDNN 7
Intel 6700HQ, Visual Studio 2013, driver 388.59

Second computer, with GTX 1080 (I have two cards):
CUDA 9.0, cuBLAS 9.0, cuDNN 7
AMD Ryzen 1700, Visual Studio 2013, driver 391.24

Third computer, with GTX 1070:
CUDA 9.1, cuBLAS 9.1, cuDNN 7
Intel 3770K (I did not build on this computer)

The timings do not change between the two builds on either computer.

When I ran the program on the third computer, the GTX 1070 gave a better result than the GTX 960M.

Total compute time: GTX 960M: 11 seconds, GTX 1070: 6.6 seconds, GTX 1080: 27 seconds.
Could the CPU be the reason?

MDAhmetKemal · June 5, 2018, 12:34pm

I found out which kernels the problem comes from.

I measured kernel times on both machines with Nsight.

The problem is in the following for loop.
The code inside the for loop runs much slower on the 1080.
The code outside the for loop runs much faster on the 1080.

for (int idx = 0; idx < C_size; idx++){
	int plusPtr1 = idx*A_size*B_size;
	int plusPtr2 = idx*A_size*B_size;
	int plusPtr3 = idx*A_size;
	int plusPtr4 = idx*B_size;

	cublasSgemm_v2(
		cublasHandle,
		CUBLAS_OP_N, CUBLAS_OP_T,
		A_size,
		B_size, 1,
		&one,
		matrix4+plusPtr3, A_size,
		matrix10+plusPtr4, B_size,
		&zero,
		matrix8, A_size);
	// gemmk1_kernel: GTX 960M 2.1 ms, GTX 1080 46.6 ms

	cublasSgemm_v2(
		cublasHandle,
		CUBLAS_OP_N, CUBLAS_OP_N,
		A_size, 1, B_size,
		&one,
		matrix3+plusPtr2, A_size,
		matrix10+plusPtr4, B_size,
		&one,
		matrix9, A_size);
	// gemv2N_kernel: GTX 960M 3.4 ms, GTX 1080 47.7 ms

	cudaElmWise(matrix8+plusPtr2,
		matrix1,
		matrix7+plusPtr1,
		A_size*B_size, 0.0f,
		&stream);
	// myKernelElementWise: GTX 960M 5.6 ms, GTX 1080 47.6 ms

	cublasSaxpy_v2(
		cublasHandle,
		A_size*B_size,
		&learningRate,
		matrix11, 1,
		matrix6, 1);
	// axpy_kernel_val: GTX 960M 1.6 ms, GTX 1080 41.3 ms
}
// Other kernel timings, for comparison.
// These kernels are called outside the for loop above:
// cudnn:activation_bw_4dkernel  GTX 960M: 361.5 ms   GTX 1080: 217.8 ms
// gemv2N_kernel                 GTX 960M: 20.9 ms    GTX 1080: 5.6 ms
// sgemm_128x128x8_TN_vec        GTX 960M: 365.1 ms   GTX 1080: 102.7 ms
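
For readers following the loop: with k = 1, the first cublasSgemm_v2 call (CUBLAS_OP_N, CUBLAS_OP_T) reduces to an outer product of a vector of length A_size and a vector of length B_size, stored column-major. A minimal host-side sketch of that computation is shown here as a reference for checking results on a small case. The function name outerProductReference and the use of std::vector are assumptions for illustration; they are not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not from the original program).
// Computes C = alpha * a * b^T + beta * C in column-major order, which is
// what an SGEMM with (OP_N, OP_T), m = a.size, n = b.size, k = 1 produces.
// C is m x n with leading dimension m.
void outerProductReference(std::size_t m, std::size_t n, float alpha,
                           const std::vector<float>& a,
                           const std::vector<float>& b,
                           float beta, std::vector<float>& C) {
    assert(a.size() >= m && b.size() >= n && C.size() >= m * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            // When beta is zero the old contents of C are ignored.
            const float prev = (beta == 0.0f) ? 0.0f : beta * C[i + j * m];
            C[i + j * m] = alpha * a[i] * b[j] + prev;
        }
    }
}
```

Comparing a small device result against this reference on both machines is one way to confirm that the slowdown affects only timing and not the computed values.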

Could the problem be caused by the way I call the function?

int plusPtr2 = idx*A_size*B_size;
int plusPtr4 = idx*B_size;
cublasSgemm_v2(
		cublasHandle,
		CUBLAS_OP_N, CUBLAS_OP_N,
		A_size, 1, B_size,
		&one,
		matrix3+plusPtr2, A_size,
		matrix10+plusPtr4, B_size,
		&one,
		matrix9, A_size);
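
In terms of shapes, this call has n = 1, so it multiplies an A_size x B_size matrix by a vector of length B_size and accumulates into matrix9 (beta is one). That is a matrix-vector product, which is consistent with the gemv2N_kernel name in the timing comment; cuBLAS also offers cublasSgemv for this shape. A minimal host-side sketch of the same arithmetic is shown here as a reference only. The name matVecReference and the std::vector containers are assumptions for illustration and do not appear in the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not from the original program).
// Computes y = alpha * A * x + beta * y, where A is m x k in column-major
// order with leading dimension m. This matches an SGEMM with (OP_N, OP_N)
// and n = 1.
void matVecReference(std::size_t m, std::size_t k, float alpha,
                     const std::vector<float>& A,
                     const std::vector<float>& x,
                     float beta, std::vector<float>& y) {
    assert(A.size() >= m * k && x.size() >= k && y.size() >= m);
    for (std::size_t i = 0; i < m; ++i) {
        float sum = 0.0f;
        for (std::size_t p = 0; p < k; ++p) {
            sum += A[i + p * m] * x[p];  // element (i, p) of A
        }
        y[i] = alpha * sum + beta * y[i];
    }
}
```

Because beta is one and the output pointer is not offset by idx, each loop iteration adds its product to the same output vector.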
