Achieving Performance on the K620 GPU

Hi,

I have two different cards: a K600 (Kepler) and a K620 (Maxwell).

But I am not able to achieve the expected performance benefits from the K620 GPU :(

Previously my project was built with CUDA 4.0. Following a response in this forum to my earlier post about accuracy settings, I changed my project settings from CUDA 4 to CUDA 6.5. That improved the performance for both the K600 and the K620, but the K600 is still performing much better than the K620.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda_runtime.h>
#include <stdio.h>
#include <Windows.h>
#include <io.h>
#include <iostream>
#include <fstream>
#include <stdio.h>

using namespace std;


using namespace std;

__global__ void AddPixel(int* pMat1);

// Helper that dumps a single integer to a file (not used by the timing runs below).
void Writeintdata(char* szFileName, int val) {
	std::ofstream oFile;
	remove(szFileName);
	char szbuff[100];
	sprintf(szbuff, "%d ", val);
	oFile.open(szFileName);
	oFile.write(szbuff, strlen(szbuff));
	oFile.write("\n", 1);
	oFile.close();
}

int blockSize;   // The launch configurator returned block size
int minGridSize; // The minimum grid size needed to achieve the
                 // maximum occupancy for a full device launch
int gridSize;    // The actual grid size needed, based on input size

void MatLaunch(int* pMat1, int DIMX)
{
	int* pDevMat1 = NULL;
	size_t size = (size_t)DIMX * sizeof(int);

	cudaError_t iStatus = cudaMalloc((void**)&pDevMat1, size);
	if(iStatus == cudaSuccess)
	{
		iStatus = cudaMemcpy(pDevMat1, pMat1, size, cudaMemcpyHostToDevice);
	}
	if(iStatus == cudaSuccess)
	{
		// NOTE: the launch configuration is hard-coded; 4096 blocks x 256 threads
		// cover exactly 1M elements, so larger inputs are only partially processed
		// (the occupancy-derived gridSize/blockSize computed in main() are not used here).
		AddPixel<<<4096,256>>>(pDevMat1);

		cudaDeviceSynchronize();
		iStatus = cudaGetLastError();
	}
	if(iStatus == cudaSuccess)
	{
		iStatus = cudaMemcpy(pMat1, pDevMat1, size, cudaMemcpyDeviceToHost);
	}
	// Free the device buffer even if an earlier step failed.
	cudaFree(pDevMat1);
}

// Adds a constant to each element; one thread per element, no bounds check.
__global__ void AddPixel(int* pMat1)
{
	int X = blockIdx.x * blockDim.x + threadIdx.x;
	pMat1[X] = pMat1[X] + 25;
}



int main()
{

	std::ofstream myfile;
	myfile.open ("D:\\CUDA\\CUDA6.5_Old_Timings.csv", std::ios::out | std::ios::app );
	__int64 ctr1 = 0, ctr2 = 0, freq = 0;
	QueryPerformanceFrequency((LARGE_INTEGER *)&freq); // ticks per second, used to convert to milliseconds
	double wpcnt = 0;
	long size = 1024*1024;
	int* pMatLaunch = new int[size];  // allocate the full array (was "new int", a single element)
	for(int i = 0;i<size;i++)
	{
		pMatLaunch[i] = 50;
	}

	//cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
	//	AddPixel, 0, size); 
	//// Round up according to array size 
	//gridSize = (size + blockSize - 1) / blockSize; 

	myfile <<endl;
	myfile << "1kx1k Start";
	myfile <<endl;
	for(int i = 0;i<30;i++)  // 30 iterations; the average is taken excluding the first (warm-up) value
	{
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr1);
		MatLaunch(pMatLaunch,size);
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr2);
		wpcnt = (ctr2-ctr1) * 1000.0 / freq;  // elapsed time in milliseconds
		myfile << "Time taken for 1kx1k image is "<<","<< wpcnt<<endl;
	}
	}
	myfile << "1kx1k Complete";
	myfile << "-------------";
	myfile <<endl;

	ctr1 = 0, ctr2 = 0;
	size = 2048*2048;
	delete[] pMatLaunch;        // release the previous buffer before reallocating
	pMatLaunch = new int[size];
	for(int i = 0;i<size;i++)
	{
		pMatLaunch[i] = 50;
	}
	myfile <<endl;
	myfile << "2kx2k Start";
	myfile <<endl;

	cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
		AddPixel, 0, size); 
	// Round up according to array size 
	gridSize = (size + blockSize - 1) / blockSize; 
	// NOTE: gridSize/blockSize are computed here but MatLaunch still launches with
	// the hard-coded <<<4096,256>>> configuration.


	for(int i = 0;i<30;i++)
	{
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr1);
		MatLaunch(pMatLaunch,size);
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr2);
		wpcnt = (ctr2-ctr1) * 1000.0 / freq;  // elapsed time in milliseconds
		myfile << "Time taken for 2kx2k image is "<<","<< wpcnt<<endl;
	}
	myfile << "2kx2k Complete";
	myfile << "-------------";
	myfile <<endl;
	myfile <<endl;

	ctr1 = 0, ctr2 = 0;
	size = 4096*4096;
	delete[] pMatLaunch;
	pMatLaunch = new int[size];
	for(int i = 0;i<size;i++)
	{
		pMatLaunch[i] = 50;
	}


	myfile << "4kx4k Start";
	myfile <<endl;

	cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
		AddPixel, 0, size); 
	// Round up according to array size 
	gridSize = (size + blockSize - 1) / blockSize; 

	for(int i = 0;i<30;i++)
	{
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr1);
		MatLaunch(pMatLaunch,size);
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr2);
		wpcnt = (ctr2-ctr1) * 1000.0 / freq;  // elapsed time in milliseconds
		myfile << "Time taken for 4kx4k image is "<<","<< wpcnt<<endl;
	}
	myfile << "4kx4k Complete";
	myfile << "-------------";
	myfile <<endl;
	myfile <<endl;

	ctr1 = 0, ctr2 = 0;
	size = 6000*6000;
	delete[] pMatLaunch;
	pMatLaunch = new int[size];
	for(int i = 0;i<size;i++)
	{
		pMatLaunch[i] = 50;
	}

	myfile << "6kx6k Start";
	myfile <<endl;
	cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
		AddPixel, 0, size); 
	// Round up according to array size 
	gridSize = (size + blockSize - 1) / blockSize; 

	for(int i = 0;i<30;i++)
	{
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr1);
		MatLaunch(pMatLaunch,size);
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr2);
		wpcnt = (ctr2-ctr1) * 1000.0 / freq;  // elapsed time in milliseconds
		myfile << "Time taken for 6kx6k image is "<<","<< wpcnt<<endl;
	}
	myfile << "6kx6k Complete";
	myfile << "-------------";
	myfile <<endl;	

	ctr1 = 0, ctr2 = 0;
	size = 10000*10000;
	delete[] pMatLaunch;
	pMatLaunch = new int[size];
	for(int i = 0;i<size;i++)
	{
		pMatLaunch[i] = 50;
	}

	myfile << "10kx10k Start";
	myfile <<endl;
	cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
		AddPixel, 0, size); 
	// Round up according to array size 
	gridSize = (size + blockSize - 1) / blockSize; 
	for(int i = 0;i<30;i++)
	{
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr1);
		MatLaunch(pMatLaunch,size);
		QueryPerformanceCounter((LARGE_INTEGER *)&ctr2);
		wpcnt = (ctr2-ctr1) * 1000.0 / freq;  // elapsed time in milliseconds
		myfile << "Time taken for 10kx10k image is "<<","<< wpcnt<<endl;
	}
	myfile << "10kx10k Complete";
	myfile << "-------------";
	myfile <<endl;

	delete[] pMatLaunch;
	myfile.close();
	return 0;
}
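
(For reference: the occupancy-derived gridSize/blockSize above are never passed to MatLaunch, which always launches <<<4096,256>>>. If they were meant to drive the launch, MatLaunch would need to take the configuration as parameters. A minimal sketch of such a hypothetical variant, reusing the names from the code above, not part of the original project:)

// Hypothetical variant of MatLaunch that uses a caller-supplied launch configuration.
// AddPixel has no bounds check, so gridSize*blockSize must not exceed DIMX.
void MatLaunchEx(int* pMat1, int DIMX, int gridSize, int blockSize)
{
	int* pDevMat1 = NULL;
	size_t bytes = (size_t)DIMX * sizeof(int);
	if (cudaMalloc((void**)&pDevMat1, bytes) != cudaSuccess)
		return;
	cudaMemcpy(pDevMat1, pMat1, bytes, cudaMemcpyHostToDevice);
	AddPixel<<<gridSize, blockSize>>>(pDevMat1);
	cudaDeviceSynchronize();
	cudaMemcpy(pMat1, pDevMat1, bytes, cudaMemcpyDeviceToHost);
	cudaFree(pDevMat1);
}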

I tried changing the block size from 256 to 512 for matrix sizes 2K x 2K and above, but there is still no improvement in performance.

— With all the combinations I have tried, the K600 always shows better performance.
The project settings I am using are the same as those used by the 6.5 sample program -> C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\0_Simple\template_runtime

— If someone can check this and let me know whether the problem is in the code or the project settings, and what changes are needed so that my K620 card can outperform the K600 card, it would be very helpful.

In case some changes are needed in the project settings, it would be very helpful if you could upload the project settings file somewhere so that I can download and try it.

Thanks

Performance of your kernel is bound by memory bandwidth. According to a handy comparison table on Wikipedia (http://en.wikipedia.org/wiki/Nvidia_Quadro), the Quadro K600 and Quadro K620 both have a 128-bit DDR3 memory subsystem that provides 29 GB/sec. So the bandwidth of the memory on the two cards is identical; only the size of the memory differs (1 GB for the K600, 2 GB for the K620).

So with proper tuning your code should basically run at identical speed on both devices. How much of a difference are you seeing? Are you compiling for the appropriate architectures using the -arch switch of nvcc?
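
One quick way to confirm which device the runtime actually sees after swapping cards (and its memory configuration) is to query the device properties; a minimal sketch:

// Quick sanity check: which GPU does the runtime see, and what is its memory interface?
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("%s: compute capability %d.%d, %d-bit memory bus, memory clock %d kHz\n",
       prop.name, prop.major, prop.minor, prop.memoryBusWidth, prop.memoryClockRate);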

Thanks, njuffa, for replying.
The difference I am observing is as follows (all times in milliseconds, averaged over 30 iterations):

Matrix | 1024 x 1024 | 2048 x 2048 | 4096 x 4096 | 6000 x 6000 | 10000 x 10000
K600   | 13.42134483 | 43.94172414 | 162.8675172 | 344.8991379 | 948.8958966
K620   | 12.76       | 43.62696552 | 165.210931  | 350.8838276 | 966.8457586

There is a big difference in the case of the 10K matrix.

Regarding the project settings: as I said, I took the template_runtime sample project as my base project, so I am really not sure what other changes are needed to run the code properly on Maxwell.
This is the build log from building on the K600 PC (that is where the --machine 32 is coming from; I will update with the K620 build log after I change the card):

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.exe" -gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-version 2010 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"  -I./ -I../../common/inc -I./ -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\/include" -I../../common/inc -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include"     --keep-dir Release -maxrregcount=0  --machine 32 --compile -cudart static -Xcompiler "/wd 4819"     -DWIN32 -DWIN32 -D_MBCS -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Zi  /MT  " -o Release\MatrixAdd.cu.obj "E:\Cuda_SampleProject\MatrixAdd\MatrixAdd.cu"

It will be very helpful if you can upload a vcproject file that has the proper build settings for Maxwell.

Thanks

The difference between K600 and K620 for the 10K x 10K case appears to be 1.9%. I usually refer to differences of less than 2% as “noise”, too small to bother about.

Sorry, I do not use Visual Studio project files, I find them to be a complicated, non-intuitive, and difficult to debug way of building projects. You would want to make sure that your Maxwell build target uses -arch=sm_50 or equivalent nvcc flags.

There could be any number of reasons for the small performance difference between K600 and K620 despite the fact that they should theoretically have the same performance based on memory bandwidth. You may want to play with different thread-block configurations, e.g. 128, 256, 384 threads per block. From experimenting with the STREAM benchmark on GPUs I know that there are often a couple percent difference between these configurations, probably due to different hardware scheduling artifacts.
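
For instance, a sketch along these lines could be used to compare a few block sizes for the kernel alone (assuming pDevMat1 already holds the device data and size is the element count, as in the code posted above; the grid is rounded down so the unguarded AddPixel kernel stays in bounds):

// Sweep a few thread-block sizes and time just the kernel with CUDA events.
int candidates[] = { 128, 256, 384, 512 };
for (int c = 0; c < 4; c++)
{
	int threads = candidates[c];
	int blocks  = (int)(size / threads);  // round down to stay within the array
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start);
	AddPixel<<<blocks, threads>>>(pDevMat1);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("%4d threads/block: %.3f ms\n", threads, ms);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}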

You would also want to try processing multiple pixels at a time, by using the short-vector types ‘int2’ or ‘int4’. The use of 64-bit or 128-bit loads and stores allows the GPU to exploit a somewhat higher percentage of the theoretical memory bandwidth. This should improve the performance on both K600 and K620.
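
A minimal sketch of what such a vectorized kernel could look like (assuming the element count is a multiple of 4; the name AddPixel4 is just illustrative, mirroring the AddPixel example above):

// Each thread loads and stores one int4 (128 bits), i.e. four pixels per thread.
__global__ void AddPixel4(int4* pMat1)
{
	int X = blockIdx.x * blockDim.x + threadIdx.x;
	int4 v = pMat1[X];
	v.x += 25; v.y += 25; v.z += 25; v.w += 25;
	pMat1[X] = v;
}

// Launched over a quarter of the original element count, e.g.:
// AddPixel4<<<(size / 4) / 256, 256>>>(reinterpret_cast<int4*>(pDevMat1));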

I tried with different thread-block sizes also, but every time the K600 result was better.
And for the K620 I am using -arch=sm_50 only.

Regarding int2/int4: I tried that and am getting some improvement for the larger data sizes, but my initial problem remains - the K600 performance is still better than the K620.

Are these two GPUs in the same system, or in different systems? If they are in different systems, do the results change when you swap the GPUs between the systems? If they are in the same system, do you get different results when you physically swap these cards with regard to the PCIe slots they are located in?

Based on your posted code, it seems the execution time measurements comprise both copy transfer time from/to the GPU and kernel execution time. Have you tried separating these to see how they contribute to the small overall differences? Is it the kernel execution time or the copy time? The CUDA profiler can give you precise data regarding the actual kernel execution time, as well as the duration of the cudaMemcpy() calls.

In addition, the measured time appears to cover cudaMalloc() and cudaFree() calls, which is driver activity on the host rather than GPU activity. I would suggest moving these calls out of the code region for which you are measuring execution times, and to allocate once prior to the timed portion of the app and free once after the timed portion of the app.
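
As an illustration of that separation (a sketch only, reusing the names from the posted code; allocation happens once outside the timed region, and the copies and the kernel are timed separately with CUDA events):

// Allocate once, outside the timed region.
int* pDevMat1 = NULL;
size_t bytes = (size_t)size * sizeof(int);
cudaMalloc((void**)&pDevMat1, bytes);

cudaEvent_t t0, t1, t2, t3;
cudaEventCreate(&t0); cudaEventCreate(&t1);
cudaEventCreate(&t2); cudaEventCreate(&t3);

cudaEventRecord(t0);
cudaMemcpy(pDevMat1, pMatLaunch, bytes, cudaMemcpyHostToDevice);   // host -> device copy
cudaEventRecord(t1);
AddPixel<<<4096, 256>>>(pDevMat1);                                 // kernel
cudaEventRecord(t2);
cudaMemcpy(pMatLaunch, pDevMat1, bytes, cudaMemcpyDeviceToHost);   // device -> host copy
cudaEventRecord(t3);
cudaEventSynchronize(t3);

float h2dMs = 0, kernelMs = 0, d2hMs = 0;
cudaEventElapsedTime(&h2dMs, t0, t1);
cudaEventElapsedTime(&kernelMs, t1, t2);
cudaEventElapsedTime(&d2hMs, t2, t3);
printf("H2D %.3f ms, kernel %.3f ms, D2H %.3f ms\n", h2dMs, kernelMs, d2hMs);

cudaFree(pDevMat1);   // free once, after the timed region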

Both cards are used in the same system - I am just swapping the card.

Regarding the timing calculation: I measured the overall operation rather than just the kernel. I will check whether the copy or the kernel is taking more time and let you know.

As I stated before, I consider a 1.9% difference noise level and personally would not invest much time investigating the reason for such a difference. My recent suggestions are just targeted at eliminating sources of overhead that may skew your performance measurements. For example, time for copies between host and device can be impacted by different host memory configurations or PCIe slot configurations.

Using one physical system and changing nothing except the GPU is the kind of controlled experiment that eliminates many possible “skew factors”. In addition, it would be best to focus on pure kernel execution times as reported by the profiler, eliminating the overhead for host-side driver operations like malloc/free and host/device copies for the time being.

Even in that scenario, I would expect to see small differences, despite the code being memory bound on both GPUs and the GPU memory of the two cards offering nominally identical throughput. The two GPUs use different architectures. Scheduling differences may cause actual memory operations to be ordered differently as presented at the external DRAM interface, causing small differences in efficiency of access. The memory controller of the two GPUs could be using somewhat different DRAM timings, for example because the DRAM chips are from different manufacturers or otherwise have slightly different specifications.

I do not know whether these GPUs use auto-boost to dynamically adjust the GPU clocks. Different core clocks can have some influence on the efficiency of the memory interface, presumably due to internal queue management. There are probably many more second-order factors that I do not even know about.