Extrapolate results to modern card

Anton_Burtsev · August 5, 2011, 3:58pm

Hello,

I’m evaluating CUDA on my old GeForce 9600GT. I’ve created a simple test and got a result. Can someone help me estimate what that result would be on modern devices such as the Tesla M2090 or better?

For the evaluation I generate an array of 32-bit integers and count those that have any of the bits specified by a mask set. I found that the 9600GT can process 4 different masks over 256 MB of numbers in 180 ms. What speed can be achieved on modern devices?

Here is my test code:

#include <stdio.h>
#include <stdlib.h>		// malloc, free
#include <time.h>		// clock, CLOCKS_PER_SEC

const int THREADS = 512;				// Number of threads for <<<1, XXX>>> clause
const int DATASIZE = 1024*1024*256;		// Data size in bytes to process

// Each thread counts the elements that have at least one bit of mask set,
// reading four integers at a time as an int4.
__global__ void processPage(unsigned int * page, int * Results, int N, int mask)
{
	int4 * vpage = (int4*)page;
	int c = 0;
	int max = N/4;
	for ( int i = threadIdx.x; i < max; i+=THREADS )
	{
		int4 v = vpage[i];
		// != 0 rather than > 0, so a match on the sign bit is also counted
		if ( (v.x & mask) != 0 ) c++;
		if ( (v.y & mask) != 0 ) c++;
		if ( (v.z & mask) != 0 ) c++;
		if ( (v.w & mask) != 0 ) c++;
	}
	Results[threadIdx.x] = c;		// One partial count per thread
}

int N;
int Results[THREADS];
int * cResults;
unsigned int * data;
unsigned int * cData;

void initTest()
{
	N = DATASIZE / sizeof(int);
	cudaMalloc(&cResults, THREADS*sizeof(int));
	cudaMalloc(&cData, N*sizeof(int));
	data = (unsigned int*)malloc(N*sizeof(int));
	for ( int i = 0; i < N; i++ )
		data[i]=i;
	cudaMemcpy(cData, data, N*sizeof(int), cudaMemcpyHostToDevice);
	free(data);
}

int doTest(int mask)
{
	processPage<<<1, THREADS>>>(cData, cResults, N, mask);
	// This copy also waits for the kernel to finish
	cudaMemcpy(Results, cResults, THREADS*sizeof(int), cudaMemcpyDeviceToHost);
	int c = 0;
	for ( int i = 0; i < THREADS; i++ )
		c += Results[i];
	return c;
}

void finishTest()
{
	cudaFree(cData);		// cudaFree takes the device pointer itself, not its address
	cudaFree(cResults);
}

int main()
{
	initTest();
	clock_t clc = clock();
	int r
		= doTest(87234)
		+ doTest(45786)
		+ doTest(923569726)
		+ doTest(51465123);
	clc = clock()-clc;
	finishTest();
	printf("%d - %dms\n", r, (int)(clc*1000/CLOCKS_PER_SEC));
	return 0;
}

luis-tec · August 5, 2011, 4:20pm

On 64-bit Gentoo Linux with a Tesla 2050, compiled without options and run four times:

267755520 - 70ms
267755520 - 80ms
267755520 - 70ms
267755520 - 70ms

Anton_Burtsev · August 5, 2011, 4:42pm

Great,
Thank you!

seibert · August 5, 2011, 6:57pm

I would be wary of using a benchmark with only 1 block. The newer cards can (generally) run more blocks simultaneously than the old cards, so it can be hard to extrapolate based on the performance of a single block.
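
To illustrate the point about single-block benchmarks, here is a minimal sketch of how the same count might be spread over many blocks with a grid-stride loop. It is not code from this thread: the kernel name countMasked, the block size of 256, the grid size of 64 and the reduced data size are all assumptions chosen for illustration, and the best launch configuration would have to be measured on each device. It uses plain scalar loads rather than the int4 loads of the original test.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical multi-block variant of processPage (name and layout are assumptions).
// Every thread strides over the array by the total number of threads in the grid
// and writes one partial count, which the host sums afterwards.
__global__ void countMasked(const unsigned int *data, unsigned int *partial,
                            size_t n, unsigned int mask)
{
    size_t tid    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    unsigned int c = 0;
    for (size_t i = tid; i < n; i += stride)
        if (data[i] & mask) c++;        // at least one masked bit is set
    partial[tid] = c;
}

int main()
{
    const size_t n = (size_t)1 << 24;   // 64 MiB of data (assumed; smaller than the 256 MB test)
    const int threads = 256;            // assumed block size
    const int blocks  = 64;             // assumed grid size; tune per device
    const unsigned int mask = 87234;    // first mask from the original test

    // Same fill pattern as the original test: data[i] = i
    std::vector<unsigned int> host(n);
    for (size_t i = 0; i < n; i++) host[i] = (unsigned int)i;

    unsigned int *dData = nullptr, *dPartial = nullptr;
    const size_t nPartial = (size_t)blocks * threads;
    if (cudaMalloc(&dData, n * sizeof(unsigned int)) != cudaSuccess ||
        cudaMalloc(&dPartial, nPartial * sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dData, host.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    countMasked<<<blocks, threads>>>(dData, dPartial, n, mask);

    // The blocking copy waits for the kernel and reports any launch or run error.
    std::vector<unsigned int> partial(nPartial);
    cudaError_t err = cudaMemcpy(partial.data(), dPartial,
                                 nPartial * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Sum the per-thread counts and compare against a CPU reference.
    unsigned long long gpu = 0, cpu = 0;
    for (size_t i = 0; i < nPartial; i++) gpu += partial[i];
    for (size_t i = 0; i < n; i++) if (host[i] & mask) cpu++;

    printf("gpu=%llu cpu=%llu %s\n", gpu, cpu, gpu == cpu ? "match" : "MISMATCH");

    cudaFree(dData);
    cudaFree(dPartial);
    return gpu == cpu ? 0 : 1;
}
```

Timing this with several values of blocks (for example 1, 16, 64, 256) on the same card should show how much of the measured time comes from the launch configuration rather than from the hardware, which is what makes a single-block number hard to extrapolate.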
