Why is the 8800 faster?

Hello,

I compile the same code for Mac OS X (10.5.5) and Windows XP (SP3).
The Mac Pro has a GeForce 8800 GT, two quad-core CPUs and 8 GB of RAM.
The first PC has a GeForce GTX 285, a quad-core CPU and 4 GB of RAM.
The second PC has a GeForce 8800 GTX, a dual-core CPU and 4 GB of RAM.

Why does the code run faster on the Mac Pro and the 2nd PC? The video card in the first PC is really more powerful!

Thx for your help.

Etienne

I still have this performance problem that I cannot understand. What are the
points I should look at?

Thank you

Can someone help me?

What sort of time differences are we talking of here? And have you checked that you are using the newest Nvidia drivers on them all?

I would guess you have some accidental double precision floating point which is slowing down the GTX285.
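
A minimal sketch of the kind of thing I mean (a hypothetical kernel, not your code): a bare literal like 0.1 is a double, so on a compute 1.3 card like the GTX 285 the arithmetic really is done in double precision, while the compute 1.0/1.1 8800 cards demote it to single precision (with a compiler warning), which can make the older card look faster.

[codebox]
// Hedged sketch of "accidental double precision" (hypothetical kernel, not the
// poster's code). The bare literal 0.1 is a double, so on a compute 1.3 card
// (GTX 285) the multiply is done in double precision; compute 1.0/1.1 cards
// (the 8800s) demote it to single precision with a compiler warning.
__global__ void scale_slow(float *x, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		x[i] = x[i] * 0.1;   // promoted to a double-precision multiply on GT200
}

__global__ void scale_fast(float *x, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		x[i] = x[i] * 0.1f;  // note the 'f' suffix: stays in single precision
}
[/codebox]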

But in all seriousness, how do you expect someone to help you? You have provided almost no useful information. What is your code? What CUDA and driver versions are you running on each platform? How do you compile it for each architecture? How do you measure the performance to conclude that one is faster than the other?

Thanks for your answer.

The PC with the GeForce 8800 Ultra is 20% faster than the GTX 285, and the Mac is 7% faster than the GTX 285.

Since I installed the newest driver (185.58), the GTX 285 is still slower.

The GF8800 cards have the recommended drivers for CUDA 2.1.

The Mac Pro has the kext provided with the 2.1 toolkit.

Are you running any graphics-related stuff? That would interfere with performance as well. But like avidday said, this is not enough information to help.
There are plenty of examples on this forum of people timing the wrong parts of their code, for instance. Maybe there is some CPU-related code also running? Does the application run long enough? Etc.

I have seen this effect with the 8800 GTS and the GTX 280 on code with only a little parallelism (fewer than 1000 threads). In that case it could be explained by the 8800 GTS being clocked at 1.6 GHz and the GTX 280 at 1.3 GHz. However, in your case, according to the NVIDIA page all cards should be clocked at 1.5 GHz, so this cannot be the reason.

Are you including memory copies to/from the card in your measurements? In the case of a short-running kernel with a lot of data to copy, the copy might dominate your execution time, and you would be seeing the effects of CPU RAM and/or PCIe transfer speed.
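
A minimal sketch of how to time the two parts separately, using CUDA event timers (the kernel name, buffers and sizes are placeholders, not your code):

[codebox]
// Hedged sketch: time the kernel and the device-to-host copy separately with
// CUDA events. myKernel, d_in, d_out, h_out, grid, block and N are placeholders.
cudaEvent_t start, mid, stop;
cudaEventCreate(&start);
cudaEventCreate(&mid);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<< grid, block >>>(d_in, d_out, N);
cudaEventRecord(mid, 0);                 // recorded after the kernel in stream 0
cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float kernelMs, copyMs;
cudaEventElapsedTime(&kernelMs, start, mid);
cudaEventElapsedTime(&copyMs, mid, stop);
printf("kernel: %.3f ms, copy: %.3f ms\n", kernelMs, copyMs);
[/codebox]

If the copy time is the larger of the two, you are mostly benchmarking the PCIe bus and host memory, not the GPU.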

Sorry, here is more information.

Yes, I include the data copy in my measurement. I will try measuring without it.

The first kernel is a reduction based on the Mark Harris algorithm (Reduction #6: Completely Unrolled).

The second kernel is just an access to a texture in linear filter mode to do bilinear interpolation on a picture.

Here is the code of the 1st kernel:

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

/******************************************************
 * REDUCTION
 ******************************************************/
__device__ void reduction(unsigned int blockSize, float *sdata, int tid)
{
	if (blockSize >= 512) {
		if (tid < 256)
			sdata[tid] += sdata[tid + 256];
		__syncthreads();
	}
	if (blockSize >= 256) {
		if (tid < 128)
			sdata[tid] += sdata[tid + 128];
		__syncthreads();
	}
	if (blockSize >= 128) {
		if (tid < 64)
			sdata[tid] += sdata[tid + 64];
		__syncthreads();
	}
	// last warp: fully unrolled, no __syncthreads() within a warp
	if (tid < 32) {
		if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
		if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
		if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
		if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
		if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
		if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
	}
}

/******************************************************
 * SUM - float input
 ******************************************************/
__global__ void sum(float *d_in, float *d_out, unsigned int blocksize)
{
	extern __shared__ float sdata[];
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*(blocksize*2) + tid;

	// each thread loads and adds two elements
	sdata[tid] = d_in[i] + d_in[i+blocksize];
	__syncthreads();

	reduction(blocksize, sdata, tid);

	if (tid == 0)
		d_out[blockIdx.x] = sdata[0];
}

/******************************************************
 * Computes the sum
 ******************************************************/
double somme(dim3 dimGrid, dim3 dimBlock, int smemSize, float* d_in, float* d_out, float* h_out, size_t size) {
	int threads = dimBlock.x*dimBlock.y;
	sum<<< dimGrid, dimBlock, smemSize >>>(d_in, d_out, threads);
	cudaThreadSynchronize();
	cudaMemcpy( h_out, d_out, size, cudaMemcpyDeviceToHost);
	cudaThreadSynchronize();

	// final accumulation of the per-block partial sums on the CPU
	double somme = 0;
	for ( int i = 0; i < dimGrid.x; i++ )
		somme += (double)h_out[i];
	return somme;
}

/******************************************************
 * Reading the raw files
 ******************************************************/
void readRaw(float *tab, size_t count, char *fileName) {
	FILE * pFile;
	char * buffer;
	size_t result;

	pFile = fopen ( fileName , "rb" );
	if (pFile==NULL) {fputs ("File error",stderr); exit (1);}

	// allocate memory to contain the whole file:
	buffer = (char*) malloc (sizeof(char)*count);
	if (buffer == NULL) {fputs ("Memory error",stderr); exit (2);}

	// copy the file into the buffer:
	result = fread (buffer, 1, count, pFile);
	if (result != count) {fputs ("Reading error",stderr); exit (3);}

	// copy the buffer into the array
	for (int i = 0; i < (int)count; i++) {
		tab[i] = (float)buffer[i];
	}

	fclose (pFile);
	free (buffer);
}

/******************************************************
 * Main
 ******************************************************/
int main(int argc, char** argv) {
	float		*h_data, *d_result, *d_data, *h_result;
	unsigned int hTimer;
	cutCreateTimer(&hTimer);

	h_result = (float *)malloc( ((512*512)/128/2)*sizeof(float));
	h_data	= (float *)malloc(512*512*sizeof(float));
	readRaw(h_data, (512*512), "../image_raw_cuda/imageREF512x512_37deg_150.raw");

	cudaMalloc( (void **)&d_result, ((512*512)/128/2)*sizeof(float));
	cudaMalloc( (void **)&d_data, ((512*512)*sizeof(float)));
	cudaMemcpy( d_data, h_data, 512*512*sizeof(float), cudaMemcpyHostToDevice );

	dim3	blockSizeEXT(128);
	dim3	gridSizeEXT((512*512)/128/2);
	int		smemSizeRED = 128*sizeof(float);

	cutResetTimer(hTimer);
	cutStartTimer(hTimer);
	for (int i=0; i<10000; i++)
		somme(gridSizeEXT, blockSizeEXT, smemSizeRED, d_data, d_result, h_result, ((512*512)/128/2)*sizeof(float));
	cutStopTimer(hTimer);

	double gpuTime = cutGetTimerValue(hTimer);
	printf("Computation time: %f msec\n", gpuTime/10000);
	system("PAUSE");
}
[/codebox]
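
A remark on the reduction code above, separate from the timing question: the fully unrolled last-warp stage relies on each store to sdata[] being visible to the neighbouring threads without a __syncthreads(). Newer versions of the SDK reduction sample mark the shared-memory pointer volatile for that reason; a sketch of just that stage with the qualifier added (same logic as above, only the pointer type changes):

[codebox]
// Sketch of the last-warp stage with a volatile shared-memory pointer, as in
// newer versions of the SDK reduction sample; this stops the compiler from
// keeping sdata[] values in registers between the unrolled steps.
__device__ void warpReduce(volatile float *sdata, unsigned int blockSize, int tid)
{
	if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
	if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
	if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
	if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
	if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
	if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
}
[/codebox]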

2nd kernel:

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

#define IMUL(a,b) __mul24(a,b)

// Declaration of a 2D texture
texture<float, 2, cudaReadModeElementType> dataTest_tex;

texture<float, 2, cudaReadModeElementType> &getTexture() { return dataTest_tex; }

static inline __device__ texture<float, 2, cudaReadModeElementType> &getDeviceTexture() { return dataTest_tex; }

__global__ void bilinear(float *d_out, int dimIn, int dimOut, float ax, float ay, float bx, float by, float cx, float cy) {
	int ix		= IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	int iy		= IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	// affine transform of the output pixel coordinates into the source image
	float xp	= ax*ix + bx*iy + cx;
	float yp	= ay*ix + by*iy + cy;
	/* int w1	= int(xp);
	int h1		= int(yp); */
	d_out[iy*dimOut + ix] = tex2D(getDeviceTexture(), xp, yp);
}

/******************************************************
 * Reading the raw files
 ******************************************************/
void readRaw(float *tab, size_t count, char *fileName) {
	FILE * pFile;
	char * buffer;
	size_t result;

	pFile = fopen ( fileName , "rb" );
	if (pFile==NULL) {fputs ("File error",stderr); exit (1);}

	// allocate memory to contain the whole file:
	buffer = (char*) malloc (sizeof(char)*count);
	if (buffer == NULL) {fputs ("Memory error",stderr); exit (2);}

	// copy the file into the buffer:
	result = fread (buffer, 1, count, pFile);
	if (result != count) {fputs ("Reading error",stderr); exit (3);}

	// copy the buffer into the array
	for (int i = 0; i < (int)count; i++) {
		tab[i] = (float)buffer[i];
	}

	fclose (pFile);
	free (buffer);
}

/******************************************************
 * Main
 ******************************************************/
int main(int argc, char** argv) {
	float		*h_data, *d_result;
	cudaArray	*d_data;
	unsigned int hTimer;
	cutCreateTimer(&hTimer);

	h_data	= (float *)malloc(1024*1024*sizeof(float));
	readRaw(h_data, (1024*1024), "../image_raw_cuda/imageTEST1024x1024.raw");

	cudaMalloc( (void **)&d_result, 512*512*sizeof(float));
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
	cudaMallocArray( &d_data, &channelDesc, 1024, 1024);
	cudaMemcpyToArray( d_data, 0, 0, (void*)h_data, 1024*1024*sizeof(float), cudaMemcpyHostToDevice);

	texture<float, 2, cudaReadModeElementType> &myTexture = getTexture();
	myTexture.filterMode = cudaFilterModeLinear;	// built-in bilinear interpolation
	myTexture.normalized = false;
	cudaBindTextureToArray( myTexture, d_data, channelDesc);

	dim3	blockSizeEXT(16,16);
	dim3	gridSizeEXT(512/16,512/16);

	cutResetTimer(hTimer);
	cutStartTimer(hTimer);
	for (int i=0; i<10000; i++)
	{
		bilinear<<< gridSizeEXT, blockSizeEXT >>>(d_result, 512, 512, 0.1, 0.1, 0.1, 0.1, 256, 256);
		cudaThreadSynchronize();
	}
	cutStopTimer(hTimer);

	double gpuTime = cutGetTimerValue(hTimer);
	printf("Computation time: %f msec\n", gpuTime/10000);
	system("PAUSE");
}
[/codebox]
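
A side note on the bilinear kernel, unrelated to the speed difference: with cudaFilterModeLinear and unnormalized coordinates, texel centres sit at integer + 0.5, so sampling texel (ix, iy) exactly needs a half-texel offset. A sketch of the same kernel with the offset applied (reusing dataTest_tex from above; whether this matters depends on how the transform coefficients were derived):

[codebox]
// Sketch only: with cudaFilterModeLinear and unnormalized coordinates, texel
// centres sit at integer + 0.5, so exact texel lookups need a half-texel offset.
__global__ void bilinear_centered(float *d_out, int dimOut,
                                  float ax, float ay, float bx, float by,
                                  float cx, float cy)
{
	int ix = blockDim.x * blockIdx.x + threadIdx.x;
	int iy = blockDim.y * blockIdx.y + threadIdx.y;
	float xp = ax*ix + bx*iy + cx;
	float yp = ay*ix + by*iy + cy;
	d_out[iy*dimOut + ix] = tex2D(dataTest_tex, xp + 0.5f, yp + 0.5f);
}
[/codebox]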

I can just say that the PC with the GTX 285 has a better processor (Core 2 Quad Q9550) and faster RAM than the PC with the GeForce 8. The Mac has better processors (two quad-core Intel Xeons) and more RAM (8 GB) than the PC with the GTX 285.

When we experimented with an 8800 GTS 512 and a GTX 280 (small number of kernels), we saw that the GTX has a higher overhead. The profiler clearly showed (from the difference between GPU and CPU run time) that the GTX consistently added 40 microseconds, while the smaller GTS only added 20 microseconds. For large problem sizes this doesn't matter, as the GTX has far more processors, but when the problem size goes down you can drop below the break-even point, where the 20 additional microseconds can't be compensated by the higher number of parallel threads.
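
One way to see that fixed per-launch cost on each of your cards is to time a large number of back-to-back launches of an empty kernel; a minimal sketch (my own helper, event-based timing):

[codebox]
// Hedged sketch for measuring per-launch overhead on each card:
// time many back-to-back launches of an empty kernel and divide;
// the result is dominated by the fixed launch cost.
__global__ void empty_kernel(void) {}

void measure_launch_overhead(void)
{
	const int N = 10000;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	empty_kernel<<< 1, 1 >>>();          // warm-up launch
	cudaThreadSynchronize();

	cudaEventRecord(start, 0);
	for (int i = 0; i < N; i++)
		empty_kernel<<< 1, 1 >>>();
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms;
	cudaEventElapsedTime(&ms, start, stop);
	printf("average launch overhead: %f microseconds\n", 1000.0f * ms / N);
}
[/codebox]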

131072 threads for kernel 1 and 262144 threads for kernel 2 is not very much. If you compare that with the 10000 memory copies and CPU calls, you are not really testing GPU performance.
Use just a few memory copies and millions of threads, and time that.
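
For the reduction in particular, one way to cut the copies down is to finish the sum on the GPU and copy back a single float; a sketch reusing the sum() kernel from the first codebox (d_final is a hypothetical one-element device buffer, and note that the host version accumulates in double while this sums in float):

[codebox]
// Hedged sketch: finish the reduction on the GPU instead of copying 1024
// partial sums back to the host every iteration. Reuses the sum() kernel from
// the first codebox; d_final is a hypothetical 1-float device buffer.
float *d_final;
cudaMalloc((void **)&d_final, sizeof(float));

// first pass: 1024 blocks of 128 threads, as in the original code
sum<<< 1024, 128, 128*sizeof(float) >>>(d_data, d_result, 128);
// second pass: one block of 512 threads folds the 1024 partial sums into one
sum<<< 1, 512, 512*sizeof(float) >>>(d_result, d_final, 512);

// only one float crosses the PCIe bus per reduction
float h_sum;
cudaMemcpy(&h_sum, d_final, sizeof(float), cudaMemcpyDeviceToHost);
[/codebox]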


Thanks for your answer.

I understand what you say, but I can't change the number of threads because my final code may use even fewer threads. I work on images and I can't choose their dimensions myself.

I just want to know why the 8800 is faster. The same code is executed, so the 8800 performs the same number of memory copies and CPU calls, you know what I mean?

Then hopefully your kernels will get more complex, otherwise memory bandwidth will be your bottleneck. GPGPU is in general much more efficient with many threads and many calculations per thread.
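
As a rough back-of-the-envelope check (my numbers, assuming on the order of 100 GB/s of device memory bandwidth): the reduction reads 512*512 floats = 1 MB and writes about 4 KB per launch, which is roughly 10 microseconds of memory traffic. That is the same order of magnitude as the per-launch overheads mentioned above, and each iteration of your loop adds a device-to-host copy of the partial sums plus a 1024-element CPU loop on top, so the loop mostly measures fixed overheads and the host side rather than GPU throughput.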

The 8800 is not faster. Apparently the bus/motherboard/CPU/RAM/etc. combination is faster in that machine. What happens if you swap the hardware?

If I can I will try that.

I found the problem…

It comes from Visual Studio. When I launch my software, I use the F5 command in Visual Studio. I think Visual Studio does some checks or other things while the program is running, even when I launch it in Release mode. When I launch the .exe myself, the program is 50% faster than when I launch it from Visual Studio…

Thank you for your help.