Why is the GTX 295 slower than the FX1700?

hi all,

I'm now testing my programs on a computer with a GTX 295. The results are all slower than my old test results on the FX1700. Why?..

Without more information, this is unanswerable. What was the limiting factor in kernel performance before? FLOPS? Memory bandwidth? PCI-Express bandwidth? How much slower is the GTX 295 than the FX1700?

(Partition camping also seems to be a popular answer to problems these days, so maybe that is involved…)

Are you sure it is really running on the GTX 295 and not in some sort of emulation mode?
Use the profiler to see what is happening.
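For example, a minimal sketch using the standard runtime API that prints which device the program is actually running on (an emulation build will not report the real GPU name here):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&dev);                  // device the runtime is currently using
    cudaGetDeviceProperties(&prop, dev);
    // a real GTX 295 GPU reports compute capability 1.3 and 30 multiprocessors
    printf("Device %d: %s (compute %d.%d, %d MPs)\n",
           dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}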

It’s really likely you’re using HALF of your GTX295… it’s a dual GPU card and it’d be easy to accidentally use one GPU and leave the other GPU idle.
Is your app multi-GPU aware and is it finding and using both GPUs?
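A quick way to check is to count the devices the runtime sees; the two halves of a GTX 295 show up as two separate CUDA devices. A minimal sketch (note that in CUDA 2.3 each host thread can drive only one GPU, so using both requires two host threads):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);   // a GTX 295 shows up as 2 separate devices
    printf("%d CUDA device(s) found\n", count);
    // to use both halves, each worker thread picks its own device before
    // making any other CUDA call, e.g. cudaSetDevice(0) in one thread
    // and cudaSetDevice(1) in the other
    return 0;
}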

Even half a GTX 295 should be faster than an FX1700, I think…

I've tested the matrixMul SDK sample. On my computer with the FX1700 it needs about 0.815 ms; on the computer with the GTX 295 it needs 2.2 ms!

The FX1700 machine runs Windows XP, and the GTX 295 machine runs Windows 7. Maybe that is the reason?

Yes, I'm sure, emulation mode is deactivated…

Yes, and the FX1700 has just 4 multiprocessors while the 295 has 2 × 30… so why?

:angry:

I'll take matrixMul from CUDA SDK 2.3 as an example.

(1) In matrixMul.h:

#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
#define WB (8 * BLOCK_SIZE) // Matrix B width

With these sizes the grid is only (WC / BLOCK_SIZE) × (HC / BLOCK_SIZE) = 8 × 5 = 40 thread blocks, far too few to fill the GPU; you should try bigger dimensions, for example as sketched below.
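For example (hypothetical sizes, just to illustrate the scaling), in matrixMul.h:

// with the SDK's BLOCK_SIZE of 16 this yields a (128 x 80) grid of
// 10240 thread blocks instead of 8 x 5 = 40, enough to keep all 30
// multiprocessors of one GTX 295 GPU busy
#define WA (80 * BLOCK_SIZE)    // Matrix A width
#define HA (80 * BLOCK_SIZE)    // Matrix A height
#define WB (128 * BLOCK_SIZE)   // Matrix B width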

(2) The main reason is the timing code in matrixMul.cu:

	// create and start timer
	unsigned int timer = 0;
	cutilCheckError(cutCreateTimer(&timer));
	cutilCheckError(cutStartTimer(timer));

	// setup execution parameters
	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
	dim3 grid(WC / threads.x, HC / threads.y);

	// execute the kernel
	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

	// check if kernel execution generated an error
	cutilCheckMsg("Kernel execution failed");

	// copy result from device to host
	cutilSafeCall(cudaMemcpy(h_C, d_C, mem_size_C,
	                         cudaMemcpyDeviceToHost));

	// stop and destroy timer
	cutilCheckError(cutStopTimer(timer));
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	cutilCheckError(cutDeleteTimer(timer));

(1) You should exclude the warm-up time (the first kernel launch pays one-time initialization costs).

(2) You should not include the PCIe transfer in the timed region.

You can use the following code instead:

	// setup execution parameters
	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
	dim3 grid(WC / threads.x, HC / threads.y);

	// warm-up launch so one-time initialization costs are not timed;
	// synchronize so it has fully finished before the timer starts
	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
	cutilSafeCall(cudaThreadSynchronize());

	// create and start timer
	unsigned int timer = 0;
	cutilCheckError(cutCreateTimer(&timer));
	cutilCheckError(cutStartTimer(timer));

	// execute the kernel
	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

	// check if kernel execution generated an error
	cutilCheckMsg("Kernel execution failed");

	// kernel launches are asynchronous: wait for completion
	// before stopping the timer
	cutilSafeCall(cudaThreadSynchronize());

	// stop and destroy timer
	cutilCheckError(cutStopTimer(timer));
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	cutilCheckError(cutDeleteTimer(timer));

	// copy result from device to host (outside the timed region)
	cutilSafeCall(cudaMemcpy(h_C, d_C, mem_size_C,
	                         cudaMemcpyDeviceToHost));

But try bigger dimensions.

In my experiment, with square matrices of n = 4096, the kernel costs 693 ms whereas CUBLAS costs 372.5 ms.
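For reference, the CUBLAS number comes from the library's SGEMM. A minimal sketch using the legacy CUBLAS API of the 2.3 era (assuming d_A, d_B, d_C are n × n column-major matrices already on the device; the wrapper function name is mine):

#include <cublas.h>
#include <cuda_runtime.h>

void gemm_nn(int n, const float* d_A, const float* d_B, float* d_C)
{
    cublasInit();                    // context creation, keep outside timing
    // C = 1.0f * A * B + 0.0f * C, all n x n with leading dimension n
    cublasSgemm('N', 'N', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);
    cudaThreadSynchronize();         // the call is asynchronous, wait for it
    cublasShutdown();
}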

Thanks for your explanations and experiments!! To test it I have also used programs with different/bigger block dimensions. And whether the warm-up time is included does not matter for the comparison, because both computers run identical code…