Why is the GTX 295 slower than the FX1700?

hi all,

I'm now testing my programs on a computer with a GTX 295. The results are all slower than my old test results on the FX1700. Why?..

Without more information, this is unanswerable. What was the limiting factor in kernel performance before? FLOPS? Memory bandwidth? PCI-Express bandwidth? How much slower is the GTX 295 than the FX1700?

(Partition camping also seems to be a popular answer to problems these days, so maybe that is involved…)

Are you sure it is really running on the GTX 295 and not in some sort of emulation mode?
Use the profiler to see what is happening.
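For example, a minimal sketch using the standard runtime API that prints which device the program is actually running on (an emulation build will not report the real GPU name here):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&dev);                  // device the runtime is currently using
    cudaGetDeviceProperties(&prop, dev);
    // a real GTX 295 GPU reports compute capability 1.3 and 30 multiprocessors
    printf("Device %d: %s (compute %d.%d, %d MPs)\n",
           dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}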

It’s really likely you’re using HALF of your GTX295… it’s a dual GPU card and it’d be easy to accidentally use one GPU and leave the other GPU idle.
Is your app multi-GPU aware and is it finding and using both GPUs?
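A quick way to check is to count the devices the runtime sees; the two halves of a GTX 295 show up as two separate CUDA devices. A minimal sketch (note that in CUDA 2.3 each host thread can drive only one GPU, so using both requires two host threads):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);   // a GTX 295 shows up as 2 separate devices
    printf("%d CUDA device(s) found\n", count);
    // to use both halves, each worker thread picks its own device before
    // making any other CUDA call, e.g. cudaSetDevice(0) in one thread
    // and cudaSetDevice(1) in the other
    return 0;
}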

Even half a GTX 295 should be faster than an FX1700, I think…

I've tested the matrixMul SDK sample. On my computer with the FX1700 it needs about 0.815 ms; on the computer with the GTX 295 it needs 2.2 ms!

The FX1700 machine runs Windows XP, and the GTX 295 machine runs Windows 7. Maybe that is the reason?

Yes, I'm sure, emulation mode is deactivated…

Yes, and the FX1700 has just 4 multiprocessors while the 295 has 2 × 30… so why?

:angry:

I'll take matrixMul from CUDA SDK 2.3 as an example.

(1) In matrixMul.h:

#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
#define WB (8 * BLOCK_SIZE) // Matrix B width

With these sizes the grid is only (WC / BLOCK_SIZE) × (HC / BLOCK_SIZE) = 8 × 5 = 40 thread blocks, far too few to fill the GPU; you should try bigger dimensions, for example as sketched below.
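For example (hypothetical sizes, just to illustrate the scaling), in matrixMul.h:

// with the SDK's BLOCK_SIZE of 16 this yields a (128 x 80) grid of
// 10240 thread blocks instead of 8 x 5 = 40, enough to keep all 30
// multiprocessors of one GTX 295 GPU busy
#define WA (80 * BLOCK_SIZE)    // Matrix A width
#define HA (80 * BLOCK_SIZE)    // Matrix A height
#define WB (128 * BLOCK_SIZE)   // Matrix B width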

(2) The main reason is the timing code in matrixMul.cu:

	// create and start timer
	unsigned int timer = 0;
	cutilCheckError(cutCreateTimer(&timer));
	cutilCheckError(cutStartTimer(timer));

	// setup execution parameters
	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
	dim3 grid(WC / threads.x, HC / threads.y);

	// execute the kernel
	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

	// check if kernel execution generated an error
	cutilCheckMsg("Kernel execution failed");

	// copy result from device to host
	cutilSafeCall(cudaMemcpy(h_C, d_C, mem_size_C,
	                         cudaMemcpyDeviceToHost));

	// stop and destroy timer
	cutilCheckError(cutStopTimer(timer));
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	cutilCheckError(cutDeleteTimer(timer));

(1) You should exclude the warm-up time (the first kernel launch pays one-time initialization costs).

(2) You should not include the PCIe transfer in the timed region.

You can use the following code instead:

	// setup execution parameters
	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
	dim3 grid(WC / threads.x, HC / threads.y);

	// warm-up launch so one-time initialization costs are not timed;
	// synchronize so it has fully finished before the timer starts
	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
	cutilSafeCall(cudaThreadSynchronize());

	// create and start timer
	unsigned int timer = 0;
	cutilCheckError(cutCreateTimer(&timer));
	cutilCheckError(cutStartTimer(timer));

	// execute the kernel
	matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

	// check if kernel execution generated an error
	cutilCheckMsg("Kernel execution failed");

	// kernel launches are asynchronous: wait for completion
	// before stopping the timer
	cutilSafeCall(cudaThreadSynchronize());

	// stop and destroy timer
	cutilCheckError(cutStopTimer(timer));
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	cutilCheckError(cutDeleteTimer(timer));

	// copy result from device to host (outside the timed region)
	cutilSafeCall(cudaMemcpy(h_C, d_C, mem_size_C,
	                         cudaMemcpyDeviceToHost));

But try bigger dimensions.

In my experiment, with square matrices of n = 4096, the kernel costs 693 ms whereas CUBLAS costs 372.5 ms.
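For reference, the CUBLAS number comes from the library's SGEMM. A minimal sketch using the legacy CUBLAS API of the 2.3 era (assuming d_A, d_B, d_C are n × n column-major matrices already on the device; the wrapper function name is mine):

#include <cublas.h>
#include <cuda_runtime.h>

void gemm_nn(int n, const float* d_A, const float* d_B, float* d_C)
{
    cublasInit();                    // context creation, keep outside timing
    // C = 1.0f * A * B + 0.0f * C, all n x n with leading dimension n
    cublasSgemm('N', 'N', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);
    cudaThreadSynchronize();         // the call is asynchronous, wait for it
    cublasShutdown();
}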

Thanks for your explanations and experiments!! To test it I have also used programs with different/bigger block dimensions. And whether the warm-up time is included does not matter for the comparison, because both computers run identical code…