Profiling the matrixMul example: why does the number of instructions vary on different hardware?

I just profiled the matrixMul example on my MacBook Pro (9600M GT) and my desktop computer (GTX 285, Ubuntu Linux 10.04, 64-bit). Looking at the table I noticed that the CUDA kernels execute enormously faster on the desktop than on my MacBook (GPU time 81993.9 microseconds on the MacBook vs. 16.768 microseconds on the desktop for one kernel invocation!). First question: is this possible, or is something wrong?

Wondering why that is, I noticed that the instruction counts also differ by an enormous amount. Why should the number of instructions (per thread, or across all threads?) differ that much on different GPUs?

Some things to check:

- compiled in 32-bit vs. 64-bit mode?
- using the --fast-math option on one machine but not the other?
- a different sm_xx architecture specified?

But that really shouldn't make such a huge difference, hmm… Does the code calling the kernel choose different grid and block sizes depending on the hardware? Maybe it queries the driver for the device's capabilities and uses a larger grid to achieve a higher degree of parallelism on the GTX 285… Then the kernel finishes more quickly because it does less work per multiprocessor (which would match the lower instruction count you observed nicely).
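
Just to illustrate what I mean, here is a sketch of how host code can query the device and scale the launch accordingly. This is not the SDK sample's actual code; the blocks-per-SM heuristic is made up for illustration.

// Sketch only: scale the amount of launched work with the device's SM count.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    dim3 block(16, 16);                         // 256 threads per block
    int blocksPerSM = 8;                        // made-up heuristic, not the SDK's policy
    int totalBlocks = prop.multiProcessorCount * blocksPerSM;

    printf("%s: %d SMs -> launch %d blocks of %dx%d threads\n",
           prop.name, prop.multiProcessorCount, totalBlocks,
           (int)block.x, (int)block.y);
    return 0;
}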

Maybe the laptop is running the GPU in a low power mode.

  • compiled in 32-bit vs. 64-bit mode?

Ah yeah, on the Mac it's 32-bit (64-bit is not supported, IIRC), and on the Linux machine 64-bit… but why should this affect floating-point performance on the GPU?

  • using the --fast-math option on one machine but not the other?

No.

But that really shouldn't make such a huge difference, hmm… Does the code calling the kernel choose different grid and block sizes depending on the hardware? Maybe it queries the driver for the device's capabilities and uses a larger grid to achieve a higher degree of parallelism on the GTX 285… Then the kernel finishes more quickly because it does less work per multiprocessor (which would match the lower instruction count you observed nicely).

Yes, it does. But the grid size is actually bigger on the MacBook Pro, have a look at the attachment. I'm not sure what role the grid size plays in how thread blocks are distributed to the SMs. However, the 9600M GT has just 4 SMs while the GTX 285 has 30, so much higher parallelism should be possible.
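
As a back-of-the-envelope check (ignoring occupancy limits, and with the block count just assumed for illustration), fewer SMs simply means more blocks, and therefore more instructions, executed per SM:

// Rough estimate only: for the same grid, a GPU with fewer SMs runs more
// blocks per SM, so each SM executes a larger share of the instructions.
#include <cstdio>

int main()
{
    const int gridBlocks = 3200;    // assumed number of blocks, for illustration only

    printf("9600M GT (4 SMs) : ~%d blocks per SM\n", gridBlocks / 4);   // 800
    printf("GTX 285 (30 SMs) : ~%d blocks per SM\n", gridBlocks / 30);  // ~106
    return 0;
}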

It would use the 9400M then. I think.

Ouch, I just saw that the matrix size is also calculated dynamically based on the GPU: on the MacBook it's 640 x 1280, and on the desktop it's 80 x 160! :( Which explains it; I thought it would be the same on any platform…

I added -sizemult to match the size on the MacBook; not sure why it picks a smaller size on a card with many more SMs (is this a bug?). The results now look far more reasonable.
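
For reference, the sizing presumably happens along these lines; this is just a guess at the kind of code the sample uses, not the actual SDK source, and the default multiplier and variable names are made up:

// Guess at the sample's device-dependent sizing (not the real SDK code):
// matrix dimensions grow with a multiplier that -sizemult can override.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int block_size = 16;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int sizeMult = 5;                          // made-up default; the real sample
                                               // presumably derives it from the device
    if (argc > 1) sizeMult = atoi(argv[1]);    // stand-in for -sizemult=N

    int widthA  = block_size * sizeMult;       // 80 with the default, 640 with 40
    int heightA = 2 * block_size * sizeMult;   // 160 with the default, 1280 with 40

    printf("%s: matrix A is %d x %d\n", prop.name, widthA, heightA);
    return 0;
}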

Thx for your quick answers!
NvidiaMatrixMulGTX285_640x1280.csv.zip (1.08 KB)

The profiler shows results from one texture cluster only. Since the desktop card has many more of them, each cluster gets a smaller share of instructions to execute…

What’s a texture cluster in CUDA speech?

A group of N multiprocessors, where N varies between hardware generations (IIRC 2 for compute 1.0/1.1 and 3 for compute 1.3).
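
If you want to see this for your own card, something like the following maps the SM count to clusters. The 2-vs-3 values are taken from my IIRC above, not queried from the API, so treat them as an assumption:

// Illustration of the grouping described above: SMs per texture cluster
// by compute capability (2 for 1.0/1.1, 3 for 1.2/1.3 -- values assumed
// from the post above, not reported by the runtime).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int smPerCluster = (prop.major == 1 && prop.minor >= 2) ? 3 : 2;
    int clusters     = prop.multiProcessorCount / smPerCluster;

    printf("%s: compute %d.%d, %d SMs -> %d SMs per cluster, %d clusters\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, smPerCluster, clusters);
    return 0;
}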
