I just profiled the matrixMul example on my MacBook Pro (9600M GT) and on my desktop computer (GTX 285, Ubuntu Linux 10.04, 64-bit). Looking at the table I noticed that the CUDA kernels execute enormously faster on the desktop than on the MacBook (GPU time 81993.9 microseconds on the MacBook vs. 16.768 microseconds on the desktop for one kernel invocation!!). First question: is this possible, or is something wrong?
Wondering why that is, I noticed that the instruction counts also differ by an enormous amount. Why should the number of instructions (per thread, or over all threads?!) differ that much on different GPUs?
-using --fast-math option on one machine and not on the other?
-different sm_xx architecture specified?
But there really shouldn't be such a huge difference, hmm… Does the code calling the kernel choose grid and block sizes differently depending on the hardware? Maybe it queries the driver for the hardware's capabilities and uses a larger grid to achieve a higher level of parallelism on the GTX 285… Then the kernel finishes more quickly because it does less work per multiprocessor (which would match the lower instruction count you observed nicely).
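For illustration, here is a hedged Python stand-in for that idea. The real host code would be CUDA C (typically following a cudaGetDeviceProperties() query); the 16x16 block shape matches the matrixMul sample, but the matrix size below is hypothetical.

```python
# Sketch: derive the launch grid from the problem size and block shape.
# The 16x16 block matches the matrixMul sample; the matrix size used
# in the example call is hypothetical.
BLOCK = 16  # threads per block edge (16x16 = 256 threads per block)

def grid_dims(matrix_w, matrix_h, block=BLOCK):
    # One thread per output element, rounded up to whole blocks.
    return ((matrix_w + block - 1) // block,
            (matrix_h + block - 1) // block)

print(grid_dims(320, 640))  # hypothetical 320x640 matrix -> (20, 40)
```

A larger grid means more blocks to spread over the multiprocessors, which is what would let the GTX 285 exploit its higher parallelism.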
Ah yeah, on the Mac it's 32-bit (64-bit is not supported, IIRC), on the Linux machine 64-bit… but why should this affect floating-point performance on the GPU?
using --fast-math option on one machine and not on the other?
No.
But there really shouldn't be such a huge difference, hmm… Does the code calling the kernel choose grid and block sizes differently depending on the hardware? Maybe it queries the driver for the hardware's capabilities and uses a larger grid to achieve a higher level of parallelism on the GTX 285… Then the kernel finishes more quickly because it does less work per multiprocessor (which would match the lower instruction count you observed nicely).
Yes, it does this. But the grid size is bigger on the MacBook Pro; have a look at the attachment. I'm not sure what role the grid size plays in distributing threads to the SMs. However, the 9600M GT has just 4 SMs while the GTX 285 has 30, so much higher parallelization should be possible.
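To make the SM-count argument concrete, here is a small Python sketch. The SM counts are the ones mentioned above; the grid size is hypothetical, and occupancy (an SM running several blocks at once) is ignored.

```python
import math

# Sketch: a grid of thread blocks executes in "waves" of up to one
# block per SM (ignoring occupancy, which lets an SM host several
# blocks concurrently). SM counts from this thread:
# 9600M GT = 4, GTX 285 = 30. The grid size is hypothetical.
def min_waves(total_blocks, num_sms):
    return math.ceil(total_blocks / num_sms)

grid_blocks = 40 * 80  # hypothetical grid of 3200 blocks
print(min_waves(grid_blocks, 4))   # 800 waves on the 9600M GT
print(min_waves(grid_blocks, 30))  # 107 waves on the GTX 285
```

So for the same grid, the card with more SMs needs far fewer waves, i.e. far less sequential work per multiprocessor.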
Ouhh, I just saw that the matrix size is also calculated dynamically based on the GPU: on the MacBook it's 640 x 1280 and on the desktop 80 x 160!! :( That explains it; I thought it would be the same on any platform…
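The impact of those sizes can be sketched quickly in Python. That the inner matrix dimension scales by the same factor is an assumption here, not something the sample output confirms.

```python
# Matrix sizes reported by the sample on the two machines.
mb = (640, 1280)  # MacBook Pro (9600M GT)
pc = (80, 160)    # desktop (GTX 285)

# 64x more output elements per kernel launch on the MacBook...
elem_ratio = (mb[0] * mb[1]) / (pc[0] * pc[1])
print(elem_ratio)  # 64.0

# ...and if the inner dimension scales by the same 8x factor
# (an assumption), the floating-point work grows cubically.
scale = mb[0] // pc[0]
print(scale ** 3)  # 512
```

So the two machines were never running the same workload, which accounts for a large part of the timing gap on its own.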
Added -sizemult to match the size on the MacBook. Not sure why it gets a smaller size on a card with many more SMs (is this a bug?). The results now look far more reasonable.
The profiler shows results from one texture cluster only. Since the desktop card has many more of them, each cluster gets a smaller share of instructions to execute…
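Roughly, in Python: the TPC counts below are assumptions about the two chips (G9x groups 2 SMs per cluster, GT200 groups 3), as is the even distribution of work across clusters.

```python
# Sketch: if the profiler instruments a single texture processing
# cluster (TPC), the reported instruction count is roughly the total
# divided by the number of TPCs. TPC counts below are assumptions:
# 9600M GT (G96): 4 SMs in 2 TPCs; GTX 285 (GT200): 30 SMs in 10 TPCs.
def profiled_count(total_instructions, num_tpcs):
    # Assumes the blocks spread evenly over all clusters.
    return total_instructions // num_tpcs

total = 1_200_000  # hypothetical total instruction count
print(profiled_count(total, 2))   # 600000 (9600M GT)
print(profiled_count(total, 10))  # 120000 (GTX 285)
```

That would shrink the desktop's reported per-cluster counts even for an identical workload, on top of the smaller matrix size found above.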