Slower on QUADRO 4000 vs QUADRO FX3800

Hello,

I’m using a QUADRO FX3800 and a QUADRO 4000.

I use an OpenCL program to perform bilinear debayering plus some basic color correction and gamma.

The same single OpenCL process running on both boards:
FX3800 = 22 ms
4000 = 41 ms

Has anyone else seen this kind of result?
Are there new recommendations that come with the new NVIDIA GPUs?

Thanks in advance.
Pascal

Note that the Quadro FX 3800, despite the lower number, is a more recent GPU than the FX 4000. While the first is GT200-based, the latter is NV40GL-based, so it’s not very surprising that the FX 3800 is faster / requires less time. For details see e.g. [1].

[1] http://en.wikipedia.org/wiki/Nvidia_Quadro#PCI_Express

Actually the Quadro 4000 is a Fermi device and is the more recent one. The 4000 has 256 CUDA cores vs 192 on the FX 3800, and I think something like 90 GB/s vs 51 GB/s memory bandwidth. If memory serves it is a GF100 architecture vs a GT200.

There are two things that you should check:

  1. I don’t recall if the 4000 has ECC memory or not. I think not, but if it does and it’s enabled, it can cause a 12.5% difference in speed or something like that.

  2. The different architecture means a different core/multiprocessor setup. The 4000 has 32 cores per multiprocessor instead of 8, which means a different expected occupancy. It also has an L1 cache sharing transistors with shared memory in a 16/48 KB split (configurable, although I’m not sure how under OpenCL). Shared memory has 32 banks with a two-cycle access time, which means that the first half of the warp can cause bank conflicts with the second half. Also, going via the L1 cache means that you are always reading 128 bytes per global memory access rather than a minimum of 32 bytes on the FX 3800. Global memory accesses are issued per full warp as well, rather than per half warp, if I’m not mistaken.
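To make the bank-conflict point concrete, here is a rough sketch in plain C that counts how many accesses in one group of threads land in the same shared-memory bank. It is a simplified model, not a description of the actual hardware scheduler: it assumes 32-bit words, that every thread touches a different word (no broadcast), and that consecutive work-items of a 16-wide work-group are packed row by row into a warp. The names `conflict_degree` and `tile_offsets` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 32

/* Worst-case number of accesses that hit the same bank.
   word_index[i] is the 32-bit word offset read by thread i.
   Assumption: all offsets are distinct, so no broadcast applies. */
static unsigned conflict_degree(const unsigned *word_index, size_t threads,
                                unsigned banks)
{
    unsigned hits[MAX_BANKS] = {0};
    unsigned worst = 0;

    assert(banks > 0 && banks <= MAX_BANKS);
    for (size_t i = 0; i < threads; ++i) {
        unsigned b = word_index[i] % banks;   /* bank owning this word */
        if (++hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}

/* Word offsets for `threads` consecutive work-items of a tile_w-wide
   work-group, each reading tile[y][x] from a tile whose rows are
   `pitch` words apart (hypothetical local-memory layout). */
static void tile_offsets(unsigned *out, size_t threads, unsigned tile_w,
                         unsigned pitch)
{
    assert(tile_w > 0);
    for (size_t i = 0; i < threads; ++i) {
        unsigned x = (unsigned)(i % tile_w);
        unsigned y = (unsigned)(i / tile_w);
        out[i] = y * pitch + x;
    }
}
```

With a 16-wide tile padded to a pitch of 17 words, a half warp (16 threads) against 16 banks comes out conflict-free in this model, while a full warp (32 threads, two tile rows) against 32 banks gives a 2-way conflict, because words 0 and 32 share bank 0. That is the "first half vs second half" effect in miniature.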

The three important things to do:

  1. Change the block sizes to be a multiple of 32 threads in width, rather than the 16 that was common on pre-Fermi cards.

  2. Check shared and global memory accesses to see that they match Fermi requirements.

  3. Check L1 cache misses to see if you are accessing memory as expected
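For the first point, a minimal sketch of how the host side might pick sizes, in plain C so it compiles without an OpenCL SDK. The helper names (`round_up`, `ndrange_for_image`) and the 32x8 work-group shape used below are only assumptions for illustration; the right shape for a debayer kernel has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to the next multiple of `multiple`. */
static size_t round_up(size_t value, size_t multiple)
{
    assert(multiple > 0);
    return (value + multiple - 1) / multiple * multiple;
}

/* Compute a 2D global size that covers a width x height image and is
   an exact multiple of the chosen local (work-group) size. */
static void ndrange_for_image(size_t width, size_t height,
                              const size_t local[2], size_t global[2])
{
    global[0] = round_up(width, local[0]);   /* x: multiple of 32 on Fermi */
    global[1] = round_up(height, local[1]);
}
```

With `local = {32, 8}`, a 1000x1001 image gives `global = {1024, 1008}`. The two arrays would then go to clEnqueueNDRangeKernel as global_work_size and local_work_size, and the kernel needs a bounds check so the padding work-items do not touch pixels outside the image.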

The best thing is to run your application under the profiler as a start and see where things break down.

Oh, right, I was misreading “Quadro 4000” as “Quadro FX 4000”, sorry.