OpenCL execution on K4000 - slow execution and very laggy user interface response

I am working on medical software for MRI brain segmentation. The software is developed mainly for cross-platform execution on AMD and Nvidia GPUs. I get a speedup of around 3 to 4, depending on the system used.
For testing on an Nvidia system, I have access to a CUDA machine with a Tesla C2075 for compute and an AMD FirePRO V5900 for rendering. There the software runs without any problems and the speedup is very good.
After deploying the software on a different system with a Quadro K4000 (GK106) used for both compute and graphics, I see strange behavior. The software's execution is very slow, and the whole user interface becomes very laggy and unresponsive. The system runs Windows 7. My first impression is that the GK106 doesn't have Hyper-Q and cannot launch multiple tasks simultaneously on the GPU. Even so, I would expect only the UI to feel laggy while the software itself runs without much of a penalty, but this is not the case. The software uses many load and store queues for execution because it is an EM algorithm that uses many statistical and stochastic methods.

We are now discussing adding a second GPU for general graphics processing to work around this problem, but as long as the cause is unknown, that could be wasted money.

Not knowing anything about the differences between the systems, the performance ratio observed, or your application, your observations could be due to any number of things. If you suspect the issue is GPU performance, I would suggest trying the profiler (not sure whether this works with OpenCL as I am a CUDA user).

From what I understand, you are comparing a high-end system that has a dedicated compute GPU plus a separate card for rendering against a mid-range system with a single GPU used for both compute and rendering? Since the GPU can do either computation or rendering at any given time, long-running compute kernels, or just a heavy compute load on the GPU in general, can lead to a laggy UI.

Does your application use any double-precision computations? The double-precision throughput of the GK106 is significantly lower than that of the C2075 (C2075: 515 GFLOPS DP; I think the K4000 has 103 GFLOPS DP but I cannot find a reliable source right now). The specified memory bandwidth of the C2075 is a tad higher than for K4000 (144 GB/sec vs 134 GB/sec).

[Later: I cannot find comprehensive specifications for GK106 on NVIDIA’s website, but at least two third-party resources state that the DP:SP throughput ratio for GK106 is 1:24. Since the K4000 is listed as providing 1245 GFLOPS SP, this would mean only about 52 GFLOPS DP.]
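For reference, the arithmetic behind that estimate can be written out as a short Python snippet. It only uses the figures quoted in this thread; the 1:24 ratio is the third-party number mentioned above, not an official specification, and the variable names are made up for illustration.

```python
# Figures quoted in the thread. The 1:24 DP:SP ratio comes from
# third-party sources and is treated here as an assumption.
K4000_SP_GFLOPS = 1245.0
GK106_DP_TO_SP_RATIO = 1 / 24
C2075_DP_GFLOPS = 515.0

# Estimated double-precision peak of the K4000.
k4000_dp_gflops = K4000_SP_GFLOPS * GK106_DP_TO_SP_RATIO

# How much higher the C2075's double-precision peak would be.
dp_advantage = C2075_DP_GFLOPS / k4000_dp_gflops

print(f"K4000 estimated DP peak: {k4000_dp_gflops:.1f} GFLOPS")
print(f"C2075 / K4000 DP ratio:  {dp_advantage:.1f}x")
```

Under these numbers the C2075 would have roughly ten times the double-precision peak, which only matters if the application actually uses double precision.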

Thank you very much for the answer.

First, my software doesn’t use any DP computations, so this can’t be the reason for the low performance.
I’ll try to describe the general structure of the software’s algorithm without going into too much detail. It is an EM (Expectation-Maximization) algorithm and therefore relies heavily on statistical methods such as calculating variances, means, etc. It is a voxel clustering algorithm based on spatial neighborhood information and a Gaussian mixture model.

Some major steps:

  1. Cluster the voxel field according to parameters
  2. Update parameters (variance, means, probability fields, etc…)
  3. Sync Parameters to CPU and finalize parameters
  4. Sync back to GPU
  5. Go back to 1 until a certain iteration depth is achieved
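To make the loop above concrete, here is a minimal CPU-only sketch in plain Python of an EM iteration for a one-dimensional Gaussian mixture over voxel intensities. It is not the actual FAST-CL code: it leaves out the spatial neighborhood information and the OpenCL kernels, and the function names, starting parameters, and test data are all hypothetical. Comments mark where the GPU/CPU synchronization of steps 3 and 4 would take place.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1D normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_segment(voxels, means, variances, weights, iterations=20):
    """Cluster voxel intensities with a plain Gaussian mixture model (no spatial prior)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = [[1.0 / k] * k for _ in voxels]
    for _ in range(iterations):
        # Step 1: cluster the voxels under the current parameters
        # (on the GPU this would be one work-item per voxel).
        resp = []
        for x in voxels:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p) or 1e-300  # guard against division by zero
            resp.append([pj / total for pj in p])
        # Steps 2 to 4: update the parameters. In a GPU version the sums
        # below would be device-side reductions whose results are copied
        # to the CPU, finalized, and copied back to the device.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty class, keep its old parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, voxels)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, voxels)) / nj
            variances[j] = max(var, 1e-6)  # keep the variance strictly positive
            weights[j] = nj / len(voxels)
        # Step 5: repeat until the iteration limit is reached.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, means, variances, weights

# Made-up intensities forming two well-separated classes.
voxels = [0.9, 1.1, 1.0, 0.95, 5.0, 5.2, 4.9, 5.1]
labels, means, variances, weights = em_segment(
    voxels, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(labels)
print(means)
```

The point of the sketch is the shape of the loop: every iteration contains at least one device-to-host and one host-to-device transfer, so the GPU receives many short bursts of work rather than one long kernel.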

I can run the software on an HP ZBook 14, which contains a FirePRO M4100. Just comparing the K4000 with my M4100, the K4000 should outperform my notebook by a wide margin.

I don’t know how to profile OpenCL on Nvidia GPUs. I can tell you more tomorrow…I hope.

The biggest difference seems to be sharing a GPU for computation and rendering vs using a dedicated compute GPU. I have a Windows machine with a Quadro here, and during Folding@Home runs that tax the GPU heavily the GUI becomes quite sluggish.

You would want to perform some controlled experiments (where only one variable changes at a time) to get to the bottom of this. Can you replace the C2075 in system #1 with the Quadro K4000? The results of that swap would tell you whether the issue has anything directly to do with the different GPU.

Given that your application uses single-precision computation only, and that the SP throughput and memory throughput of the two devices are quite close, one might expect performance differences to be within +/-20%.
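A quick back-of-the-envelope check of that estimate, as a Python snippet. The K4000 figures and both bandwidth figures are the ones quoted earlier in this thread. The C2075 single-precision figure of 1030 GFLOPS is an assumption here: it is twice the 515 GFLOPS DP figure, which is the usual DP:SP ratio for Fermi-based Tesla cards.

```python
# Peak figures. The C2075 SP value is an assumption (2 x its DP figure);
# everything else is quoted in the thread.
c2075 = {"sp_gflops": 1030.0, "bandwidth_gb_s": 144.0}
k4000 = {"sp_gflops": 1245.0, "bandwidth_gb_s": 134.0}

# Ratios of K4000 to C2075; 1.0 would mean identical peak numbers.
sp_ratio = k4000["sp_gflops"] / c2075["sp_gflops"]
bw_ratio = k4000["bandwidth_gb_s"] / c2075["bandwidth_gb_s"]

print(f"SP throughput, K4000 vs C2075:    {sp_ratio:.2f}x")
print(f"Memory bandwidth, K4000 vs C2075: {bw_ratio:.2f}x")
```

On paper the K4000 comes out about 21% ahead in single-precision peak and about 7% behind in memory bandwidth, so peak numbers alone do not explain a large slowdown.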

So, I was able to capture some traces with GPUView/WPT. Here’s the link:

For my FirePRO M4100 there is no problem, but on the Quadro K4000 it seems that nearly every DMA Packet is accompanied by a Preemption Packet. I don’t know where this comes from; it’s strange. Does this come from using the GPU for compute and graphics simultaneously? By the way, the executables are named FAST-CL.exe and FAST-CL_GPU.exe, respectively.

I have no insight into this as I am not familiar with the tool, and am likewise not familiar with the “DMA Packet” and “Preemption Packet” terminology used.