OpenCL execution on K4000 - slow execution and very laggy user interface response

Joggel · November 24, 2014, 4:06pm

I am working on medical software for MRI brain segmentation. The software is developed mainly for cross-platform execution on AMD and Nvidia GPUs. I get a speed-up of around 3x to 4x, depending on the system used.
For testing on an Nvidia system, I have access to a CUDA machine with a Tesla C2075 for compute and an AMD FirePRO V5900 for rendering. There the software executes without any problems, and the speed-up is very good.
Since deploying the software on a different system with a Quadro K4000 (GK106) used for both compute and graphics, I have been seeing strange behavior. The software's execution is very slow, and the whole user interface becomes very laggy and unresponsive. The system runs Windows 7. My first impression is that the GK106 doesn't have Hyper-Q and cannot launch multiple tasks simultaneously on the GPU. Even so, I would expect only the UI to feel laggy while the software itself runs without much of a penalty, but this is not the case. The software uses many load and store queues for execution because it is an EM algorithm which utilizes many statistical and stochastic methods.

We are now discussing adding another GPU for general graphics processing to overcome this problem, but as long as the cause is not known, that could be wasted money.

njuffa · November 24, 2014, 5:46pm

Not knowing anything about the differences between the systems, the performance ratio observed, or your application, your observations could be due to any number of things. If you suspect the issue is GPU performance, I would suggest trying the profiler (not sure whether this works with OpenCL as I am a CUDA user).

From what I understand, you are comparing a high-end system that has a dedicated compute GPU plus a separate card for rendering with a mid-range system that has a single GPU used for both compute and rendering? Since the GPU can do either computation or rendering at any given moment, long-running compute kernels, or just a heavy compute load on the GPU in general, can lead to a laggy UI.

Does your application use any double-precision computations? The double-precision throughput of the GK106 is significantly lower than that of the C2075 (C2075: 515 GFLOPS DP; I think the K4000 has 103 GFLOPS DP but I cannot find a reliable source right now). The specified memory bandwidth of the C2075 is a tad higher than that of the K4000 (144 GB/sec vs 134 GB/sec).

[Later: I cannot find comprehensive specifications for GK106 on NVIDIA’s website, but at least two third-party resources state that the DP:SP throughput ratio for GK106 is 1:24. Since the K4000 is listed as providing 1245 GFLOPS SP, this would mean only 52 GFLOPS DP.]
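The estimate above is simple arithmetic, sketched here in Python for anyone who wants to plug in other numbers. The figures are the ones quoted in this post; the 1:24 ratio comes from third-party sources and is not confirmed by NVIDIA, and the function name `dp_throughput` is made up for this illustration.

```python
# Figures quoted in the post above. The 1:24 DP:SP ratio is taken from
# third-party sources and should be treated as unconfirmed.
K4000_SP_GFLOPS = 1245.0
GK106_DP_TO_SP_RATIO = 1 / 24
C2075_DP_GFLOPS = 515.0


def dp_throughput(sp_gflops, dp_to_sp_ratio):
    """Estimate peak double-precision throughput from the SP peak."""
    return sp_gflops * dp_to_sp_ratio


k4000_dp = dp_throughput(K4000_SP_GFLOPS, GK106_DP_TO_SP_RATIO)
dp_advantage = C2075_DP_GFLOPS / k4000_dp

print(f"K4000 DP estimate: {k4000_dp:.1f} GFLOPS")        # 51.9 GFLOPS
print(f"C2075 advantage in DP: {dp_advantage:.1f}x")      # 9.9x
```

If the 1:24 ratio is right, a kernel bound by double-precision arithmetic could run roughly ten times slower on the K4000, which is why the question about DP usage matters.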

Joggel · November 24, 2014, 6:25pm

Thank you very much for the answer.

First, my software doesn’t use any DP computations, so this can’t be the reason for the low performance.
I’ll try to describe the general structure of the software’s algorithm without going into too much detail. It is an EM (Expectation-Maximization) algorithm and therefore relies heavily on statistical methods such as calculating variances and means. It is a voxel clustering algorithm based on spatial neighborhood information and a Gaussian mixture model.

Some major steps:

1. Cluster the voxel field according to the parameters
2. Update the parameters (variances, means, probability fields, etc.)
3. Sync the parameters to the CPU and finalize them
4. Sync back to the GPU
5. Go back to 1 until a certain iteration depth is reached
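For readers unfamiliar with EM, the loop described above can be sketched in plain Python for a one-dimensional Gaussian mixture. This is only an illustration and not the FAST-CL implementation: it runs entirely on the CPU, ignores spatial neighborhood information, and every name in it (`em_gmm_1d`, `gaussian_pdf`, the synthetic intensities) is hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(values, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with a fixed number of EM iterations."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # "Cluster" step: soft-assign every value to each class.
        resp = []
        for x in values:
            likes = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(likes) or 1e-300  # guard against underflow to zero
            resp.append([like / total for like in likes])
        # "Update parameters" step: re-estimate mean, variance and weight per class.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty class, keep its previous parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # avoid a collapsed class
            weights[j] = nj / len(values)
    return means, variances, weights


# Synthetic "intensities" drawn from two tissue classes (made-up numbers).
random.seed(0)
data = [random.gauss(30.0, 5.0) for _ in range(500)]
data += [random.gauss(100.0, 8.0) for _ in range(500)]

fitted_means, fitted_vars, fitted_weights = em_gmm_1d(
    data, means=[20.0, 120.0], variances=[100.0, 100.0], weights=[0.5, 0.5]
)
print("means:", [round(m, 1) for m in fitted_means])
print("weights:", [round(w, 2) for w in fitted_weights])
```

In a GPU version, the per-value work in the cluster step is what would typically run as a kernel, while the per-class reductions are the part that has to be synchronized with the CPU on every iteration.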

I can also run the software on an HP ZBook 14, which contains a FirePRO M4100. Judging by the specifications alone, the K4000 should outperform my notebook by a wide margin.

I don’t know how to profile OpenCL on Nvidia GPUs. I can tell you more tomorrow…I hope.

njuffa · November 24, 2014, 7:06pm

The biggest difference seems to be sharing a GPU for computation and rendering vs using a dedicated compute GPU. I have a Windows machine with a Quadro here, and during Folding@Home runs that tax the GPU heavily the GUI becomes quite sluggish.

You would want to perform some controlled experiments (where only one variable changes at a time) to get to the bottom of this. Can you replace the C2075 in system #1 with the Quadro K4000? The results of that swap would tell you whether the issue has anything directly to do with the different GPU.

Given that your application uses single-precision computation only, and that SP throughput and memory throughput of the two devices are quite close, one might expect performance differences to be within +/-20%.
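As a rough check of that expectation, the relative differences can be computed from the peak figures. The bandwidth numbers and the K4000 SP figure are the ones quoted earlier in the thread; the C2075 SP peak of 1030 GFLOPS is an assumption (twice its quoted DP peak) and does not appear in the thread itself. The helper `relative_difference` is made up for this sketch.

```python
# Peak figures: bandwidths and K4000 SP are quoted in the thread;
# the C2075 SP value is ASSUMED to be twice its quoted DP peak (515 GFLOPS).
C2075 = {"sp_gflops": 1030.0, "bandwidth_gbs": 144.0}
K4000 = {"sp_gflops": 1245.0, "bandwidth_gbs": 134.0}


def relative_difference(value, reference):
    """Difference of value versus reference, in percent of the reference."""
    return (value - reference) / reference * 100.0


sp_diff = relative_difference(K4000["sp_gflops"], C2075["sp_gflops"])
bw_diff = relative_difference(K4000["bandwidth_gbs"], C2075["bandwidth_gbs"])

print(f"K4000 vs C2075, SP throughput:    {sp_diff:+.0f}%")   # +21%
print(f"K4000 vs C2075, memory bandwidth: {bw_diff:+.0f}%")   # -7%
```

Peak numbers say nothing about how well a particular kernel uses either device, so these percentages only bound what the hardware difference alone could plausibly explain.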

Joggel · November 25, 2014, 4:34pm

So, I was able to make some traces with GPUView/WPT. Here’s the link:

For my FirePRO M4100 there is no problem, but for the Quadro K4000 it seems that nearly every DMA Packet is accompanied by a Preemption Packet. I don’t know where this comes from; it’s strange. Does this come from using the GPU for compute and graphics simultaneously? By the way, the executables are named FAST-CL.exe and FAST-CL_GPU.exe, respectively.

njuffa · November 25, 2014, 6:19pm

I have no insight into this as I am not familiar with the tool, and am likewise not familiar with the “DMA Packet” and “Preemption Packet” terminology used.
