A new Windows 7 OpenCL app, with manual workload distribution.

A new Windows 7 OpenCL app, with manual workload distribution tuning, across multiple GPUs.

(See post 137 for the download.)

You might find it interesting… ;)

http://forum.beyond3d.com/showthread.php?t=55913&page=6

It was made by my new favorite programmer, David Bucciarelli.

It works well: http://www.luxrender.net/wiki/index.php?ti…nder_and_OpenCL

I did more testing with the latest version, after manually adjusting each of my 3 GPUs to 33% utilization for optimum workload balance.

I began forcing different Work Group sizes using a .bat file, to see how it would affect my system’s performance.

Example of the .bat file used to force Work Group size 224:

smallptGPU.exe 1 1 224 1024 768 scenes\cornell.scn
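For anyone who wants to repeat the sweep without editing the .bat file by hand, here is a minimal Python sketch that generates one command line per work group size. It assumes that only the third argument (224 in the example above) is the work group size and leaves every other argument exactly as in that example; the helper name `build_command` is made up for illustration and is not part of smallptGPU.

```python
# Work group sizes that were tested in this post (the ones that ran).
WORK_GROUP_SIZES = [8, 16, 32, 64, 96, 128, 160, 192, 224, 256, 320, 384]


def build_command(work_group_size):
    """Return the smallptGPU command line for one work group size.

    Assumption: the third argument is the work group size, as in the
    example .bat line. All other arguments are copied unchanged from it.
    """
    return f"smallptGPU.exe 1 1 {work_group_size} 1024 768 scenes\\cornell.scn"


if __name__ == "__main__":
    # Print one line per size; each line can be pasted into its own .bat file.
    for size in WORK_GROUP_SIZES:
        print(build_command(size))
```

The sketch only prints the command lines; it does not launch the program or read back the Samples/sec figure, which still has to be noted from the application window.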

Size 8 -> 2,721k Samples/sec

Size 16 -> 2,967k Samples/sec

Size 32 -> 3,054k Samples/sec

Size 64 -> 4,915k Samples/sec

Size 96 -> 3,602k Samples/sec

Size 128 -> 4,451k Samples/sec

Size 160 -> 4,915k Samples/sec

Size 192 -> 5,041k Samples/sec

Size 224 -> 3,978k Samples/sec

Size 256 -> 4,321k Samples/sec

Size 320 -> 4,802k Samples/sec

Size 384 -> 5,173k Samples/sec

Size 448 -> Would not run

Size 512 -> Would not run

Size 576 -> Would not run

Just running the default ‘smallptGPU’ file, I get 5,173k Samples/sec

I did notice that when I run the default ‘smallptGPU’ file, it reports ‘Suggested work group size: 384’ in the DOS window.

That is also the work group size at which I get my best performance…
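To make the comparison easier to read, here is a small Python sketch that takes the numbers from the runs above, picks the fastest work group size, and expresses every result as a fraction of the best one. The data is copied from this post; the function names are just illustrative.

```python
# Thousands of samples/sec per forced work group size, copied from the runs above.
RESULTS_K_SAMPLES = {
    8: 2721, 16: 2967, 32: 3054, 64: 4915, 96: 3602, 128: 4451,
    160: 4915, 192: 5041, 224: 3978, 256: 4321, 320: 4802, 384: 5173,
}

# Sizes that would not run at all.
FAILED_SIZES = [448, 512, 576]


def best_size(results):
    """Return the work group size with the highest throughput."""
    return max(results, key=results.get)


def relative_to_best(results):
    """Return each throughput as a fraction of the best throughput."""
    best = results[best_size(results)]
    return {size: rate / best for size, rate in results.items()}


if __name__ == "__main__":
    print("Best work group size:", best_size(RESULTS_K_SAMPLES))
    for size, fraction in relative_to_best(RESULTS_K_SAMPLES).items():
        print(f"Size {size:>3}: {fraction:.0%} of best")
```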

I tried to increase the work group size further, but the program would not run.

[b]Question:[/b] Is it a known fact that Nvidia can’t allocate a larger work group size than 384? Just wondering…

If so, what is the limiting factor? GPU memory?

Second question: If we could further increase the Work Group size past 384, do you think we might see some additional performance?

Workgroup size (the equivalent of block size in CUDA) is limited by the resources the OpenCL code uses, so it will be different for every piece of code. The basic multiprocessor (MP) unit in NVIDIA GPUs has limits on workgroup size (512 is the current limit per workgroup, 768 or 1024 threads total per MP depending on hardware version), registers (128 per thread and 8192 or 16384 total per MP), and shared memory (16 KB per MP). How much of each of those resources the kernel uses dictates the maximum workgroup size. The only way to increase it is to make the code use fewer resources. Sometimes that helps performance, sometimes it doesn’t.
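As a rough illustration of how those limits combine, here is a simplified Python sketch. It takes the smallest of the per-workgroup limit, the register limit, and the shared memory limit, then rounds down to a multiple of the 32-thread warp. The rounding rule and the per-thread figures in the example are assumptions for illustration only: real hardware allocates registers in coarser blocks, and the actual register usage of the smallptGPU kernel is not known here.

```python
def max_work_group_size(regs_per_thread,
                        shared_bytes_per_thread=0,
                        regs_per_mp=16384,
                        shared_bytes_per_mp=16 * 1024,
                        hw_limit=512,
                        warp_size=32):
    """Estimate the largest workgroup that fits on one multiprocessor.

    Simplified model (an assumption, not the driver's real algorithm):
    each limit is computed independently and the result is rounded down
    to a whole number of warps.
    """
    limit = hw_limit
    # Registers: every thread in the workgroup needs its own set.
    limit = min(limit, regs_per_mp // regs_per_thread)
    # Shared memory: only matters if the kernel uses some per thread.
    if shared_bytes_per_thread > 0:
        limit = min(limit, shared_bytes_per_mp // shared_bytes_per_thread)
    # Round down to a multiple of the warp size (may yield 0 for tiny limits).
    return (limit // warp_size) * warp_size


if __name__ == "__main__":
    # Hypothetical kernel using 42 registers per thread (not a measured value).
    print(max_work_group_size(42))
    # Same hypothetical kernel on hardware with 8192 registers per MP.
    print(max_work_group_size(42, regs_per_mp=8192))
```

Under this model, a kernel that needs more registers per thread gets a smaller maximum workgroup size, which is why the limit differs from one piece of code to the next.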

That is totally hypothetical, and depends on the code for the reasons outlined above. Performance should improve up to a maximum as the workgroup size is increased, and then stay stable or even drop after that. Whether this code has reached that point is a question I can’t answer.

Thanks for your fast response…

One more OpenCL question.

If we were to get an updated version of OpenCL for Windows that delivered faster performance, would I see the extra speed in my current apps as soon as I install the new graphics driver?

Or would each OpenCL program first need to be recompiled using the latest SDK?