OpenCL issues with OS X OpenCL

Hi there,

I have an issue with slow bandwidth in OpenCL on my Snow Leopard OS X installation. At first I thought it was NVIDIA's fault, something to do with the driver. So I decided to compile the test against just the OpenCL framework provided by Apple, taking the files from NVIDIA's SDK and building the application with g++ *.cpp -framework OpenCL -o oclBandwidthTest

I get this rather interesting and also annoying result, which seems to say that there is a problem with how OS X treats my GPU card (GTX-275 1792MB), but I do not believe this:

./oclBandwidthTest64 Starting…

WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

Running on…

GeForce GTX 275

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2243.4

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2343.1

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6134.7

PASSED

Press <Enter> to Quit…

Since I am interested in making my machine usable for GPGPU, the first two figures need to approach 4 GB/s and the last needs to exceed 100 GB/s.

So I am attaching two files, the first 64-bit (oclBandwidthTest64) and the second 32-bit (oclBandwidthTest32), so that you can test them and post the results here, and also suggest, if you can, what I should do to fix this situation.

I do not believe there is an issue with OpenCL or my OS X, since in MatrixMultiplication I get a 220 ms execution time as opposed to 120 ms in Linux. This means that the card functions as it should. If the memory bandwidth really were 6 GB/s, it would be deadly slow. I will try to tweak the oclBandwidthTest file to see the actual execution time. There might just be a problem with how you folks at NVIDIA implemented the test, so that it shows this peculiar result on OS X.

Alex.

For paged memory / direct access, your host <-> device numbers look OK to me, i.e. I have roughly similar numbers on a GTX 285 on Windows. You should get higher throughput with pinned memory / direct access, but you won’t come close to PCIe’s theoretical 4 GB/s limit (for 16 lanes). IIRC, the best I’ve seen so far is about 3160 MB/s for device to host transfer.

However, your device <-> device bandwidth is indeed far too low. I’m seeing about 112000 MB/s there.

What machine are you running these tests on?

In Linux I see roughly the same device ↔ device bandwidth as you do (in Windows), since we both have the same GDDR3 memory. It is interesting that Matrix Multiplication works OK. This means that there is no problem with the GPU or OpenCL/CUDA. Something extremely bizarre is going on with the OS, maybe in their formula for computing the bandwidth. I will look at the code and post the actual time this transfer takes. It also seems that NVIDIA needs to do more in OpenGL. OpenCL and OpenGL seem to be the right choice for Mac OS, since you can do a lot of interesting stuff with their contexts and mix them together.

Does anyone have an iMac, Mac Pro, MacBook, or OSx86 system and want to try the executable I have posted on their cards?

Alex.

I have a custom build (PC parts) which reports as a Mac Pro 5,1 in the OS output. The card seems to be successfully recognized by the system, otherwise the driver would not work. Correct me if I am wrong.

CPU : Intel I7-920 12GB DDR3

GPU : MSI Lightning GTX-275 1792MB

The nice thing about Mac OS is that it exposes two devices, the CPU and the GPU, so I can experiment on both architectures. Correct me if I am wrong. But I need to be certain that I am not going to code on an underutilized GPU. The Mac Store should be open today as well, so it is time to get a developer's license if I sort out all my troubles. :-)

Best,

Alex.

What do you know, I found a bug in NVIDIA's SDK for Mac OS X. Folks at NVIDIA, check your timing code in shrutils.cpp; it is erroneous on OS X. It runs the Linux version of the code, but that seems to be wrong for OS X. I am certain that whichever machine runs this test will get the same result.

Here is the actual bandwidth, which is sane. Correct me if using clock_t is not correct. I think it is.
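One way to cross-check a suspect timer is to measure the same piece of work with two independent clocks. The sketch below is not the SDK's code; the names `Timings` and `time_both` are made up for illustration. It compares the C `clock()` function, which reports processor time used by the program, with `std::chrono::steady_clock`, which reports elapsed wall time. The two can disagree whenever the host thread waits without using the CPU, so whether `clock()` is a suitable reference for a GPU transfer depends on how the OpenCL runtime waits, which is not established in this thread.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Holds the duration of one piece of work as seen by two different clocks.
struct Timings {
    double cpu_seconds;   // processor time consumed, from std::clock()
    double wall_seconds;  // elapsed real time, from std::chrono::steady_clock
};

// Runs `work` once and measures it with both clocks.
template <typename Work>
Timings time_both(Work work) {
    std::clock_t cpu_start = std::clock();
    auto wall_start = std::chrono::steady_clock::now();

    work();

    auto wall_stop = std::chrono::steady_clock::now();
    std::clock_t cpu_stop = std::clock();

    Timings t;
    t.cpu_seconds = static_cast<double>(cpu_stop - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(wall_stop - wall_start).count();
    return t;
}
```

Passing a lambda that only sleeps, for example `time_both([] { std::this_thread::sleep_for(std::chrono::milliseconds(200)); })`, typically shows a wall time of at least 0.2 s and a much smaller CPU time on Unix-like systems.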

./oclBandwidthTest64 Starting…

WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

Running on…

GeForce GTX 275

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 2313.8

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 2398.0

Memory Transfer took 0.046580 seconds using C clock_t

Memory Transfer took 1.068621 seconds using Nvidias clock

Device to Device Bandwidth, 1 Device(s)

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 137398.0

PASSED

Press <Enter> to Quit…


I have just exterminated the bug…

Also, you should put clFinish inside the loop; then you get the right results. In each iteration you should force the queue to complete the memory transfer…
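The loop structure described above can be sketched as follows. To keep the example independent of the OpenCL headers, the two steps are passed in as callables; in a real test `enqueue` would wrap a call such as clEnqueueWriteBuffer and `finish` would wrap clFinish on the same command queue. The function name and its parameters are illustrative, not taken from the SDK.

```cpp
#include <chrono>

// Times `iterations` transfers, waiting for each one to complete before
// enqueueing the next. Returns the total elapsed wall time in seconds.
template <typename Enqueue, typename Finish>
double time_synchronized_transfers(int iterations, Enqueue enqueue, Finish finish) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        enqueue();  // submit one memory transfer to the queue
        finish();   // block until the queue has completed it
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Waiting inside the loop means the timed interval covers only completed transfers, at the cost of one synchronization per iteration.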

./oclBandwidthTest64 Starting…

WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

Running on…

GeForce GTX 275

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 3048.8

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 3081.0

Device to Device Bandwidth, 1 Device(s)

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 162684.3

PASSED

;-)

For those who have Mac OS X 10.6.5 (probably everyone), here is the correct way to test the bandwidths in direct mode. I have corrected only the direct mode, not the pinned one. I have also replaced CL_DEVICE_TYPE_GPU with CL_DEVICE_TYPE_ALL so that you can test the CPU memory transfer and see that you get 8 GB/s device to device, which is correct for DDR3 1066 MHz. So this means that on a Mac the CPU can be programmed in the same uniform way as the GPU. :-)

Here is the output which looks correct:

localhost:oclBandwidthTest agalex$ ./oclBandwidthTest64 --device=0
./oclBandwidthTest64 Starting…

WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

Running on…

GeForce GTX 275

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3056.9

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3095.0

1048576
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 166579.9

PASSED

Press <Enter> to Quit…

localhost:oclBandwidthTest agalex$ ./oclBandwidthTest64 --device=1
./oclBandwidthTest64 Starting…

WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!

Running on…

Intel(R) Core™ i7 CPU 920 @ 2.67GHz

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4480.5

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4492.7

1048576
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 8699.8

PASSED

Press <Enter> to Quit…

Are you a registered developer and have you reported the issue to the bug tracker? If not, please do so. If you’re not a registered developer, please attach your patches to the source code here and I will create a bug report for you. Thanks.

No, I am not registered, at least not for now. I am just beginning to learn OpenCL. I am also not certain whether the patch I propose actually solves the problem. The problem is in their distribution of the OpenCL SDK, and the specific program is oclBandwidthTest. It has been tested on a Mac notebook and on my machine, and it produces erroneous results on both.

For instance, on the MacBook the output is:

Running on…

GeForce 9400M

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 3605.7

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 4538.0

Device to Device Bandwidth, 1 Device(s)

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 5370.5

PASSED

So the results are clearly wrong. I guess this can be fed into the bug report. Unfortunately I do not have time to experiment more since I have a deadline for another activity.

Alexander.

Don’t get me wrong, I was not asking for any extra work. I thought you already had a patch to the source code of oclBandwidthTest and/or shrUtils ready (or could easily create one by diffing the original against your current version) and could post the patch(es) here, so people can compile their own binaries of the fixed examples.

Yes, I do have my version of these two files. I am attaching them.

Unzip the file BanndwidthTest into a directory and type this in the console window:

g++ -I./ *.cpp -framework OpenCL -o oclBandwidthTest

Then run ./oclBandwidthTest and check the results.

Alexander.

I’ve reported this as issue 780007 to NVIDIA.