CPU vs GPU optimizations


I have implemented a straightforward naive matrix multiplication in OpenCL with the AMD SDK, and I get a speedup of around 16 on an 8-core CPU system when running it only on the CPU. I then applied some popular optimizations: using private memory, using local memory, and grouping my work-items in one dimension so that I use both global and local work sizes. Now I get a speedup of around 24 on the same 8-core CPU.
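For reference, the naive version is just the textbook triple loop. A plain-C equivalent of that naive kernel (a sketch of the same algorithm, not my actual OpenCL code) would be:

```c
#include <stddef.h>

/* Naive O(n^3) matrix multiplication: C = A * B, row-major n x n.
 * Each output element is computed independently, just like one
 * OpenCL work-item per element in the naive kernel. */
static void matmul_naive(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;               /* accumulator the compiler can keep in a register */
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}
```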
First, I wonder about this much speedup, because with 8 cores I normally get a speedup of around 8 or less with OpenMP, for example. So these figures of 16 and 24 amaze me. How is this possible?
Second, I had heard that optimizations like local/private memory and work-item grouping are meant only for GPUs, not CPUs, so I again wonder how I get such a boost in speedup when I run only on CPUs.
Third, I wonder how local memory, private memory, and grouping are handled on CPUs, since they clearly cause a speedup. Is it caches, processor registers, or something else? Getting this much speedup feels like magic.

Please help me clarify this. I am new to OpenCL, and it is giving me performance so good I can hardly believe it. I have verified the results and they are perfectly accurate.
Thanks in advance

What are you comparing with? A naive, scalar, single-threaded, one-element-at-a-time C implementation will be pretty slow, and a 24× improvement over that does not surprise me. In the best case, you could get a factor of 8 from the cores, a factor of 4 from the OpenCL compiler using SIMD instructions, and a few more from proper cache utilization. A good exercise is to work out how many of these techniques are missing from your baseline implementation, and then see how close to the OpenCL performance you can get by adding them.
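As one concrete step in that exercise, cache blocking is the CPU-side analogue of the local-memory optimization: a tiled variant of the triple loop keeps a small working set hot in cache. A sketch (the tile width is illustrative and should be tuned to your cache):

```c
#include <stddef.h>
#include <string.h>

#define TILE 32  /* illustrative tile width; tune to your L1 cache size */

/* Tiled matrix multiplication: C = A * B, row-major n x n.
 * Blocking the loops keeps TILE x TILE sub-matrices resident in
 * cache, similar in spirit to staging tiles in OpenCL local memory. */
static void matmul_tiled(size_t n, const float *A, const float *B, float *C) {
    memset(C, 0, n * n * sizeof *C);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float a = A[i * n + k];  /* reused across j, like private memory */
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Timing this against the naive loop on large matrices (say 1024×1024) should already recover a noticeable part of the gap, before you even add threads or SIMD.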

Regarding the internals of how a CPU implementation handles all of this, I have no idea. Asking in AMD's or Intel's forums might get better answers.

Okay, thank you so much :)