No, this is the number of work items within a work group, not the whole NDRange (grid). I believe there's no way to query the maximum size of the NDRange.
Daniel, FYI there’s an error in your code that will manifest with newer driver versions:
cl_context context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);
This should not work, you will be required to supply a valid platform id instead of 0.
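Something along these lines should work instead (a sketch, not tested against your setup; it assumes the OpenCL headers are available and that at least one platform is installed):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    /* Query an available platform first; newer drivers reject a NULL
       properties list and require an explicit CL_CONTEXT_PLATFORM entry. */
    cl_platform_id platform;
    cl_int err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no OpenCL platform\n"); return 1; }

    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0  /* terminator */
    };
    cl_context context =
        clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, &err);
    if (err != CL_SUCCESS) { fprintf(stderr, "context creation failed\n"); return 1; }

    clReleaseContext(context);
    return 0;
}
```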
Now, as for the source of your problems:
You have roughly 8 million floats to add. Two input vectors, one output: 16M reads and 8M writes. At 4 bytes per float that's 64MB of reads and 32MB of writes, and the amount of computation is negligible, so we're memory-bound.
Your CPU's realistic read/write RAM bandwidth is around 5GB/s. So the best time you can get is simply 64MB/(5GB/s) + 32MB/(5GB/s) ~= 0.01875 s. You're probably not getting anywhere near this because your loop is not unrolled and you're not using SSE vector loads and stores. It's a theoretical peak anyway, never mind that.
With the GPU, you need to copy the data to the device (2GB/s if you're lucky on your platform, without pinned memory), compute (6.4 GB/s peak bandwidth on an 8400, 5GB/s realistic) and copy back (2GB/s). You can probably see the problem right here but let's do the calculations anyway:
64MB/(2GB/s) + 96MB/(5GB/s) + 32MB/(2GB/s) ~= 0.065625 s
The analytical difference in performance is 3.5x. Your CPU should be about that much faster than your GPU at this task.
Vector addition is simply not a good problem for the GPU. You're exploiting neither its superior memory bandwidth (well, not really superior on the 8400, but generally), since most of the time is wasted on device<->host transfers, nor its compute throughput.