Please bear with me on this one.
We know that GPUs have no scalar capabilities. We treat work units as threads, place a few small constraints on how they are organized, and voilà: we have an insanely fast, optimized program without even thinking about vectorization. In fact, vectorization was taken care of by those very "small constraints".
Now let's take the case of Progagod, a fictional Russian programmer. He hears of OpenCL and, over a weekend, ports his simulation program to it. He runs OpenCL on the CPU, and everything works as expected; he also tests it on some Cell processors at work, no problem. Then Progagod hears about how ultra-fast GPUs are, gets hold of a few, and tests them. He gets pathetic performance.
What happened is that Progagod's program had no way of knowing that the CUDA "warp size" is 32, and that each compute unit needs at least 32 CUDA "threads" to stay busy. It checked CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT (reported as 1) and CL_DEVICE_MAX_COMPUTE_UNITS to decide how to spread the calculation across the device. I believe CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT should really be reported as 32 by OpenCL, not 1. There is no other way to use clGetDeviceInfo to find the warp size, which is in fact the vector width of these massive vector machines. The only alternative is hard-coding it, which is both non-portable and a tad bit ugly.
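For concreteness, here is roughly the query a portable program like Progagod's would perform. This is a minimal sketch: it assumes a `device` handle was already obtained via `clGetPlatformIDs`/`clGetDeviceIDs`, and it omits error checking for brevity.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Print the two values a portable program would use to partition work.
   Assumes `device` was obtained earlier from clGetDeviceIDs; error
   checking is omitted for brevity. */
static void print_partition_hints(cl_device_id device)
{
    cl_uint vec_width = 0, compute_units = 0;

    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(vec_width), &vec_width, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);

    /* On NVIDIA GPUs this reports a vector width of 1, even though
       the hardware effectively executes 32-wide warps. */
    printf("preferred float vector width: %u\n", vec_width);
    printf("max compute units: %u\n", compute_units);
}
```

Nothing in these two queries reveals the 32-wide warp, which is exactly the problem.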
AMD's CPU implementation reports CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT as 4 and CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE as 2, which is in fact correct: an SSE register holds four floats or two doubles.
Is there a way to get the warp size, or information about it, from OpenCL, or is this a problem that has simply gone unnoticed?