Get warp size from OpenCL: no information about warp size from clGetDeviceInfo

Please bear with me on this one.

We know that GPUs have no scalar capabilities. We look at work-items as threads, accept some small constraints on how to organize them, and voilà! We have an insanely fast, optimized program without even taking vectorization into account. In fact, we took care of vectorization when we took care of the aforementioned “small constraints”.

Now let’s take the case of Progagod, a fictional Russian programmer. He hears of OpenCL and, over a weekend, ports his simulation program to it. He runs it with OpenCL on the CPU and everything works as expected; he also tests it on some Cell processors at work with no problems. Then Progagod hears about how ultra-fast GPUs are, gets hold of one, and tests his program on it. He gets pathetic performance.

What happened is that Progagod’s program had no way of knowing that the CUDA “warp size” is 32, and that each compute unit needs at least 32 CUDA “threads”. It checked CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT (reported as 1) and CL_DEVICE_MAX_COMPUTE_UNITS to decide how to spread the calculations. I believe CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT should really be reported as 32, not 1. There is no other way to use clGetDeviceInfo to find the warp size, which is in fact the vector width of these massive vector machines. The only alternative is hard-coding it, which is both non-portable and a tad ugly.
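For reference, here is a minimal sketch (in C, error handling omitted, helper name mine) of the kind of query described above; on NVIDIA GPUs the preferred float vector width typically comes back as 1:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Print the two hints Progagod's program relied on. */
    static void print_device_hints(cl_device_id device)
    {
        cl_uint vec_width = 0;
        cl_uint compute_units = 0;

        /* Reported as 1 on NVIDIA GPUs, not the warp size of 32. */
        clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                        sizeof(vec_width), &vec_width, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);

        printf("preferred float vector width: %u\n", vec_width);
        printf("max compute units:            %u\n", compute_units);
    }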

AMD’s CPU implementation reports CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT as 4 and CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE as 2, which is indeed the hardware’s actual (SSE) vector width.

Is there a way to get the warp size, or information about it from OpenCL, or is this a problem that may have gone unnoticed?

According to the OpenCL spec, you can call clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and make your work-group size a multiple of the value it returns.
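A minimal sketch of that query (error handling omitted; kernel and device are assumed to be an already-built cl_kernel and the cl_device_id it was compiled for):

    size_t preferred_multiple = 0;

    /* Available since OpenCL 1.1; on NVIDIA GPUs this typically
       returns the warp size (32). */
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_multiple), &preferred_multiple,
                             NULL);

    /* Then choose a local work size that is a multiple of
       preferred_multiple when enqueuing the kernel. */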
