Changes in Interpretation of DeviceInfo values

I changed two things at the same time: the card (8800 to 480) and the driver (196 to 197). The values returned by getDeviceInfo() are of course different, but I do not know which change caused some of the differences. I am actually basing this on a printout from 195, but I am pretty sure 196 is the same.

Here are the differences I know are caused solely by the 480, to get them out of the way:

  • Global Mem Cache Type is now CL_READ_WRITE_CACHE instead of CL_NONE
  • Global Mem Cacheline Size is now 128 instead of 0
  • Global Mem Cache Size & Global Mem Size are now 1503 MB instead of 768 MB
  • Local Mem Size is now 48 KB instead of 16 KB
  • Preferred Vector Width Double is now 1 instead of 0
  • New Extensions available: the 4 atomics extensions & cl_khr_fp64

Here are the ones I cannot definitely pin on the card change, and having both cards in the machine at once does not work:

  • Max Compute Units for (8800/196): 16 for (480/197): 15. I think this value must have been redefined from 1 unit for every 8 cores to 1 for every 32. OpenCL Programming Guide Appendix A still shows 1 for 8. FYI,…ammingGuide.pdf shows version 2.3 on the title page with a date of 2/18/10. I have a downloaded copy that is also version 2.3 but dated 8/27/09. Not very reassuring.

  • Max Work Items Sizes for (8800/196): 512/512/64 for (480/197): 1024/1024/64. These values have changed across releases. I seem to remember that before the first public beta it was 1024/1024/1024. The question is: does this value vary by the GPU?

  • Max Work Group Size (512 to 1024).

  • Max Clock Frequency for (8800/196): 1350 MHz for (480/197): 810 MHz. Some of this is due to the card, but it must have been redefined from reporting the shader clock to the core clock, right?

  • Mem Base Addr Bit Align for (8800/196): 256 for (480/197): 512.

It might be helpful to developers to document which deviceInfo values are constant across the platform, as opposed to the ones that actually describe the device.

  • Max Compute Units for (8800/196): 16 for (480/197):

Is it the number of streaming processors? 15 for the 480?

I understand your confusion. There are so many different terms that mean the same thing: shaders, streaming processors, & CUDA cores. OpenCL compute units are an aggregation of a number of processors. The question is: if I had upgraded to 197 as a separate step, would the 8800 now report 4, meaning the ratio had changed?

FYI, on OSX, Apple reports a compute unit as equal to 1 processor for nVidia GPUs.

The whole CUDA cores thing isn’t exactly how the hardware works. The hardware is organized into what are called SMs (streaming multiprocessors, I think?), which are the physical units that execute warps. Inside each SM you have individual execution units, and those units are what is meant by CUDA cores. From a programmer POV, the number of SMs is what you care about (because that’s the number that’s going to determine how much work can be resident on the chip at a time), not the number of CUDA cores.

One SM in the GTX 480 contains 32 CUDA cores, whereas in previous generations one SM had eight CUDA cores.

I regret that I now know what the 8800 reports under 197.45 (not 197.41, which is exclusively for the 470/480). The answer is that it reports almost exactly the same thing as under 196, except for a minor Global Mem Size difference. So it looks like Work Item Sizes are GPU dependent, and the clock frequency definition is not consistent across devices.

The reason I now know this is I just tried a lot of different configurations to confirm a bug in 197.41/45. I write OpenCL function modules which can be assembled either into a Q/A kernel or a final system kernel. Atomics are only used in some of the final system kernels, so all the Q/A kernels ran with the 8800 on 196. I have verified that the module portion of a working 196.21/8800 Q/A kernel no longer compiles with either 197.41/480 or 197.45/8800.

Tomorrow I’ll isolate it, submit a bug report, maybe post to the forum, and switch back to OSX. Sigh.

Well, it is now tomorrow, and the bug was actually in 196 and is currently still in OSX (2/4/10 version). In one spot, I was using a float4 to return multiple values from a function, which I then referenced like ‘var[2]’ instead of ‘var.s2’. No compiler complained until now.

That is easily fixed! I guess Apple is going to get the bug report. The first of my system kernels ran with a worksize of null. This kernel was one that could also run on the 8800. Uncalibrated, the 480 still beat the best time of the 8800 by almost 2x. This kernel does a whole lot of Image I/O, so I expect the calibrated time to be much better.

The second system kernel blows up on queuing with a CL_MEM_OBJECT_ALLOCATION_FAILURE, which I had not heard of till now, but it looks like I am getting close!