local_work_size on NVIDIA drivers


I have read on several forums that NVIDIA's OpenCL drivers require the user to set the local_work_size parameter passed to clEnqueueNDRangeKernel. If it is left out (passed as NULL), NVIDIA reportedly sets the value to 1, which causes much worse performance. Is this still the case, or has it been improved? The OpenCL implementations from both AMD and Intel select good values (multiples of 32 or 64), and Intel even recommends leaving this parameter undefined so that the driver determines it. How NVIDIA handles this affects the performance portability of a kernel.
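To make the question concrete, here is a minimal sketch of the two call styles I am comparing. The helper name round_up_global and the variables queue, kernel and n are my own placeholders, not anything from a vendor SDK, and the value 64 is only an example. The OpenCL calls are shown in a comment so the snippet compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): round a 1-D global work size
 * up to the next multiple of the chosen local work size. In OpenCL 1.x,
 * an explicit local_work_size must evenly divide global_work_size. */
size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    return ((global + local - 1) / local) * local;
}

/*
 * Usage sketch, assuming queue, kernel and n already exist:
 *
 *   // (a) Let the driver choose: pass NULL as local_work_size.
 *   size_t global_a = n;
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &global_a, NULL, 0, NULL, NULL);
 *
 *   // (b) Choose explicitly: 64 is an assumed value, not a recommendation.
 *   size_t local_b  = 64;
 *   size_t global_b = round_up_global(n, local_b);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &global_b, &local_b, 0, NULL, NULL);
 */
```

With variant (b) the padded global size can exceed n, so the kernel needs a guard such as `if (get_global_id(0) >= n) return;` for the extra work-items.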