cuFFT and FFTW discrepancies

I suspect this question is answered somewhere, but I have searched and cannot find it.
I am using cuFFT 4.0 and am getting discrepancies between FFTW and cuFFT results. I created a data set and set up 3D in-place forward and reverse double-precision FFTs using both FFTW and cuFFT, then compared the results. Most of the cuFFT entries were within 10E-9 of the FFTW entries, but several were not within that tolerance. The problem goes away when I switch cuFFT to compatibility mode FFTW_ALL.
I believe I have the data layout set up correctly. There is space for N/2 + 1 complex entries for both the cuFFT and FFTW transforms.
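For reference, the comparison described above could look roughly like the following. This is only a sketch: the helper name count_mismatches and the optional relative term rel_tol are assumptions for illustration, not something from the original test. Passing rel_tol = 0 gives a purely absolute check like the one in the post.

```c
#include <stddef.h>

/* Absolute value without pulling in libm. */
static double abs_d(double v)
{
    return v < 0.0 ? -v : v;
}

/* Hypothetical helper: count the entries where a[i] and b[i] differ by
   more than abs_tol + rel_tol * max(|a[i]|, |b[i]|).
   With rel_tol == 0.0 this is a plain absolute-tolerance comparison. */
static size_t count_mismatches(const double *a, const double *b, size_t n,
                               double abs_tol, double rel_tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        double ma = abs_d(a[i]);
        double mb = abs_d(b[i]);
        double scale = ma > mb ? ma : mb;
        if (abs_d(a[i] - b[i]) > abs_tol + rel_tol * scale)
            ++bad;
    }
    return bad;
}
```

Only the entries that hold real data should be passed in. Padding entries are not defined by either library and may differ freely.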

For in-place transforms it is important to pad the matrix correctly. The padded matrix must be real, of size [lx][ly][lz+2], with the elements corresponding to z >= lz unused in real space. A real matrix of size [lx][ly][lz+2] takes the same amount of memory as a complex matrix of size [lx][ly][lz/2+1] (for even lz).
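A minimal sketch of the indexing this padding implies, assuming row-major storage with z as the fastest-varying index. The helper names (padded_lz, padded_index, pack_padded) are made up for illustration and are not part of FFTW or cuFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Real elements per row of the padded array: 2 * (lz/2 + 1).
   This is lz + 2 for even lz and lz + 1 for odd lz. */
static size_t padded_lz(size_t lz)
{
    return 2 * (lz / 2 + 1);
}

/* Linear index of real element (x, y, z) in the padded array. */
static size_t padded_index(size_t x, size_t y, size_t z,
                           size_t ly, size_t lz)
{
    assert(z < lz); /* entries with z >= lz are padding only */
    return (x * ly + y) * padded_lz(lz) + z;
}

/* Copy a densely packed [lx][ly][lz] real array into a padded buffer
   that holds lx * ly * padded_lz(lz) doubles. Padding is left untouched. */
static void pack_padded(const double *dense, double *padded,
                        size_t lx, size_t ly, size_t lz)
{
    for (size_t x = 0; x < lx; ++x)
        for (size_t y = 0; y < ly; ++y)
            for (size_t z = 0; z < lz; ++z)
                padded[padded_index(x, y, z, ly, lz)] =
                    dense[(x * ly + y) * lz + z];
}
```

The same padded buffer, reinterpreted as complex, has lz/2 + 1 entries per row, which is why the two sizes match.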