How is the data laid out for an in-place CUFFT_R2C transform?

I want to compute the FFT of a real matrix (float A[N1][N2]), where both N1 and N2 are powers of 2, and have the output share storage with A, that is, in-place.
So, do I need to allocate memory for N1*(N2+2) floats? Should I set A[i][N2] = A[i][N2+1] = 0?
And after the FFT, how can I convert the result, which is an N1 x (N2/2+1) complex matrix, to the normal form, that is, a full N1 x N2 complex matrix?
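
To make the question concrete, here is a host-side C sketch of the layout as I currently understand it. It is not a confirmed answer. It assumes each row is padded to 2*(N2/2+1) floats, that the padding values are not read by the transform, and that the output holds N1 x (N2/2+1) interleaved complex values whose missing columns follow from Hermitian symmetry, X[k1][k2] = conj(X[(N1-k1)%N1][(N2-k2)%N2]). The helper names padded_stride, pack_padded and expand_hermitian are made up for illustration and are not part of cuFFT.

```c
#include <complex.h>
#include <stddef.h>

/* Row stride, in floats, of the padded in-place buffer (assumed layout). */
static size_t padded_stride(size_t n2)
{
    return 2 * (n2 / 2 + 1);
}

/* Copy a dense n1 x n2 real matrix into the padded layout.
 * The padding floats at the end of each row are left untouched;
 * the assumption is that the transform never reads them. */
static void pack_padded(const float *dense, float *padded,
                        size_t n1, size_t n2)
{
    size_t stride = padded_stride(n2);
    for (size_t i = 0; i < n1; ++i)
        for (size_t j = 0; j < n2; ++j)
            padded[i * stride + j] = dense[i * n2 + j];
}

/* Expand a half spectrum (n1 rows, n2/2+1 columns) into the full
 * n1 x n2 spectrum using Hermitian symmetry of a real-input DFT. */
static void expand_hermitian(const float complex *half, float complex *full,
                             size_t n1, size_t n2)
{
    size_t nc = n2 / 2 + 1;              /* stored columns per row */
    for (size_t k1 = 0; k1 < n1; ++k1) {
        for (size_t k2 = 0; k2 < n2; ++k2) {
            if (k2 < nc) {
                /* Column is stored directly. */
                full[k1 * n2 + k2] = half[k1 * nc + k2];
            } else {
                /* Column is the conjugate of the mirrored element. */
                size_t m1 = (n1 - k1) % n1;
                size_t m2 = n2 - k2;
                full[k1 * n2 + k2] = conjf(half[m1 * nc + m2]);
            }
        }
    }
}
```

If this is right, I would copy the N1 x (N2/2+1) result back to the host and call something like expand_hermitian on it, or do the same mirroring in a small kernel. Is that the intended way?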

Thank you very much.