CUFFT R2C "bug"? Problems with in-place R2C 1D FFT

I'm getting wrong results when performing an in-place R2C FFT on multiple signals (using the batch parameter).
The out-of-place FFT comes out identical to the FFTW results. The in-place FFT is correct only for the first signal of the batch; every signal after that is completely different (it is not just an offset in memory, I've checked…).
Here's the relevant piece of the code:

#define FFT_SIZE 1024
#define ACTUAL_FFT_SIZE (FFT_SIZE/2+1) // R2C fft outputs only N/2+1 non-redundant values

unsigned int sig_mem_size = sizeof(float)*(ACTUAL_FFT_SIZE*2)*N;// padded size for in-place FFT; N is the number of signals
unsigned int fft_mem_size = sizeof(cufftComplex)*(ACTUAL_FFT_SIZE)*N;// actually the same as sig_mem_size

cufftReal* d_signal ;
cutilSafeCall(cudaMalloc((void**)&d_signal, sig_mem_size));
cufftComplex* d_ffted_signal ;
cutilSafeCall(cudaMalloc((void**)&d_ffted_signal, fft_mem_size));
cufftComplex* h_ffted_signal = (cufftComplex*)malloc(fft_mem_size);

// create signal on host & copy to d_signal
cufftHandle plan;
cufftSafeCall(cufftPlan1d(&plan,FFT_SIZE , CUFFT_R2C, N));

cufftSafeCall(cufftExecR2C(plan, (cufftReal *)d_signal, (cufftComplex *)d_ffted_signal));// out-of-place FFT 
cufftSafeCall(cufftExecR2C(plan, (cufftReal *)d_signal, (cufftComplex *)d_signal));// in-place FFT 

//copy out of place result to host and dump to file
cutilSafeCall(cudaMemcpy(h_ffted_signal, d_ffted_signal, fft_mem_size, cudaMemcpyDeviceToHost));

//copy in-place result to host and dump to file
cutilSafeCall(cudaMemcpy(h_ffted_signal, (cufftComplex *)d_signal, fft_mem_size, cudaMemcpyDeviceToHost));

// dumpSigToFile transforms the FFT output vector back into a matrix of signals for easier reading
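
For reference, the cuFFT documentation describes the input of an in-place R2C transform as padded, so that each signal occupies 2*(N/2+1) floats rather than N. The following is only a minimal host-side sketch of that layout, assuming the signals start out packed back to back with FFT_SIZE floats each; `padded_stride` and `pack_to_padded` are hypothetical helper names, not part of cuFFT, and this may not be how the signal is actually built in the code above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of cuFFT): number of floats each signal
 * occupies in a padded in-place R2C buffer, i.e. room for n/2+1 complex
 * values stored as pairs of floats. */
static size_t padded_stride(size_t n)
{
    return 2 * (n / 2 + 1);
}

/* Hypothetical helper (not part of cuFFT): copy `batch` signals of `n` real
 * samples from a packed buffer (stride n) into a padded buffer
 * (stride 2*(n/2+1) floats). The padded buffer must hold
 * batch * padded_stride(n) floats. */
static void pack_to_padded(const float *packed, float *padded,
                           size_t n, size_t batch)
{
    size_t stride = padded_stride(n);
    for (size_t b = 0; b < batch; ++b) {
        float *dst = padded + b * stride;
        memcpy(dst, packed + b * n, n * sizeof(float));
        /* Zero the padding floats at the end of each signal's row. */
        for (size_t i = n; i < stride; ++i)
            dst[i] = 0.0f;
    }
}
```

With FFT_SIZE = 1024 the padded stride is 1026 floats per signal, which matches the ACTUAL_FFT_SIZE*2 used for sig_mem_size. As far as I understand the documented defaults, the out-of-place call reads signals at a stride of FFT_SIZE instead, so a single input layout in d_signal would not suit both calls.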

Please help…