How to return an object from a device function?

If I want to return an object from a device function, as in the case below, an assignment operator will be called to make a deep copy from the returned object to A2. Is there any mechanism in CUDA, like rvalue references in C++, to avoid this?

template<typename T>
__host__ __device__
NRmatrix<T> trans(const NRmatrix<T>& A){
	NRmatrix<T> At;	// local result (the transpose of A)
	return At;
}

template<typename T>
__global__ void RunGLS_OnGPU(NRmatrix<T> const& A /*, some arguments */){
	NRmatrix<T> A2 = trans(A);
	// or something like
	// NRmatrix<T>&& A2 = trans(A);
}

I’d be mildly surprised if whatever you’re wanting to do in C++ doesn’t work in CUDA device code.

For example, && as a C++11 rvalue reference should work fine in CUDA device code.
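Here is a minimal sketch of what I mean, not your actual class. Matrix is a hypothetical stand-in for NRmatrix, and the macro shim at the top is my own assumption so that the same file also builds with a plain C++11 compiler. The copy constructor is deleted on purpose: if returning by value needed a deep copy, this would not compile. Both forms of the call site should therefore do the same amount of work.

```cpp
// Outside nvcc, define the CUDA qualifiers away so this builds as plain C++11.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

#include <cstddef>

// Hypothetical stand-in for NRmatrix<T>: owns a heap buffer, move-only.
template <typename T>
class Matrix {
public:
    __host__ __device__ Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(new T[rows * cols]()) {}

    // Deleted: a deep copy anywhere below would be a compile error.
    Matrix(const Matrix&) = delete;
    Matrix& operator=(const Matrix&) = delete;

    // Move constructor: take over the buffer instead of copying it.
    __host__ __device__ Matrix(Matrix&& other) noexcept
        : rows_(other.rows_), cols_(other.cols_), data_(other.data_) {
        other.rows_ = 0;
        other.cols_ = 0;
        other.data_ = nullptr;
    }

    __host__ __device__ ~Matrix() { delete[] data_; }

    __host__ __device__ T& operator()(std::size_t i, std::size_t j) {
        return data_[i * cols_ + j];
    }
    __host__ __device__ const T& operator()(std::size_t i, std::size_t j) const {
        return data_[i * cols_ + j];
    }
    __host__ __device__ std::size_t rows() const { return rows_; }
    __host__ __device__ std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    T* data_;
};

// Returns the transpose by value; the local is elided or moved, never copied.
template <typename T>
__host__ __device__ Matrix<T> trans(const Matrix<T>& a) {
    Matrix<T> t(a.cols(), a.rows());
    for (std::size_t i = 0; i < a.rows(); ++i)
        for (std::size_t j = 0; j < a.cols(); ++j)
            t(j, i) = a(i, j);
    return t;
}

// Exercises both call-site forms from the question.
__host__ __device__ inline int demo() {
    Matrix<int> a(2, 3);
    for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            a(i, j) = static_cast<int>(10 * i + j);

    Matrix<int> b = trans(a);    // initialized from the return value, no deep copy
    Matrix<int>&& c = trans(a);  // binds to the temporary and extends its lifetime
    return b(2, 1) + c(1, 0);    // a(1,2) + a(0,1)
}

#ifdef __CUDACC__
// Device-side use: the same code path, run by a single thread.
__global__ void demo_kernel(int* out) { *out = demo(); }
#endif
```

Note that the rvalue reference at the call site does not move anything by itself. It only binds to the temporary that trans returns, so whether a copy is avoided depends on the class having a move constructor (or on the compiler eliding the copy), not on how A2 is declared.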

Thank you. But when I use an rvalue reference in CUDA device code, it takes 56.8 seconds; without it, it takes 56.651 seconds, so the rvalue-reference version is slightly slower. In the CPU version of my code, using an rvalue reference makes it around 4 seconds faster.