Mersenne Twister in C++/CUDA Within a Single Function

Due to the limited functionality of MATLAB CUDA kernel objects, I need to generate pseudorandom numbers (with the Mersenne Twister, for maximum quality) without using special input types or multiple functions.

Requirement for this function:
Inside a “for” loop of 500000 iterations, it must be able to generate a new random decimal number in the interval [0,1] at every iteration.

So everything has to live within a single CUDA function, i.e.:

__global__ void myFunction()
{
    // Flatten the 2D grid of 2D blocks into a single global thread index.
    int threadsPerBlock = blockDim.x * blockDim.y;
    int blockId = blockIdx.x + (blockIdx.y * gridDim.x);
    int threadId = threadIdx.x + (threadIdx.y * blockDim.x);
    int globalIdx = (blockId * threadsPerBlock) + threadId;
    // random number generation code goes here
}
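
To make the requirement concrete, here is a rough sketch of the kind of self-contained routine being asked about. It keeps the whole MT19937 state in a local array and uses the standard published MT19937 constants. The function name fillUniformMT, its parameters, and the output buffer are hypothetical and exist only so the sketch can be compiled and checked on the host. Inside a kernel, the body would presumably sit where the placeholder comment is, with each value consumed directly in the loop. It also assumes that unsigned int is 32 bits wide.

```cpp
// Hypothetical helper (not from any library): writes `count` uniform doubles
// in [0,1] to `out`, using an MT19937 state held entirely in local variables.
// Assumes `unsigned int` is 32 bits wide.
void fillUniformMT(unsigned int seed, double *out, int count)
{
    const int N = 624;   // number of state words
    const int M = 397;   // offset used by the recurrence
    unsigned int mt[N];  // generator state, about 2.5 KB per caller

    // Seed the state with the standard MT19937 initialisation recurrence.
    mt[0] = seed;
    for (int i = 1; i < N; ++i)
        mt[i] = 1812433253u * (mt[i - 1] ^ (mt[i - 1] >> 30)) + (unsigned int)i;

    int idx = N;  // forces a state refresh before the first output

    for (int k = 0; k < count; ++k) {
        if (idx >= N) {
            // Regenerate all N state words in place (the "twist" step).
            for (int i = 0; i < N; ++i) {
                unsigned int y = (mt[i] & 0x80000000u)
                               | (mt[(i + 1) % N] & 0x7fffffffu);
                mt[i] = mt[(i + M) % N] ^ (y >> 1)
                      ^ ((y & 1u) ? 0x9908b0dfu : 0u);
            }
            idx = 0;
        }

        // Temper the next state word to get a 32-bit output.
        unsigned int y = mt[idx++];
        y ^= y >> 11;
        y ^= (y << 7) & 0x9d2c5680u;
        y ^= (y << 15) & 0xefc60000u;
        y ^= y >> 18;

        // Scale to the closed interval [0,1] by dividing by 2^32 - 1.
        out[k] = y * (1.0 / 4294967295.0);
    }
}
```

In a kernel, each thread would need its own seed, for example one derived from globalIdx. That is an assumption on my part: seeding Mersenne Twister with consecutive integers is not guaranteed to give statistically independent streams. The 624-word state per thread would also end up in local memory, which may be slow.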

Is such a feat possible? Does the code for this already exist?

Thanks so much, and forgive my ignorance of the Mersenne Twister implementation. I tried to answer this question myself, but I could not understand how the Mersenne Twister is implemented.

Also, I don’t know if it matters, but I need a solution for CUDA 6.0.