Making a MEX file from CUDA code

I’m trying to generate a MEX file from a function I wrote in CUDA. It compiles without errors, but when I run it in MATLAB, MATLAB crashes (the application closes).

Here is my kernel:
__global__ void kernel_Reconstruction2(Setup* SetupLoaded_p, float* MediumZ_p, float* MediumX_p, float* TRansducerCorrZ_p, float* TRansducerCorrX_p
, int* RfData, int* Dir, int Dir_Size, int Reconstruct_SoundSpeed, int* ReconstructedImage_GPU, int transmit, int NStart_Transmit, int size, float* Device_ConvArrivalTime) {

int TID = threadIdx.y * blockDim.x + threadIdx.x;
int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
int GID = RowOFFset + BlockOFFset + TID;
int GID_RowBased = BlockOFFset + TID;
int D1, D2, sam, Pz_man, Px_man, receive, RoundTripSample, IndexingReceive, IndexingTransmit;
float ReceiveTime, RoundTripTime, TransmitTime;
if (GID_RowBased < size) {
	Px_man = (GID_RowBased) % (SetupLoaded_p->Nx);
	Pz_man = (GID_RowBased) / (SetupLoaded_p->Nx);
	receive = blockIdx.y;
	IndexingReceive = receive * Dir_Size + (GID_RowBased);
	IndexingTransmit = transmit * Dir_Size + (GID_RowBased);

	TransmitTime = (sqrtf(((TRansducerCorrX_p[transmit] - MediumX_p[Px_man]) * (TRansducerCorrX_p[transmit] - MediumX_p[Px_man])) + ((TRansducerCorrZ_p[transmit] - MediumZ_p[Pz_man]) * (TRansducerCorrZ_p[transmit] - MediumZ_p[Pz_man])))) / Reconstruct_SoundSpeed;
	ReceiveTime = (sqrtf(((TRansducerCorrX_p[receive] - MediumX_p[Px_man]) * (TRansducerCorrX_p[receive] - MediumX_p[Px_man])) + ((TRansducerCorrZ_p[receive] - MediumZ_p[Pz_man]) * (TRansducerCorrZ_p[receive] - MediumZ_p[Pz_man])))) / Reconstruct_SoundSpeed;
	RoundTripTime = (TransmitTime + ReceiveTime);
	RoundTripTime += (SetupLoaded_p->TransmissionOffset);
	RoundTripSample = lroundf(RoundTripTime * SetupLoaded_p->Fs) - 1;

	ReconstructedImage_GPU[GID_RowBased] += ((RfData[RoundTripSample + ((receive)*SetupLoaded_p->NumberOfSamples)])
		* (Dir[IndexingReceive] * Dir[IndexingTransmit])); //  (Dir[Pz_man * SetupLoaded_p->Nx + Px_man])
}
}
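
For reference, one way to rule out an out-of-range read in RfData would be to replay the kernel's index arithmetic on the host. The following is only a minimal sketch in plain C++, not code from my project: SetupHost, flatPixelIndex, roundTripSample and sampleInBounds are hypothetical names, and the field types of Setup are assumed because the struct definition is not shown here. Device results may differ slightly when -use_fast_math is enabled.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side mirror of the Setup fields the kernel reads.
// The real struct is not shown in the post; these types are assumptions.
struct SetupHost {
    int   Nx;                  // pixels per image row
    int   NumberOfSamples;     // samples per receive channel
    float Fs;                  // sampling frequency
    float TransmissionOffset;  // time offset added to the round trip
};

// Same arithmetic as GID_RowBased in the kernel:
// BlockOFFset (blockDim.x * blockDim.y * blockIdx.x) plus TID.
int flatPixelIndex(int blockIdxX, int blockDimX, int blockDimY,
                   int threadIdxX, int threadIdxY) {
    assert(blockDimX > 0 && blockDimY > 0);
    int tid = threadIdxY * blockDimX + threadIdxX;
    int blockOffset = blockDimX * blockDimY * blockIdxX;
    return blockOffset + tid;
}

// Same formula as the kernel: transmit and receive path lengths divided by
// the sound speed, plus the offset, converted to a 0-based sample index.
long roundTripSample(const SetupHost& s,
                     float txX, float txZ, float rxX, float rxZ,
                     float px, float pz, int soundSpeed) {
    float transmitTime =
        std::sqrt((txX - px) * (txX - px) + (txZ - pz) * (txZ - pz)) / soundSpeed;
    float receiveTime =
        std::sqrt((rxX - px) * (rxX - px) + (rxZ - pz) * (rxZ - pz)) / soundSpeed;
    float roundTripTime = transmitTime + receiveTime + s.TransmissionOffset;
    return std::lround(roundTripTime * s.Fs) - 1;
}

// True when the sample index stays inside one receive channel, so that
// sample + receive * NumberOfSamples cannot run past that channel's data.
bool sampleInBounds(const SetupHost& s, long sample) {
    return sample >= 0 && sample < s.NumberOfSamples;
}
```

For example, with both transducer elements at the origin, a pixel at (3, 4), a sound speed of 1000 and Fs = 1000, the round trip time is 0.01 and roundTripSample returns 9. Any pixel for which sampleInBounds returns false would make the kernel read outside RfData.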



I have already built a project in Visual Studio and can confirm that this kernel works correctly there. Under "Configuration Properties" > "CUDA C/C++" > "Command Line" > "Additional Options", I added -Xptxas -dlcm=cg -use_fast_math to bypass the L1 cache for global loads and to enable CUDA's fast-math option.

Everything also works in the MEX gateway I have written, as long as I do not launch the kernel from it. In MATLAB, I compile the gateway into a MEX file with mexcuda -v COMPFLAGS='$COMPFLAGS -use_fast_math -cudart static'. Once the kernel launch is added to the gateway and the gateway is compiled and run in MATLAB, MATLAB closes, as described above.
I suspect the problem is the compile command I use in MATLAB. Should it be different from what it is now? Please help.