Can you help me measure the kernel run time and the memory transfer time?

Hey,
I want to measure some timings for my algorithm. Can you help me with how to do that?
First, I have to measure the total processing time; that is the easy one.
But I also have to measure the run time of the kernel alone, without the memory transfers.
Can you help me do that? Note that, as far as I know, the kernel call is asynchronous,
so this code will >>NOT<< work (the timer stops before the kernel has finished):

// create and start timer
unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));

//kernel call
TestKernel <<<gd , bd>>> ();

// stop and destroy timer
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));
CUT_SAFE_CALL(cutDeleteTimer(timer));
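
For comparison, here is a minimal sketch of one common way to time only the kernel, using CUDA events from the runtime API instead of the cutil host timer. The empty TestKernel body and the gd/bd values are placeholders I made up for illustration, and error checking is left out to keep it short:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for TestKernel; the real kernel body is not shown in the post.
__global__ void TestKernel() {}

int main() {
    dim3 gd(64), bd(256);  // assumed launch configuration, for illustration only

    // Events are timestamps recorded on the GPU, in stream order.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);    // enqueue the start marker on stream 0
    TestKernel<<<gd, bd>>>();     // asynchronous launch, returns immediately
    cudaEventRecord(stop, 0);     // enqueue the stop marker behind the kernel
    cudaEventSynchronize(stop);   // block the host until the stop marker is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in ms
    printf("Kernel time: %f (ms)\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Another option would be to keep the host timer and call cudaDeviceSynchronize() (cudaThreadSynchronize() in older toolkits) after the launch and before stopping the timer, so the host waits for the kernel to finish.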

Also, I have to measure the transfer time alone. Can you give me some information on how to do that?
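
For the transfers, a sketch along the same lines might bracket each cudaMemcpy with CUDA events. The buffer size and the names h_data and d_data are assumptions for illustration, not taken from my actual code, and error checking is omitted:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;                 // assumed element count
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);   // hypothetical host buffer
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = NULL;                     // hypothetical device buffer
    cudaMalloc((void **)&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float h2d_ms = 0.0f, d2h_ms = 0.0f;

    // Host -> device transfer only
    cudaEventRecord(start, 0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&h2d_ms, start, stop);

    // Device -> host transfer only
    cudaEventRecord(start, 0);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&d2h_ms, start, stop);

    printf("Host to device: %f (ms)\n", h2d_ms);
    printf("Device to host: %f (ms)\n", d2h_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

The first transfer in a program can include one-time setup cost, so it may be worth doing a warm-up copy or averaging over several runs before trusting the numbers.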

async processing example