CUDA programming problem

Hi, I am doing my thesis on compressive sensing and want to run my code on the GPU, so I am programming in CUDA.
In the kernel, I want to do something like this:
kernel input  -> phi(256,512), signal(512,1)
kernel output -> result(512,1)
kernel processing:
std = standard deviation of y
// some processing, after which the output is stored in result(512,1)
Here phi is a constant array with the dimensions given above (256 x 512).
Now, how many threads and blocks should I create, and how should I handle y, std, and the other temporary variables?

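You could use the cuBLAS library; look at the GEMV function (cublasDgemv for double precision, cublasSgemv for single precision). It computes a matrix-vector product, and the library chooses the thread and block configuration for you.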
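As a rough sketch of that suggestion, the snippet below assumes that y is the product phi * signal (the question never defines y, so this is a guess) and that phi is stored row-major on the host. Because cuBLAS expects column-major storage, the row-major 256 x 512 array is passed as a 512 x 256 matrix with CUBLAS_OP_T. The test data, the helper populationStd, and the choice of the population definition of the standard deviation are all illustrative assumptions, not something from the original post.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: population standard deviation (divides by n).
// Which definition of std the question intends is an assumption.
static float populationStd(const std::vector<float>& v) {
    if (v.empty()) return 0.0f;
    double mean = 0.0;
    for (float x : v) mean += x;
    mean /= v.size();
    double var = 0.0;
    for (float x : v) var += (x - mean) * (x - mean);
    return static_cast<float>(std::sqrt(var / v.size()));
}

int main() {
    const int M = 256, N = 512;  // phi is M x N, signal is N x 1, y is M x 1

    // Placeholder data: phi selects the first M entries of signal.
    std::vector<float> hPhi(M * N), hSignal(N), hY(M);
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c)
            hPhi[r * N + c] = (r == c) ? 1.0f : 0.0f;  // row-major layout
    for (int c = 0; c < N; ++c) hSignal[c] = static_cast<float>(c);

    float *dPhi = nullptr, *dSignal = nullptr, *dY = nullptr;
    cudaMalloc(&dPhi, M * N * sizeof(float));
    cudaMalloc(&dSignal, N * sizeof(float));
    cudaMalloc(&dY, M * sizeof(float));
    cudaMemcpy(dPhi, hPhi.data(), M * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dSignal, hSignal.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // Row-major M x N phi looks like a column-major N x M matrix to cuBLAS,
    // so y = phi * signal is requested as y = A^T * x with A of size N x M.
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasSgemv(handle, CUBLAS_OP_T, N, M, &alpha,
                                    dPhi, N, dSignal, 1, &beta, dY, 1);
    if (st != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasSgemv failed\n");
        return 1;
    }

    // The blocking copy also waits for the GEMV to finish.
    cudaMemcpy(hY.data(), dY, M * sizeof(float), cudaMemcpyDeviceToHost);

    // y has only 256 entries, so computing std on the host is cheap here.
    std::printf("y[1] = %f, std(y) = %f\n", hY[1], populationStd(hY));

    cublasDestroy(handle);
    cudaFree(dPhi);
    cudaFree(dSignal);
    cudaFree(dY);
    return 0;
}
```

This should build with something like nvcc example.cu -lcublas. For a vector this small, copying y back and computing std on the host is probably simpler than writing a reduction kernel; if the later processing stays on the GPU, y can instead be left in device memory and reduced there.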

Do you have any idea about Thrust in CUDA?