CUDA matrix multiplication with different sizes

Hi… I want to write a kernel that performs a multiplication like
c(256,1)=a(256,512).*b(512,1)
How can I do this?
How many threads should I use per block?
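
For reference, here is a minimal sketch of one way this could be set up. It assumes the operation meant is an ordinary matrix-vector product (each c(i) is the sum over k of a(i,k)*b(k)) rather than an element-wise `.*`, that the data is `float`, and that `a` is stored row-major. The names `matvec` and `matvec_demo_max_error`, the test values, and the choice of 128 threads per block are illustrative assumptions only, not a recommendation tuned for any particular GPU.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One thread per output row: thread `row` computes
//   c[row] = sum over k of a[row][k] * b[k]
// Assumption: a is row-major, so element (row, k) lives at a[row * cols + k].
__global__ void matvec(const float *a, const float *b, float *c, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {                      // guard in case the grid overshoots rows
        float sum = 0.0f;
        for (int k = 0; k < cols; ++k) {
            sum += a[row * cols + k] * b[k];
        }
        c[row] = sum;
    }
}

// Hypothetical helper: runs the kernel on simple test data and returns the
// largest absolute difference from a CPU reference (or -1 on a CUDA error).
float matvec_demo_max_error()
{
    const int rows = 256, cols = 512;      // sizes taken from the question
    std::vector<float> h_a(rows * cols, 1.0f), h_b(cols, 2.0f), h_c(rows, 0.0f);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, h_a.size() * sizeof(float));
    cudaMalloc(&d_b, h_b.size() * sizeof(float));
    cudaMalloc(&d_c, h_c.size() * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), h_b.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Assumed launch configuration: 128 threads per block (a multiple of the
    // warp size, 32), and enough blocks to cover all rows: 256 / 128 = 2.
    const int threadsPerBlock = 128;
    const int blocks = (rows + threadsPerBlock - 1) / threadsPerBlock;
    matvec<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, rows, cols);

    cudaError_t err = cudaGetLastError();              // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(h_c.data(), d_c, h_c.size() * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1.0f;
    }

    // Compare against a plain CPU loop.
    float maxErr = 0.0f;
    for (int row = 0; row < rows; ++row) {
        float ref = 0.0f;
        for (int k = 0; k < cols; ++k) ref += h_a[row * cols + k] * h_b[k];
        maxErr = std::fmax(maxErr, std::fabs(h_c[row] - ref));
    }
    return maxErr;
}

int main()
{
    std::printf("max abs error vs CPU: %f\n", matvec_demo_max_error());
    return 0;
}
```

With only 256 output rows, a single block of 256 threads would also cover the problem, since current GPUs allow up to 1024 threads per block. The one-thread-per-row layout is the simplest to reason about but probably not the fastest, because neighbouring threads read addresses that are a full row apart; if performance matters, it may be worth comparing against a library routine such as cuBLAS `cublasSgemv`.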