I'm new to GPU programming. I tried a small program that adds two 256 x 256 matrices. The results were correct; however, the CUDA Fortran version took much longer than the serial version. I was wondering how much time is required to set up each kernel. My guess is that there is not enough computation here to benefit from parallelizing on the GPU.
Also, can you recommend a book or article that explains what goes on behind the scenes on a GPU?
P.S.: I used 256 blocks, each with 256 threads (i.e. <<<256,256>>>). I also tried DIM3 with various configurations, but they were all slower than the serial version.