I am very new to CUDA. I am trying to write a dot product kernel, but I am not sure how to start.
I have a float2 arrayA of size [ROWS * COLS], a float2 arrayB of size [COLS], and an output float2 arrayC of size [ROWS].
For every row of arrayA, I need to compute its dot product with arrayB and store the result in the corresponding element of arrayC. So in total I will compute ROWS dot products, each over COLS elements, and end up with ROWS results.
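To make the intended computation concrete, here is a minimal CPU reference sketch of the result a kernel would need to reproduce. It rests on assumptions not stated above: arrayA is stored row-major, and each float2 is treated as a complex number (x real, y imaginary) multiplied without conjugation. If a component-wise product is what is wanted instead, the two accumulation lines would change. `Float2` and `rowDotReference` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for CUDA's float2 so this compiles with a plain C++ compiler.
struct Float2 {
    float x;
    float y;
};

// CPU reference: c[r] = sum over k of a[r * cols + k] * b[k].
// Assumption: float2 holds a complex number (x = real, y = imaginary) and the
// products are plain complex multiplications with no conjugation.
// Assumption: a is stored row-major, so row r starts at index r * cols.
std::vector<Float2> rowDotReference(const std::vector<Float2>& a,
                                    const std::vector<Float2>& b,
                                    std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    assert(b.size() == cols);

    std::vector<Float2> c(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        Float2 acc{0.0f, 0.0f};
        for (std::size_t k = 0; k < cols; ++k) {
            const Float2 p = a[r * cols + k];
            const Float2 q = b[k];
            acc.x += p.x * q.x - p.y * q.y;  // real part of p * q
            acc.y += p.x * q.y + p.y * q.x;  // imaginary part of p * q
        }
        c[r] = acc;  // one result per row
    }
    return c;
}
```

Each output element depends on only one row of arrayA plus all of arrayB, so the ROWS dot products are independent of one another. A reference like this could also be used to check a kernel's output on small inputs.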
Can anyone guide me on how to achieve this?