Apologies if this is a very basic question; I am very new to CUDA. I would like to do the following:

- Load a matrix M of size m × w (m rows, w columns) from the CPU to the GPU
- Load a vector W of size w from the CPU to the GPU
- Multiply M by W, producing a new vector V of size m (entirely on the GPU)
- Compute the maximum scalar value X of vector V (entirely on the GPU), then copy X from the GPU to the CPU
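
To make the four steps concrete, here is a minimal sketch of how they might look. It assumes M is stored row-major as 32-bit floats, and it uses Thrust's `device_vector` for the host-to-device copies and `thrust::max_element` for the reduction. The names `matvec_rows` and `max_of_matvec` are hypothetical, chosen only for illustration.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <stdexcept>
#include <vector>

// Naive kernel: one thread per row computes V[row] = dot(M[row, :], W).
// Assumption: M is row-major with m rows and w columns.
__global__ void matvec_rows(const float* M, const float* W, float* V,
                            size_t m, size_t w) {
    size_t row = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (row >= m) return;
    float acc = 0.0f;
    for (size_t j = 0; j < w; ++j)
        acc += M[row * w + j] * W[j];
    V[row] = acc;
}

// Hypothetical helper: returns X = max(M * W). Only X travels back to the host.
float max_of_matvec(const std::vector<float>& M, const std::vector<float>& W,
                    size_t m, size_t w) {
    if (m == 0 || M.size() != m * w || W.size() != w)
        throw std::invalid_argument("shape mismatch");

    // Steps 1 and 2: constructing a device_vector from host data copies it to the GPU.
    thrust::device_vector<float> dM(M.begin(), M.end());
    thrust::device_vector<float> dW(W.begin(), W.end());
    thrust::device_vector<float> dV(m);  // V lives only on the GPU

    // Step 3: V = M * W, computed entirely on the GPU.
    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((m + threads - 1) / threads);
    matvec_rows<<<blocks, threads>>>(thrust::raw_pointer_cast(dM.data()),
                                     thrust::raw_pointer_cast(dW.data()),
                                     thrust::raw_pointer_cast(dV.data()), m, w);
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess)
        throw std::runtime_error("matvec_rows kernel failed");

    // Step 4: the reduction runs on the GPU; dereferencing the iterator
    // copies a single float back to the CPU.
    auto it = thrust::max_element(dV.begin(), dV.end());
    float X = *it;
    return X;
}
```

Calling `max_of_matvec(M, W, m, w)` with host-side `std::vector<float>` inputs returns X. The one-thread-per-row kernel is chosen for readability rather than speed; a library routine such as cuBLAS `cublasSgemv` would likely be faster for matrices of this size.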

In the above:

- m ranges from 2 million to 10 million
- w ranges from 100 to 10,000
- The input vector W consists of floating-point numbers between 0.0 and 1.0
- Matrix M, vector V, and the scalar output X all consist of floating-point numbers that may be positive or negative
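
For scale, the sizes above can be turned into a memory estimate. Assuming 32-bit floats, M takes m × w × 4 bytes: about 0.8 GB at the small end (2 million × 100) and about 400 GB at the large end (10 million × 10,000), which likely exceeds the memory of a single GPU. The sketch below does that arithmetic and compares it with the free device memory reported by `cudaMemGetInfo`; the function names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Bytes needed to hold an m x w matrix of 32-bit floats.
std::uint64_t matrix_bytes(std::uint64_t m, std::uint64_t w) {
    return m * w * sizeof(float);
}

// True if M, W and V together fit in the device memory that is currently free.
// Returns false if the memory query itself fails.
bool fits_in_free_memory(std::uint64_t m, std::uint64_t w) {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return false;
    std::uint64_t needed = matrix_bytes(m, w)      // M
                         + w * sizeof(float)       // W
                         + m * sizeof(float);      // V
    return needed <= free_bytes;
}
```

When `fits_in_free_memory` returns false, one option is to process M in row blocks: the maximum over all of V equals the largest of the per-block maxima, so only one block of rows needs to be on the GPU at a time.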