Is there a good algorithm for computing a batched inner product?

I have to compute the inner product <X, Q.X> = X^t.Q.X, where X is a vector of size N and Q is a square matrix of size NxN. N can be quite large (maximum value: 32,000).

I have to do it for a large number of state vectors X (a few thousand), all with the same matrix Q.

This is not a standard matrix multiplication. Is there existing code, or an algorithm, that I can use to speed up this calculation with CUDA?
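To make the computation concrete, here is a minimal CPU reference sketch in plain Python. It assumes the state vectors are stacked as the columns of an N x B matrix, where B is the batch size. The function name `batched_quadratic_form` and the variable names are only illustrative. Although the final result is not a plain matrix product, the sketch suggests one possible way to frame the batch: a single matrix-matrix product Y = Q.X followed by a column-wise dot product of X and Y.

```python
def batched_quadratic_form(Q, X):
    """Return [x_b^T Q x_b for each column x_b of X].

    Q is an N x N matrix and X an N x B matrix, both given as lists of rows.
    Column b of X is the b-th state vector (a layout assumed for this sketch).
    """
    n = len(Q)
    batch = len(X[0])
    # Step 1: Y = Q X, one matrix-matrix product covering the whole batch.
    Y = [[sum(Q[i][k] * X[k][b] for k in range(n)) for b in range(batch)]
         for i in range(n)]
    # Step 2: column-wise dot product of X and Y gives one scalar per vector.
    return [sum(X[i][b] * Y[i][b] for i in range(n)) for b in range(batch)]


# Tiny example: N = 2, three state vectors stored as columns.
Q = [[2.0, 1.0],
     [0.0, 3.0]]
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 2.0]]
print(batched_quadratic_form(Q, X))  # [2.0, 3.0, 16.0]
```

On a GPU, step 1 would presumably map onto a library GEMM call (for example cuBLAS) and step 2 onto an element-wise multiply with a per-column reduction, but whether that is the fastest approach for N up to 32,000 is exactly what I am unsure about.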