Sparse matrix operations inside a CUDA kernel

I need to multiply sparse matrices inside a CUDA kernel, with each thread performing one sparse matrix operation. I know that the cuSPARSE API allows sparse operations to be performed on the device side, but not from inside a kernel.
Eigen provides some functionality that works inside kernels, but not its sparse operations.
I tried a bit with Thrust, using the indices of the non-zero elements of the matrix, but it wasn't straightforward for me. Is there any library that handles sparse matrices inside a kernel? Or would it be possible to use Thrust to this end?
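To make the intent concrete, here is roughly what I imagine each thread doing, hand-rolled since I have not found a library for it. This is only a sketch; the CSR storage layout, the back-to-back batching of the matrices, and all the names are my own assumptions:

```
// Sketch: each thread computes y = A * x for one sparse 1K x 1K matrix.
// The matrices are stored back to back in CSR form; row_ptr holds global
// offsets into the concatenated col_idx/vals arrays.
__global__ void per_thread_spmv(const int*   row_ptr,  // (n+1) entries per matrix
                                const int*   col_idx,  // column index of each nonzero
                                const float* vals,     // nonzero values
                                const float* x,        // dense input vectors, n per matrix
                                float*       y,        // dense output vectors, n per matrix
                                int n, int num_matrices)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // one matrix per thread
    if (m >= num_matrices) return;

    const int*   rp = row_ptr + m * (n + 1);
    const float* xv = x + m * n;
    float*       yv = y + m * n;

    for (int row = 0; row < n; ++row) {
        float sum = 0.0f;
        // only the few nonzeros of this row contribute
        for (int j = rp[row]; j < rp[row + 1]; ++j)
            sum += vals[j] * xv[col_idx[j]];
        yv[row] = sum;
    }
}
```

With one matrix per thread, this would be launched as, e.g., per_thread_spmv<<<(num_matrices + 255) / 256, 256>>>(...).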

That seems to be a very unusual approach. How big is each matrix?

They are about 1K in each dimension, but they are sparse.

What exactly do you mean by “one sparse matrix operation”? Presumably you do not envision each thread handling one 1K x 1K matrix.

That is my whole point: the matrices are sparse, so a vector-matrix multiplication is actually only a few operations and can be handled by a single thread (for example, at 1% density a 1K x 1K matrix has only about 10,000 nonzeros, so one multiplication is on the order of 10K multiply-adds).
I am going to share the algorithm as soon as possible to explain myself better. Thanks for helping, njuffa.

I think I am not helping yet, but merely collecting additional information that may enable others to point you in a reasonable direction.

Purely to generate a performance reference point (a lower bound), it might be useful to use cuBLAS's dense matrix operations to handle the 1K x 1K matrices. cuBLAS is the only non-template library available inside device code, as best I know.
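For instance, even a plain host-launched dense GEMM would provide such a reference number. A minimal sketch, assuming float precision, column-major n x n storage, device-resident data, and a cuBLAS handle created elsewhere:

```
// Sketch of a dense baseline using the host-side cuBLAS API.
// Computes C = A * B for one dense n x n matrix pair already on the device.
#include <cublas_v2.h>

void dense_gemm_baseline(cublasHandle_t handle,
                         const float* dA, const float* dB, float* dC, int n)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // cuBLAS assumes column-major storage; the leading dimension equals n here
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, dA, n, dB, n,
                &beta, dC, n);
}
```

Timing this over a representative batch of matrices gives a number that any sparse in-kernel approach would need to beat.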