Solving Ux = b (U is upper triangular)

How can a system Ux = b that is small enough to reside in shared memory be solved efficiently, given that each element of x depends on all the elements that follow it?

The cuBLAS library provides routines such as cublasStbsv() and cublasStpsv(), but I think they were designed for large systems.
In my problem, x might have only 64 elements, so I expect that writing my own kernel would take better advantage of shared memory.
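
To make the question concrete, here is a minimal sketch of the kind of kernel I have in mind. It is only an illustration under several assumptions that are mine, not cuBLAS behaviour: U is stored dense and row-major, n is at most 64, a single block of n threads handles one system, and the names solveUpperShared, solveUpperOnGpu and MAX_N are hypothetical. Thread i owns row i, and the loop walks the columns from last to first, so each solved x[j] is immediately subtracted from all rows above it.

```cuda
#include <cuda_runtime.h>

constexpr int MAX_N = 64;  // assumed upper bound on the system size

// Solves U x = b for one upper-triangular system with n <= MAX_N.
// Assumes a launch with a single block of n threads; U is dense, row-major, n x n.
__global__ void solveUpperShared(const float* U, const float* b, float* x, int n)
{
    __shared__ float Us[MAX_N][MAX_N + 1];  // +1 column of padding to avoid bank conflicts
    __shared__ float xs[MAX_N];

    const int i = threadIdx.x;  // thread i owns row i

    // Cooperative, coalesced load of U into shared memory.
    for (int k = i; k < n * n; k += blockDim.x)
        Us[k / n][k % n] = U[k];

    float rhs = (i < n) ? b[i] : 0.0f;  // running right-hand side of row i
    __syncthreads();

    // Column-oriented back substitution: one barrier per solved unknown.
    for (int j = n - 1; j >= 0; --j) {
        if (i == j) xs[j] = rhs / Us[j][j];   // row j is now fully reduced
        __syncthreads();
        if (i < j) rhs -= Us[i][j] * xs[j];   // eliminate x[j] from the rows above
    }

    if (i < n) x[i] = xs[i];  // each thread reads back the value it wrote itself
}

// Hypothetical host helper: copies the data to the device, runs the kernel,
// and copies x back. Returns false on a CUDA error or if n is out of range.
bool solveUpperOnGpu(const float* U, const float* b, float* x, int n)
{
    if (n < 1 || n > MAX_N) return false;

    float *dU = nullptr, *db = nullptr, *dx = nullptr;
    const size_t matBytes = sizeof(float) * n * n;
    const size_t vecBytes = sizeof(float) * n;

    bool ok = cudaMalloc(&dU, matBytes) == cudaSuccess
           && cudaMalloc(&db, vecBytes) == cudaSuccess
           && cudaMalloc(&dx, vecBytes) == cudaSuccess
           && cudaMemcpy(dU, U, matBytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(db, b, vecBytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        solveUpperShared<<<1, n>>>(dU, db, dx, n);
        // The blocking copy also surfaces any error raised by the kernel.
        ok = cudaMemcpy(x, dx, vecBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dU);
    cudaFree(db);
    cudaFree(dx);
    return ok;
}
```

Even in this form the solve still needs n sequential steps with a barrier between them, which is exactly the dependency I am asking about. A single 64-thread block would also leave most of the GPU idle, so I assume a real version would have to solve many such systems at once, one per block.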