Unexpectedly slow performance

I wrote a program (repo here https://github.com/Yagigami/cuda_learning) in CUDA to try to learn how to use my graphics card for heavy parallel computing. I have a GeForce GTX 1050 Ti (so it can theoretically reach up to around 66 GFLOP/s in double precision and 2 TFLOP/s in single).
With the program I wrote, I barely reach about 5 MFLOP/s, and my CPU is actually faster for now (1.4 s for the same input).
So my question is: what am I doing so wrong that my program runs this slowly?
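For reference, here is a minimal sketch of how a GFLOP/s figure like the ones above can be measured with CUDA events. The `axpy` kernel and the helpers `to_gflops` and `measure_axpy_gflops` are hypothetical stand-ins for illustration, not the code from the repo, and the count of 2 FLOP per element is an assumption that only holds for this toy kernel.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: one multiply and one add (2 FLOP) per element.
__global__ void axpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Convert a FLOP count and an elapsed time in milliseconds to GFLOP/s.
double to_gflops(double flop_count, double elapsed_ms)
{
    return flop_count / (elapsed_ms * 1.0e6);
}

// Time one launch of the stand-in kernel with CUDA events.
// Returns the measured GFLOP/s, or a negative value if a CUDA call fails.
double measure_axpy_gflops(int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(double);
    double *x = nullptr, *y = nullptr;
    if (cudaMalloc(&x, bytes) != cudaSuccess)
        return -1.0;
    if (cudaMalloc(&y, bytes) != cudaSuccess) {
        cudaFree(x);
        return -1.0;
    }
    cudaMemset(x, 0, bytes);
    cudaMemset(y, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Warm-up launch so one-time initialization cost is not timed.
    axpy<<<grid, block>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();

    // Events are recorded on the same stream as the kernel, so the
    // elapsed time covers only the kernel execution.
    cudaEventRecord(start);
    axpy<<<grid, block>>>(n, 2.0, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);

    if (err != cudaSuccess || ms <= 0.0f)
        return -1.0;
    return to_gflops(2.0 * n, ms);
}
```

Calling `measure_axpy_gflops(1 << 24)` from `main` and printing the result gives a number comparable to the figures quoted above. This toy kernel is limited by memory bandwidth, so it is not expected to get anywhere near the theoretical peak.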
I tried a sample (“matrixMul”) and it was much faster (295 GFLOP/s in single precision), even though I did not notice any major difference from my program, aside from a #pragma unroll, which I also tried and tuned in my code without success.
I also fine-tuned the block size and grid size to get the most performance, but the program is still very slow.
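As an alternative to tuning the launch configuration by hand, the runtime can suggest a block size through `cudaOccupancyMaxPotentialBlockSize`. The sketch below shows the idea on a hypothetical `scale` kernel; the kernel and the helpers `blocks_needed` and `launch_scale` are illustrative names, not part of the repo, and a high-occupancy block size does not by itself guarantee good performance.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: multiply every element by a constant.
__global__ void scale(int n, double a, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] *= a;
}

// Number of blocks needed to cover n elements (ceiling division).
int blocks_needed(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

// Launch the stand-in kernel with a block size suggested by the runtime.
// d_y is assumed to be a device pointer to at least n doubles.
// Returns 0 on success and 1 if a CUDA call reports an error.
int launch_scale(int n, double a, double *d_y)
{
    int min_grid = 0;  // smallest grid that can reach full occupancy (unused here)
    int block = 0;     // suggested block size for this kernel
    cudaError_t err =
        cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scale, 0, 0);
    if (err != cudaSuccess)
        return 1;

    scale<<<blocks_needed(n, block), block>>>(n, a, d_y);
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

The grid size here comes from the problem size rather than from `min_grid`, which is the usual choice when each thread handles exactly one element.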
When testing, I also tried shutting down every other program, but that did not help either.
With some testing I found that the slowdown happens in the loop inside partial_sum, which should only use registers, so I do not see how bad memory usage could cause it.
I hope you can help with that!