Template matching of a small 3D matrix in a 300x300x4000 3D cube

The title says it all. I have huge data cubes of several GB each (which I could cut into subcubes), and I search them for matches with a much smaller template (roughly 30x30x30).
I am new to CUDA and am testing on an old card with compute capability 1.1
(if this works, I can easily upgrade).
Are there any known libraries or, even better, examples of such 3D template matching / 3D cross-correlation?
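
To make the question concrete, here is a minimal CPU sketch of the operation I have in mind: plain (unnormalized) brute-force 3D cross-correlation over the valid region. The names `idx3` and `xcorr3d` and the flat x-fastest memory layout are my own assumptions for illustration, not taken from any library.

```c
#include <stddef.h>

/* Flat index into a volume stored x-fastest, then y, then z (assumed layout). */
static size_t idx3(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* Brute-force 3D cross-correlation, valid region only.
 * vol: nx * ny * nz voxels, tpl: tx * ty * tz voxels (template fits in vol).
 * out: (nx-tx+1) * (ny-ty+1) * (nz-tz+1) scores, same layout as vol. */
void xcorr3d(const float *vol, size_t nx, size_t ny, size_t nz,
             const float *tpl, size_t tx, size_t ty, size_t tz,
             float *out)
{
    size_t ox = nx - tx + 1, oy = ny - ty + 1, oz = nz - tz + 1;

    for (size_t z = 0; z < oz; ++z)
    for (size_t y = 0; y < oy; ++y)
    for (size_t x = 0; x < ox; ++x) {
        float sum = 0.0f;
        /* Sum of products of the template with the window at (x, y, z). */
        for (size_t k = 0; k < tz; ++k)
        for (size_t j = 0; j < ty; ++j)
        for (size_t i = 0; i < tx; ++i)
            sum += vol[idx3(x + i, y + j, z + k, nx, ny)]
                 * tpl[idx3(i, j, k, tx, ty)];
        out[idx3(x, y, z, ox, oy)] = sum;
    }
}
```

With a 30x30x30 template this costs 27,000 multiply-adds per output voxel, which is why I am hoping for a GPU (or FFT-based) solution.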

Any help is highly appreciated.