Leveraging RTX hardware capabilities with OptiX 7.0

That is correct.
The RT cores inside the RTX Turing boards accelerate two parts of ray tracing: the BVH traversal and the ray-triangle intersection.

Both will run fully on the RT cores (in contrast to running on the Streaming Multiprocessors (SMs)) when the BVH hierarchy has two levels. That means a maximum of two acceleration structures (AS) from the root node to the leaf triangle geometry, which for the scene structure means an Instance AS (IAS) on top of Geometry AS (GAS) containing triangles. The transforms in the instances are hardware accelerated.
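To make the "maximum of two acceleration structures from root to leaf" rule concrete, here is a minimal host-side sketch that counts the AS levels of a scene graph. The `Node` type and both functions are hypothetical names made up for illustration; they are not part of the OptiX API, and a real application would hold OptixTraversableHandle values instead.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model of a traversable graph node (NOT an OptiX type).
struct Node {
    bool isGeometry;                    // true: GAS (leaf), false: IAS
    std::vector<const Node*> children;  // traversables referenced by an IAS
};

// Number of acceleration structures on the longest path from n down to a leaf.
int traversalDepth(const Node& n) {
    if (n.isGeometry) {
        return 1;  // a GAS is a single level
    }
    int deepest = 0;
    for (const Node* child : n.children) {
        deepest = std::max(deepest, traversalDepth(*child));
    }
    return 1 + deepest;  // the IAS itself plus the deepest subtree
}

// True when the graph has at most two AS levels from root to leaf geometry.
bool fitsTwoLevelLayout(const Node& root) {
    return traversalDepth(root) <= 2;
}
```

With this model, an IAS whose instances all reference GAS directly has depth 2 and passes the check, while an IAS that references another IAS has depth 3 and does not. In an actual OptiX 7 program, the expected graph layout is, as far as I know, also declared up front through OptixPipelineCompileOptions::traversableGraphFlags, so treat the check above as a sanity test of your own scene data only.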

That’s basically all, but OptiX also supports a scene consisting of only a single GAS, as well as multiple IAS levels. The latter will run the BVH traversal only partially on the RT cores. (The overall maximum traversal depth can be queried via OPTIX_DEVICE_PROPERTY_LIMIT_MAX_TRAVERSABLE_GRAPH_DEPTH.)
Also, whenever there is motion blur inside the scene, in the instances or the geometry, the BVH traversal will become more expensive.
For custom geometric primitives, only the BVH traversal is hardware accelerated; the traversal then calls back into your intersection shaders running on the SMs.
Other things call back into the SMs as well, like anyhit programs.
Have a closer look at the available optixTrace() flags, which can control some of the program domain invocations.

I’d recommend watching the “OptiX Performance Tools and Tricks” presentation for more information:
https://devtalk.nvidia.com/default/topic/1062216/optix/optix-talks-from-siggraph-2019/

Not really. It happens automatically if you build the scene according to the structure above.

Mind that a GTX 960 is an entry-level board from an architecture three GPU generations older. Whatever you throw at it, its performance will be far from representative of what even the smallest RTX board can do.