Some questions about ray

First of all, why is your current code stepping along a fixed ray.origin.y increment of 1.0 units? (Also avoid doubles, use 1.0f instead.)
Stepping with a fixed increment doesn’t make sense when you want to gather all intersections along a ray and that number can vary.

After shooting the primary ray with a zero offset, the continuation rays after a triangle intersection should step with the intersection distance plus some scene dependent epsilon to avoid self-intersections with the hit triangle.
For that you return the positive floating point intersection distance you query with optixGetRayTmax() inside the closesthit program inside the payload register and return -1.0f for miss.
That would automatically skip all empty space between triangle intersections along the ray direction and only shoot as many rays as there are intersections along that ray, until you reach the miss program.
Something like this:

const float intersectionDistance = __uint_as_float(p0);
if (intersectionDistance >= 0.0f) // Hit?
{
  ray_origin.y += intersectionDistance + sceneEpsilon; // Skip empty space and the hit triangle.
  // For arbitrary directions actually: ray_origin += ray_direction * (intersectionDistance + sceneEpsilon);
  // This might need some vector operator overloads you can find for example in https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/rtigo10/shaders/vector_math.h
  ++num;
}
else // Negative intersectionDistance means miss program reached: End of path.
{
  break;
}

You could also just keep the ray orgin and direction intact and only increase the ray.tmin value and keep the ray.tmax the same. Just make sure ray.tmin is always less than ray.tmax, otherwise you’ll get invalid ray exceptions.
This would generate smaller ray intervals with each step which should speed up the BVH traversal.

Also that get_neighbours() function is called by your __raygeneration__ program?
Then it should be put into the same module as that and not use extern "C".
__forceinline__ __device__ void get_neighbors would be enough then.

Please get this working perfectly before trying other solutions.

The BVH traversal and ray-triangle intersection would be fully hardware accelerated by the RT cores on RTX boards. Current high-end boards can reach well over 10 GRays/second doing that, which would normally only be achieved with comparably simple scenes and user device programs which do very few memory accesses.
The maximum rays/second performance is usually limited my the memory bandwidth on RTX boards, so you should first try if this iterative approach is good enough when done correctly.

As explained above,
The other device function which influence what happens with the current ray traversal is optixIgnoreIntersection. That causes the current potential intersection to be discarded, so t_max is not changed and the ray traversal continues.

That means if you want to gather all intersections along a ray inside an anyhit program, you would need to store the intersection data you need inside the anyhit program and call optixIgnoreIntersection at the end.
That way the closest hit program will never be called and the ray ends when reaching the miss program.

The problem with this approach is, that the anyhit program will interrupt the hardware BVH travseral on RTX boards and call back into your anyhit program for each potential intersection.
Additionally intersection and anyhit programs are not called in ray direction but in BVH traversal order, so you would need to sort your gathered interersections by intersection distance if you need them in ray direction order
The iterative approach above could store each hit into a global buffer inside the raygeneration program.

If you actually need the intersection information (distance, triangle ID, etc) and not only the number of intersections along the ray, then you would need to have room for all intersection results per ray.
That’s either a buffer allocated a-priori and written for each intersection inside the ray generation with that iterative approach above, or some per ray payload structure which is provided as 64-bit pointer in two 32-bit payload register. Both might result in rather large memory requirements depending on your launch dimensions and number of maximum possible intersections per ray.

This has been discussed multiple times on this forum already.
Please have a look especially at these threads:
https://forums.developer.nvidia.com/t/ray-mesh-intersections/170090/5
https://forums.developer.nvidia.com/t/whats-your-solution-to-get-all-hit-primitives-of-multiple-rays/239528/2
and at all forum threads found by the search link in this post:
https://forums.developer.nvidia.com/t/can-i-not-enter-the-miss-shader/246130/2