SIFT with CUDA

Good afternoon. I am trying to implement the SIFT detector (https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_sift_intro/py_sift_intro.html) with CUDA. I am aware that there is an open-source implementation on GitHub, but I am doing this for learning purposes.

My code runs, but I have doubts about some of the implementation choices I have made, and I would be grateful if somebody more experienced with CUDA could give me some practical advice (answering these questions will probably require understanding how SIFT works). Thank you very much in advance :)

  1. I am not using textures in CUDA. I have read about them, but would they really be beneficial for this application? I guess I could bind the input image as a read-only texture. But the other arrays (i.e. the SIFT scale-space pyramid) are much bigger than the input image and are not read-only. So would making only the input image read-only help performance? Should I “transform” my arrays into textures once they are filled, given that they stay read-only for the rest of the program?

  2. I am following an article for thread allocation - https://www.researchgate.net/publication/269302930_Parallelization_and_Optimization_of_SIFT_on_GPU_Using_CUDA. Does it look good? (Looking at the images labelled “threads allocating” is enough to answer this question; there is no need to read the whole article.)

  3. That article does not say anything about scale-space octaves, so it just allocates threads for a single level in a single octave at a time. What would be the best way of adding octaves:
    a) Every level is in a separate thread?
    b) The whole octave is processed in one thread and the kernel is called sequentially for each octave? (This is how I have it now.)
    c) All the scale-space is processed in the same thread?
    d) Some other option I cannot come up with?
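
To make option b) more concrete, here is a minimal sketch of one way it could be structured. It assumes each octave is stored as a contiguous stack of blurred levels, that every octave is half the size of the previous one, and it reads "one thread" as "one thread per pixel that loops over all levels of the octave". The names `dogOctave`, `dogAllOctaves`, `divUp` and `octaveDim` and the buffer layout are made up for illustration; this is not the actual implementation, and it only covers the difference-of-Gaussians step.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per pixel, looping over all levels of ONE octave.
// `gauss` holds `levels` blurred images back to back (assumed layout);
// `dog` receives the `levels - 1` differences of adjacent levels.
__global__ void dogOctave(const float* gauss, float* dog,
                          int width, int height, int levels)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard threads outside the image

    size_t plane = (size_t)width * height;   // pixels in one level
    size_t pixel = (size_t)y * width + x;
    for (int s = 0; s + 1 < levels; ++s)
        dog[s * plane + pixel] = gauss[(s + 1) * plane + pixel]
                               - gauss[s * plane + pixel];
}

// Integer ceiling division, used to size the grid.
inline int divUp(int a, int b) { return (a + b - 1) / b; }

// Side length of octave `octave`, assuming each octave halves the previous one.
inline int octaveDim(int base, int octave)
{
    int d = base >> octave;
    return d > 0 ? d : 1;
}

// Option b): the host launches the kernel once per octave, one after another.
// gauss[o] and dog[o] are device pointers for octave o, stored in host arrays.
void dogAllOctaves(float* const* gauss, float* const* dog,
                   int baseWidth, int baseHeight, int octaves, int levels)
{
    const dim3 block(16, 16);
    for (int o = 0; o < octaves; ++o) {
        int w = octaveDim(baseWidth, o);
        int h = octaveDim(baseHeight, o);
        dim3 grid(divUp(w, (int)block.x), divUp(h, (int)block.y));
        dogOctave<<<grid, block>>>(gauss[o], dog[o], w, h, levels);
    }
    cudaDeviceSynchronize();   // wait for all octaves before using the results
}
```

In this sketch the launches for different octaves go into the same default stream, so they execute one after another. Option a) would instead correspond to something like indexing the level with `blockIdx.z` rather than looping over levels inside each thread.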

Thank you for reading.