New to CUDA, need some advice on parallelizing this Julia Set renderer.

I'm a long-time developer and have just begun programming for my GPU. I thought a Julia set renderer would be a good exercise. My program works; however, I have seen YouTube videos from years ago of people using CUDA to render Mandelbrot sets at a higher resolution, with more iterations, and at a more stable frame rate.
The parallelization method I use in this program is:

  1. Create one thread for each Y pixel (row).
  2. In each thread, loop over every X pixel in that row.
  3. Do the complex number magic for that pixel until either the iteration limit is reached
    or the value diverges.
  4. Put the iteration count into the shared memory array.

The program is definitely faster on the GPU than on the CPU.
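
In rough terms, the scheme above looks like the following. This is a simplified CPU-side sketch, not the linked source: every name (`juliaIterations`, `renderRow`, `renderJulia`), the [-2, 2] viewport, and the escape radius of 2 are assumptions made for illustration. In the CUDA version, `renderRow` would be the kernel body and `y` would come from `blockIdx.x * blockDim.x + threadIdx.x` rather than from a host loop.

```cpp
#include <complex>
#include <vector>

// Step 3: iterate z -> z*z + c until |z| exceeds 2 or the limit is reached.
// (Hypothetical helper; the escape radius of 2 is an assumption.)
int juliaIterations(std::complex<double> z, std::complex<double> c, int maxIter) {
    int i = 0;
    while (i < maxIter && std::norm(z) <= 4.0) {  // std::norm is |z|^2
        z = z * z + c;
        ++i;
    }
    return i;
}

// Steps 1, 2 and 4: the work one "thread" does for row y.
void renderRow(int y, int width, int height, std::complex<double> c,
               int maxIter, std::vector<int>& out) {
    if (y >= height) return;  // guard for threads past the last row
    for (int x = 0; x < width; ++x) {
        // Map the pixel centre into the assumed [-2, 2] x [-2, 2] viewport.
        double re = -2.0 + 4.0 * (x + 0.5) / width;
        double im = -2.0 + 4.0 * (y + 0.5) / height;
        out[y * width + x] =
            juliaIterations(std::complex<double>(re, im), c, maxIter);
    }
}

// Host stand-in for the kernel launch: one renderRow call per Y pixel.
std::vector<int> renderJulia(int width, int height, std::complex<double> c,
                             int maxIter) {
    std::vector<int> out(static_cast<std::size_t>(width) * height, 0);
    for (int y = 0; y < height; ++y) {
        renderRow(y, width, height, c, maxIter, out);
    }
    return out;
}
```

Written this way, the amount of parallelism is the image height: a 1080-row frame gives only 1080 independent work items, each of which walks a whole row serially.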

Link to source code (C++)
Can somebody give me some advice on how to speed up my algorithm?