New to CUDA, need some advice on parallelizing this Julia Set renderer.

I'm a long-time developer and have just begun programming for my GPU. I thought a Julia Set renderer would be a nice exercise. My program does work, but I have seen YouTube videos from years ago of people using CUDA to render Mandelbrot Sets at a higher resolution, with more iterations, and at a more stable frame rate.
The parallelization method I use in this program is:

  1. Create a thread for each Y pixel.
  2. Loop over each X pixel in that row
  3. Complex number magic on that pixel's coordinates until either the iteration limit is reached
    or the value diverges
  4. Put the iteration count into the shared memory array

The program is definitely faster than on the CPU.
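
Roughly, the steps listed above look like the sketch below. This is a simplified illustration, not the exact code from the Pastebin: the names `juliaIterations`, `renderRow` and `juliaKernel`, the constant `c` passed as `(cx, cy)`, and the [-2, 2] x [-2, 2] view window are all placeholders, and I'm assuming the output array is a plain buffer indexed by pixel. The macros at the top are only there so the same file also compiles with a regular C++ compiler.

```cpp
// Let the same code compile with a plain C++ compiler as well as nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Step 3: iterate z = z^2 + c until |z| > 2 or the iteration limit is reached.
__host__ __device__ inline int juliaIterations(float zx, float zy,
                                               float cx, float cy, int maxIter) {
    int i = 0;
    while (i < maxIter && zx * zx + zy * zy <= 4.0f) {
        float t = zx * zx - zy * zy + cx;  // real part of z^2 + c
        zy = 2.0f * zx * zy + cy;          // imaginary part of z^2 + c
        zx = t;
        ++i;
    }
    return i;
}

// Steps 2 and 4: the work done by one thread, i.e. one full row of pixels.
__host__ __device__ inline void renderRow(int y, int width, int height,
                                          float cx, float cy, int maxIter, int* out) {
    for (int x = 0; x < width; ++x) {
        // Map the pixel to the placeholder window [-2, 2] x [-2, 2].
        float zx = -2.0f + 4.0f * x / width;
        float zy = -2.0f + 4.0f * y / height;
        out[y * width + x] = juliaIterations(zx, zy, cx, cy, maxIter);
    }
}

#ifdef __CUDACC__
// Step 1: one thread per Y pixel (row).
__global__ void juliaKernel(int width, int height,
                            float cx, float cy, int maxIter, int* out) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < height) {
        renderRow(y, width, height, cx, cy, maxIter, out);
    }
}
#endif
```

With this layout a frame of height H only ever has H threads in flight, and each of them walks a whole row serially.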

Link to source code (C++): Pastebin.com
Can somebody give me some advice on how to speed up my algorithm?