Please help, project is making me desperate and angry

I need help with my CUDA project, an implementation of the AES algorithm. It is starting to make me depressed.
The project is done and it works correctly, but not as planned, and I don’t know how to fix it. The problem is the execution time. If someone could help I would be very grateful, and I am even willing to pay for help because this is very important to me.
Let me explain in more detail.
There are two implementations, one for the CPU and one for the GPU. Both work correctly, but encryption in the GPU implementation takes too long, and I don’t know where the problem is.
Here is LINK to project

I would like to know whether my implementation is good. Does it follow the rules and criteria for good parallel execution? What needs to be changed for faster execution, which memory should I use for better performance, and so on? This is my first CUDA project.

Both implementations include the files 1m.txt, 2m.txt … 64m.txt, containing 1 million, 2 million … 64 million characters for encryption/decryption.
The files 1m_info, 2m_info and so on hold information such as the number of characters and the time spent, and the files encrypted_text/decrypted_text are for verifying the encryption/decryption.
Please help me, I have been struggling with this for 3 months.
I am happy to provide any further information.

For someone doing academic work, you don’t seem to apply a lot of methodology to solving your problem.
From your previous postings I see that you’re implementing AES encryption in ECB mode.

What hardware are you benchmarking on (both the CPU and the GPU)?

Are you sure that you’ve taken benchmarking results in a Release build of your projects? Debug builds can run orders of magnitude slower than a release build.

Have you attempted to run the CUDA profiler (nvvp or cudaprof or the built-in profiling capabilities of NVIDIA Nsight) to see where the bottlenecks in the CUDA implementation are?

Are you aware of the typical PCI express speed limitations when copying large arrays of data? Are you including the time to perform required memory copies of the data in your runtime measurements, or did you just compare the runtime of the actual encryption kernel with the CPU implementation?
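
To make the last question concrete, here is a minimal sketch of how the copies and the kernel could be timed separately with CUDA events. It is not your code: `dummy_encrypt` is a hypothetical XOR placeholder standing in for your AES kernel, `time_gpu_pipeline` is a name I made up, and the 64 million byte buffer only mimics the size of your 64m.txt. It also assumes ordinary pageable host memory and one thread per 16-byte block.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s (line %d)\n",                \
                    cudaGetErrorString(err_), __LINE__);                 \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Hypothetical placeholder kernel (NOT AES): each thread XORs one
// 16-byte block, mimicking the independent blocks of ECB mode.
__global__ void dummy_encrypt(unsigned char *data, size_t num_blocks)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < num_blocks) {
        for (int b = 0; b < 16; ++b)
            data[i * 16 + b] ^= 0x5A;
    }
}

// Processes host[0..n) in place on the GPU and reports, in milliseconds:
// ms[0] = host-to-device copy, ms[1] = kernel, ms[2] = device-to-host copy.
static void time_gpu_pipeline(unsigned char *host, size_t n, float ms[3])
{
    assert(n > 0 && n % 16 == 0);
    const size_t num_blocks = n / 16;
    const int threads = 256;
    const unsigned int grid = (unsigned int)((num_blocks + threads - 1) / threads);

    unsigned char *dev = NULL;
    CHECK(cudaMalloc(&dev, n));

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i) CHECK(cudaEventCreate(&ev[i]));

    CHECK(cudaEventRecord(ev[0], 0));
    CHECK(cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(ev[1], 0));
    dummy_encrypt<<<grid, threads>>>(dev, num_blocks);
    CHECK(cudaGetLastError());            // catches launch configuration errors
    CHECK(cudaEventRecord(ev[2], 0));
    CHECK(cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(ev[3], 0));
    CHECK(cudaEventSynchronize(ev[3]));   // wait until all the work has finished

    for (int i = 0; i < 3; ++i)
        CHECK(cudaEventElapsedTime(&ms[i], ev[i], ev[i + 1]));

    for (int i = 0; i < 4; ++i) CHECK(cudaEventDestroy(ev[i]));
    CHECK(cudaFree(dev));
}

int main()
{
    const size_t n = 64u * 1000u * 1000u;  // 64 million bytes, a multiple of 16
    unsigned char *host = (unsigned char *)malloc(n);
    if (host == NULL) return EXIT_FAILURE;
    memset(host, 'a', n);

    float ms[3];
    time_gpu_pipeline(host, n, ms);
    printf("copy to device:   %.3f ms\n", ms[0]);
    printf("kernel:           %.3f ms\n", ms[1]);
    printf("copy from device: %.3f ms\n", ms[2]);
    printf("total:            %.3f ms\n", ms[0] + ms[1] + ms[2]);

    free(host);
    return 0;
}
```

Build it with optimizations (for example `nvcc -O3`) and compare the total, not just the kernel line, against the CPU time. If the two copies dominate, the PCI Express transfer is the bottleneck rather than the kernel itself.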

I am not a professor or anything like that :) if that is what you meant. This is my final project in college. The problem is that it is my first contact with parallel programming and its methodology. It is a whole new concept and a new way of thinking about code.

I am benchmarking on both the GPU and the CPU.

You have already helped me with something. The whole time I was running a Debug build. When I switched to a Release build it became a lot faster, about 5x. Thanks for this.

I have only used the CUDA debugger to find errors during implementation. Could you explain how to use one of those profilers?

I am measuring just time of kernel execution, so time needed for copy to/from memory is not measured.

As you can see, I am a newbie at this, so feel free to tell me whatever you can think of :)