Data on the device differs from the host data after cudaMemcpy

I am facing multiple issues when dealing with a large number of bits. The operation I am doing is encoding as per the DVB-S2X standard.
1. When I copy an array of bits of length 64800 into device memory from the host using cudaMemcpy(), the data on the device is different from what I am passing from the host.

2. The same code, without any edits and without any dependencies between threads, is giving different output for the same set of input. What could be the reason?

3. Is there a limit on the number of print statements from a kernel? I am launching 32400 threads (in suitable combinations of grids and blocks) and there is a print statement in the kernel, but instead of 32400 lines of output, I am getting fewer.
  1. Yes, there is a limit on the amount of printf output from a kernel. Device-side printf writes into a fixed-size buffer, and output beyond its capacity is lost.
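
To illustrate the printf limit, here is a minimal sketch of how the buffer size can be queried and raised with `cudaDeviceGetLimit` and `cudaDeviceSetLimit` using `cudaLimitPrintfFifoSize`. The kernel `printIds`, the helper `blocksFor`, the block size of 256 and the 16 MB buffer size are all assumptions made for the example, not taken from the original code. The limit has to be set before the kernel is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every in-range thread prints one line.
__global__ void printIds(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) printf("thread %d\n", i);
}

// Hypothetical helper: number of blocks needed to cover n threads.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 32400;      // thread count from the question
    const int threads = 256;  // assumed block size

    // Query the current size of the device-side printf buffer.
    size_t fifoBytes = 0;
    cudaDeviceGetLimit(&fifoBytes, cudaLimitPrintfFifoSize);
    printf("printf buffer before: %zu bytes\n", fifoBytes);

    // Enlarge the buffer (16 MB is an arbitrary choice for this sketch).
    cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize,
                                         (size_t)16 * 1024 * 1024);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printIds<<<blocksFor(n, threads), threads>>>(n);

    // Buffered printf output is flushed at synchronization points.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

If lines are still missing after enlarging the buffer, writing the per-thread values into a device array and copying that array back to the host is usually a more reliable way to inspect them than printf.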