Calculate the time taken to run an algorithm on a GPU

I want to calculate the total time taken to run a fixed piece of code on an NVIDIA GPU (for instance, a Tesla K40). The code has to run 1 million single-bit comparisons, and all of the comparisons are independent of each other. There is no memory access during execution, since the complete data (2 bits per core) is already stored in the cores' local registers when the code runs. So, effectively, the code can run x comparison operations in parallel in a single instruction cycle, where x is the maximum number of parallel comparisons the GPU can execute (I am not sure whether this is the number of cores or the number of threads). A minimal sketch of the kernel I have in mind is included after the list below. Using the datasheet of the NVIDIA Tesla K40, I want to understand how to calculate the following:

  1. x
  2. Time taken for each instruction cycle (for a single-bit comparison)
  
  3. Total time taken for 1 million single-bit comparisons
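
To make the scenario concrete, here is a minimal sketch of the kind of kernel I have in mind, together with how I would measure its run time using CUDA events. The kernel name `compare_bits`, the array layout (one byte per bit), and the launch configuration are my own illustrative assumptions, not a finished implementation; unlike the ideal case described above, this sketch does read its inputs from global memory, since I don't know how to express the "everything already in registers" setup in a self-contained example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N (1 << 20)  // roughly 1 million independent single-bit comparisons

// Each thread performs exactly one single-bit comparison.
// For simplicity the bits are stored one per byte; in the scenario
// described above they would already sit in registers.
__global__ void compare_bits(const unsigned char *a,
                             const unsigned char *b,
                             unsigned char *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = (a[i] == b[i]);  // the single-bit comparison
}

int main(void)
{
    unsigned char *a, *b, *out;
    cudaMalloc(&a, N);
    cudaMalloc(&b, N);
    cudaMalloc(&out, N);
    // (Data initialization omitted for brevity; only timing matters here.)

    // Time the kernel with CUDA events (measures GPU execution only).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;

    cudaEventRecord(start);
    compare_bits<<<blocks, threads>>>(a, b, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %f ms\n", ms);

    cudaFree(a); cudaFree(b); cudaFree(out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

My understanding is that `cudaEventElapsedTime` would give the measured wall-clock time of the kernel in milliseconds, which is the number I would like to compare against the value for item 3 derived from the datasheet.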