How to know where the bottleneck is?

How can I know if my kernels spend most of their time accessing global memory or performing computations? I have tried the CUDA Profiler, but it only provides information about the time spent by each kernel…

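Do some maths: if your reads are coalesced you can achieve about 70 GiB/sec. If your kernel's effective bandwidth (bytes read plus bytes written, divided by the kernel time) is close to that figure, memory is the bottleneck.
Alternatively, you can add some extra computations and check whether the running time of your kernel changes. If it stays about the same, the kernel is limited by memory rather than by arithmetic.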
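
A minimal sketch of that arithmetic, in Python for brevity: divide the bytes the kernel moves by its measured time and compare the result with the peak figure. The function names, the 0.7 cut-off and the sample numbers are assumptions for illustration only; take the kernel time from the profiler.

```python
GIB = 1024 ** 3  # bytes per GiB

def effective_bandwidth_gib_s(bytes_read, bytes_written, seconds):
    """Bandwidth the kernel actually achieved, in GiB/s."""
    return (bytes_read + bytes_written) / seconds / GIB

def looks_memory_bound(effective, peak, threshold=0.7):
    """Heuristic: call the kernel memory bound when it reaches at least
    `threshold` of the peak bandwidth (0.7 is an arbitrary cut-off)."""
    return effective / peak >= threshold

# Hypothetical kernel: reads 256 MiB, writes 256 MiB, takes 10 ms.
eff = effective_bandwidth_gib_s(256 * 2**20, 256 * 2**20, 0.010)
print(f"{eff:.1f} GiB/s, memory bound: {looks_memory_bound(eff, peak=70.0)}")
```

Replace the 70.0 with whatever peak your own card really sustains.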

OK, so let’s say, for example, that my kernel performs the following steps:

-Read X1 MB of data.
-Process X1 MB of data.
-Write X1 MB of data.
-Read X2 MB of data.
-Process X2 MB of data.
-Write X2 MB of data.

If all my accesses to global memory are coalesced and my card provides a memory bandwidth of BW (in MB per second), I guess that the time each kernel invocation spends accessing memory could be estimated with the following formula:

T = (2·X1 + 2·X2)/BW
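
As a quick sanity check of that formula, here is a small Python sketch. The sizes (X1 = 64 MB, X2 = 32 MB) are made-up values for illustration, the function name is hypothetical, and the two BW values are the figures quoted below for my card, taking 1 GB as 1000 MB.

```python
def memory_time_s(x1_mb, x2_mb, bw_mb_per_s):
    """T = (2*X1 + 2*X2) / BW: each chunk is read once and written once."""
    return (2 * x1_mb + 2 * x2_mb) / bw_mb_per_s

# Made-up sizes: X1 = 64 MB, X2 = 32 MB, at 12,800 MB/s and 22,400 MB/s.
for bw in (12_800, 22_400):
    t = memory_time_s(64, 32, bw)
    print(f"BW = {bw} MB/s -> T = {t * 1e3:.2f} ms")
```

This would only be a lower bound, since it assumes every access runs at the full BW.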

Am I right? Also, are constant memory accesses faster too if they are coalesced?
I suppose that those 70 GB/sec are device dependent, right? I have a GeForce 8600M GT which, as far as I know (from Wikipedia), can give up to 12.8 or 22.4 GB/sec, but not 70 GB/sec.

Constant memory access is fast if all threads in a warp access the same memory location. In other cases, using textures may be a better option.

Yes, the 8600M has much lower bandwidth. Take a look at the bandwidthTest sample in the SDK, which will show you the actual bandwidth of your device.
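
For a rough theoretical figure to compare with what bandwidthTest reports, the peak can be estimated from the memory clock and the bus width. A small Python sketch follows; the 128-bit bus and the 400 MHz / 700 MHz memory clocks are my assumption about what lies behind the two Wikipedia numbers, so check the actual specs of your card.

```python
def peak_bandwidth_gb_s(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak in GB/s (1 GB = 1e9 bytes).
    transfers_per_clock=2 models double-data-rate memory (DDR2/GDDR3)."""
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

# Assumed 8600M GT configurations: 128-bit bus, 400 MHz or 700 MHz memory clock.
for clock in (400, 700):
    print(f"{clock} MHz -> {peak_bandwidth_gb_s(clock, 128):.1f} GB/s")
```

The measured bandwidth will come out somewhat lower than this theoretical peak.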