Number of Threads

I ran my kernel with two launch configurations: 64 threads per block and 256 threads per block. The run with 64 threads per block was faster than the run with 256 threads per block. Could someone give me a detailed explanation of why this happens?
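To make the comparison concrete, here is a minimal host-side sketch of how the two configurations differ in grid shape. It assumes a 1D grid with one thread per element, which may not match my actual kernel; `launch_shape`, `LaunchShape`, and the problem size used in the note below are hypothetical names and values chosen only for illustration. It shows only the block and warp counts, not the cause of the timing difference.

```cpp
#include <cassert>
#include <cstdint>

// Warp size on current NVIDIA GPUs.
constexpr std::uint32_t kWarpSize = 32;

// Hypothetical container for the shape of a 1D launch.
struct LaunchShape {
    std::uint32_t blocks;           // number of blocks in the grid
    std::uint32_t warps_per_block;  // warps each block is split into
};

// Hypothetical helper: assumes one thread per element and a 1D grid.
inline LaunchShape launch_shape(std::uint32_t n, std::uint32_t threads_per_block) {
    assert(threads_per_block > 0);
    // Widen to 64 bits so the round-up cannot overflow.
    const std::uint64_t total = static_cast<std::uint64_t>(n) + threads_per_block - 1;
    LaunchShape s;
    s.blocks = static_cast<std::uint32_t>(total / threads_per_block);       // ceil(n / tpb)
    s.warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;    // ceil(tpb / 32)
    return s;
}
```

For an assumed problem size of `1u << 20` elements, `launch_shape(1u << 20, 64)` gives 16384 blocks of 2 warps each, while `launch_shape(1u << 20, 256)` gives 4096 blocks of 8 warps each. The total number of warps is the same in both cases; what changes is how they are grouped into blocks.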
Thanks a lot.