Memory usage for a 256-bin histogram

I want to implement a histogram with 256 bins, but there is not enough shared memory if I set the number of threads to more than 64, and so few threads is not an efficient execution configuration. Of course, using global memory is simple: there is no need to worry about shared memory conflicts or memory access conflicts, but it is much slower. Can anyone give me some advice on this?
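
To make the trade-off concrete, here is a minimal sketch of one commonly used layout, not a definitive implementation: a single 256-bin histogram per block in shared memory (256 x 4 bytes = 1 KB, whatever the thread count), updated with atomicAdd and merged into global memory at the end. It assumes a GPU that supports atomic operations on shared memory and 8-bit input values. The names histogram256 and histogram256_host, and the default launch configuration, are hypothetical and chosen only for illustration.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256

// One histogram per block in shared memory: 256 * 4 bytes = 1 KB,
// independent of how many threads the block has.
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *hist)
{
    __shared__ unsigned int s_hist[NUM_BINS];

    // Cooperatively zero the block-local histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Grid-stride loop over the input. Updates that hit the same bin
    // are serialized by atomicAdd, so no counts are lost.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_hist[data[i]], 1u);
    __syncthreads();

    // Merge this block's partial histogram into the global result.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&hist[i], s_hist[i]);
}

// Hypothetical host wrapper (illustration only, minimal error handling).
std::vector<unsigned int> histogram256_host(
    const std::vector<unsigned char> &input,
    int threads = 256, int blocks = 32)
{
    assert(threads > 0 && blocks > 0);
    const int n = static_cast<int>(input.size());
    unsigned char *d_data = nullptr;
    unsigned int *d_hist = nullptr;

    // Allocate at least one byte so an empty input is still valid.
    cudaMalloc(&d_data, n > 0 ? n : 1);
    cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
    if (n > 0)
        cudaMemcpy(d_data, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));

    histogram256<<<blocks, threads>>>(d_data, n, d_hist);

    std::vector<unsigned int> result(NUM_BINS);
    cudaMemcpy(result.data(), d_hist, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hist);
    return result;
}
```

With this layout the block size can be chosen for occupancy rather than for shared-memory capacity. The cost moves to contention on the atomics, which is likely to be worst when many input values fall into the same bin.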

Thanks for any reply.

Best regards.