Malloc in Kernel Complexity (10.2)

When calling __device__ malloc(), what sort of bottlenecks can affect the runtime?
Based on some quick experiments in which each thread allocates an int, the runtime tends to be fairly constant for each block (around 0.5 milliseconds), but increases linearly with the grid size.
Is there anywhere I can find information on what malloc() is doing under the hood? How parallel is it capable of being? How does it try to avoid conflicts, and at what point do conflicts become inevitable?
The programming guide mentions that device malloc can be used, but it doesn’t provide any insight into when it would or wouldn’t be a good idea.
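
For reference, the kind of experiment described above can be sketched roughly as follows. This is a minimal sketch, not the exact code that was measured: the kernel name mallocPerThread and the helper timeMallocKernel are hypothetical, the 64 MB heap size and the grid sizes are arbitrary choices, and each thread frees its allocation so the measured time covers malloc() plus free().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread allocates one int from the device heap, writes to it, and frees it.
// (Hypothetical kernel name; the free() is added so the heap is not exhausted.)
__global__ void mallocPerThread(int *failures)
{
    int *p = static_cast<int *>(malloc(sizeof(int)));
    if (p == nullptr) {
        atomicAdd(failures, 1);   // device malloc returns NULL when the heap is full
        return;
    }
    *p = threadIdx.x;             // touch the memory so the allocation is actually used
    free(p);
}

// Hypothetical helper: times one launch with CUDA events.
// Returns elapsed milliseconds, or -1.0f if something went wrong.
float timeMallocKernel(int blocks, int threadsPerBlock, int *hostFailures)
{
    int *dFailures = nullptr;
    if (cudaMalloc(&dFailures, sizeof(int)) != cudaSuccess) return -1.0f;
    cudaMemset(dFailures, 0, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mallocPerThread<<<blocks, threadsPerBlock>>>(dFailures);
    cudaEventRecord(stop);
    cudaError_t syncErr = cudaEventSynchronize(stop);
    cudaError_t launchErr = cudaGetLastError();

    float ms = -1.0f;
    if (syncErr == cudaSuccess && launchErr == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
        cudaMemcpy(hostFailures, dFailures, sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dFailures);
    return ms;
}

int main()
{
    // The device heap defaults to 8 MB; the limit must be set before the first
    // launch of a kernel that uses malloc(). 64 MB is an arbitrary choice here.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    int failures = 0;
    timeMallocKernel(1, 256, &failures);   // warm-up launch, result discarded

    for (int blocks = 1; blocks <= 1024; blocks *= 4) {
        float ms = timeMallocKernel(blocks, 256, &failures);
        printf("blocks=%4d  time=%.3f ms  failed allocations=%d\n",
               blocks, ms, failures);
    }
    return 0;
}
```

Compiled with nvcc, this prints one timing per grid size, which makes it easy to check whether the time really grows linearly with the number of blocks on a given GPU and CUDA version.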

I don’t think you will find published “under the hood” implementation details anywhere. If you’d like to see a change to the NVIDIA documentation, my suggestion is to file a bug.