As far as I know, CUDA does not allow allocating a device buffer in one host thread and transferring memory to or from that buffer in a different host thread. Does this restriction also apply to OpenCL? The standard only states that this is not thread-safe, which in my opinion means that the programmer has to take care of synchronization himself.
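To make clear what I mean by handling synchronization myself, here is a minimal sketch. It assumes the implementation permits cross-thread use of a buffer at all, which is exactly what I am asking. `SharedBuffer`, `guarded_write`, `guarded_read` and `allocate_here_transfer_elsewhere` are hypothetical names, and the "device buffer" is plain host memory standing in for a `cl_mem`, so the code runs without an OpenCL runtime and makes no actual OpenCL calls.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a device buffer handle (e.g. a cl_mem).
// It is plain host memory here, so no OpenCL runtime is involved.
struct SharedBuffer {
    std::mutex lock;        // guards every access to `data`
    std::vector<int> data;  // stand-in for the device allocation
};

// Stand-in for a blocking host-to-device transfer.
// The mutex serializes all transfers that touch this buffer.
void guarded_write(SharedBuffer& buf, std::size_t offset,
                   const std::vector<int>& src) {
    std::lock_guard<std::mutex> guard(buf.lock);
    for (std::size_t i = 0; i < src.size(); ++i) {
        buf.data[offset + i] = src[i];
    }
}

// Stand-in for a blocking device-to-host transfer.
std::vector<int> guarded_read(SharedBuffer& buf) {
    std::lock_guard<std::mutex> guard(buf.lock);
    return buf.data;
}

// "Allocate" in the calling thread, then transfer from two other threads.
std::vector<int> allocate_here_transfer_elsewhere() {
    SharedBuffer buf;
    buf.data.assign(4, 0);  // allocation happens in this thread

    // Each worker writes a disjoint half of the buffer.
    std::thread a([&buf] { guarded_write(buf, 0, {1, 2}); });
    std::thread b([&buf] { guarded_write(buf, 2, {3, 4}); });
    a.join();
    b.join();

    return guarded_read(buf);
}
```

With real OpenCL, the body of `guarded_write` would presumably be a blocking write to the buffer, with the same host-side lock held around the call.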
IIRC, thread-shared buffers have been available in CUDA since 2.2 (or 2.3?).