Concatenation of arrays from multiple devices

I want to farm out portions of work to multiple devices, each producing a sub-array, then gather those sub-arrays into one large array and broadcast it so that every device has its own copy.

I know I can do this by keeping a counter on each device that records how many elements are in its sub-array. After a cudaMemcpy from device to host, I can concatenate on the host by appending the sub-arrays one at a time to the large array.
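For reference, the host-side concatenation described above could look roughly like the sketch below. It assumes the per-device element counts have already been copied to the host and that the elements are floats; the function name gather_to_host and its parameters are made up for illustration and are not part of CUDA. The offset of each sub-array in the large array is the running sum of the counts before it, so each cudaMemcpy lands directly in its final position.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: throw if a CUDA runtime call fails.
static void check_gather(cudaError_t e) {
    if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
}

// Hypothetical helper, not a CUDA API.
// dev[i]   : device ID that owns sub-array i
// d_sub[i] : device pointer to sub-array i
// count[i] : number of elements in sub-array i (already known on the host)
std::vector<float> gather_to_host(const std::vector<int>& dev,
                                  const std::vector<const float*>& d_sub,
                                  const std::vector<size_t>& count) {
    assert(dev.size() == d_sub.size() && dev.size() == count.size());

    // Exclusive prefix sum of the counts gives each sub-array's start offset.
    std::vector<size_t> offset(count.size() + 1, 0);
    for (size_t i = 0; i < count.size(); ++i)
        offset[i + 1] = offset[i] + count[i];

    // One host buffer large enough for all sub-arrays.
    std::vector<float> big(offset.back());

    for (size_t i = 0; i < count.size(); ++i) {
        if (count[i] == 0) continue;  // nothing to copy from this device
        check_gather(cudaSetDevice(dev[i]));
        check_gather(cudaMemcpy(big.data() + offset[i], d_sub[i],
                                count[i] * sizeof(float),
                                cudaMemcpyDeviceToHost));
    }
    return big;
}
```

The returned vector can then be copied back to every device with one cudaMemcpy (host to device) per device to complete the broadcast.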

But is there a CUDA function that can do this for me, or a more succinct way of doing it?

If I understand correctly, you want functionality along the lines of MPI_Allgather? You’ll have to code this up yourself, as the CUDA runtime doesn’t provide collective inter-GPU communication functions.
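As a rough idea of what coding it up yourself might involve, here is a minimal sketch of a hand-rolled allgather. It assumes a CUDA version that has the point-to-point copy cudaMemcpyPeer (4.0 or later), float elements, and element counts already known on the host. The name allgather_floats and its parameters are hypothetical, not a CUDA API. Each device gets its own full-size buffer, and every sub-array is copied into it at the offset given by the running sum of the counts.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: throw if a CUDA runtime call fails.
static void check_allgather(cudaError_t e) {
    if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
}

// Hypothetical helper, not a CUDA API.
// dev[i]   : device ID of participant i
// d_sub[i] : device pointer to participant i's sub-array
// count[i] : number of elements in that sub-array
// Returns one full-size device buffer per participant; free each with cudaFree.
std::vector<float*> allgather_floats(const std::vector<int>& dev,
                                     const std::vector<const float*>& d_sub,
                                     const std::vector<size_t>& count) {
    assert(dev.size() == d_sub.size() && dev.size() == count.size());
    const size_t n = dev.size();

    // Exclusive prefix sum of the counts: where each sub-array starts.
    std::vector<size_t> offset(n + 1, 0);
    for (size_t i = 0; i < n; ++i) offset[i + 1] = offset[i] + count[i];
    const size_t total = offset[n];

    std::vector<float*> d_full(n, nullptr);
    for (size_t dst = 0; dst < n; ++dst) {
        check_allgather(cudaSetDevice(dev[dst]));
        // Allocate at least one element so the pointer is always valid.
        check_allgather(cudaMalloc(&d_full[dst],
                                   (total ? total : 1) * sizeof(float)));
        for (size_t src = 0; src < n; ++src) {
            if (count[src] == 0) continue;  // nothing to copy
            // Point-to-point copy from the source device into this
            // participant's full-size buffer.
            check_allgather(cudaMemcpyPeer(d_full[dst] + offset[src], dev[dst],
                                           d_sub[src], dev[src],
                                           count[src] * sizeof(float)));
        }
        // Wait until all copies into this device's buffer have finished.
        check_allgather(cudaDeviceSynchronize());
    }
    return d_full;
}
```

This issues one copy per (source, destination) pair, so it is simple rather than fast; for many devices or large arrays, overlapping the copies with streams would be the next thing to look at.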