I am implementing a depth-concatenation layer, i.e. concatenating multiple NCHW inputs along the C dimension. Is there a way to do this with the cuDNN library? I can't find any API that performs this kind of concatenation. I wrote my own CUDA kernel that handles two inputs, and it works, but I am looking for a faster implementation. Has anyone used the existing cuDNN APIs cleverly to do the same?
/* Device code */
__global__ void concat2Impl(const float* in1,
                            size_t numElems1,     // chw for input1
                            const float* in2,
                            size_t numElems2,     // chw for input2
                            float* out,
                            size_t maxElems,      // nchw for output
                            size_t numElemsPerBatch // chw for output
                           )
{
    /* Calculate the global linear index, assuming a 1-d grid. */
    size_t i = blockDim.x * blockIdx.x + threadIdx.x;
    for (; i < maxElems; i += size_t(blockDim.x * gridDim.x))
    {
        // compute batchIndex and batchOffset
        size_t batchIdx = i / numElemsPerBatch;
        size_t batchOffset = i - batchIdx * numElemsPerBatch;
        out[i] = (batchOffset < numElems1)
                     ? in1[batchOffset + batchIdx * numElems1]
                     : in2[(batchOffset - numElems1) + batchIdx * numElems2];
    }
}
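For what it's worth: since each input simply fills a contiguous channel slice of every output batch, the whole operation reduces to one strided copy per input. Here is a minimal CPU sketch of that indexing scheme (the function name `concatDepthCPU` and its signature are my own, for illustration only), which generalizes to any number of inputs:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Concatenate inputs along C for NCHW tensors (CPU reference sketch).
// inputs[k] points to n * chw[k] floats; out holds n * sum(chw) floats.
void concatDepthCPU(const std::vector<const float*>& inputs,
                    const std::vector<size_t>& chw,  // c_k*H*W per input
                    float* out, size_t n)
{
    size_t outCHW = 0;                    // total C*H*W of the output
    for (size_t e : chw) outCHW += e;

    size_t cOffset = 0;                   // running channel offset, in elements
    for (size_t k = 0; k < inputs.size(); ++k) {
        for (size_t b = 0; b < n; ++b) {
            // input k's batch b lands at a fixed offset inside output batch b
            const float* src = inputs[k] + b * chw[k];
            float* dst = out + b * outCHW + cOffset;
            std::copy(src, src + chw[k], dst);
        }
        cOffset += chw[k];
    }
}
```

On the device the same strided write can be expressed per input with `cudaMemcpy2DAsync` (width = `chw[k]*sizeof(float)`, height = `n`, dstPitch = `outCHW*sizeof(float)`), or inside cuDNN with `cudnnTransformTensor` using an output descriptor built via `cudnnSetTensor4dDescriptorEx` with `nStride = C_total*H*W` and the pointer offset into `out`. Whether either beats a fused hand-written kernel is worth benchmarking; they do save you from maintaining one kernel per input count.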