Transferring data to the GPU vs. calculating on the GPU


I want to ask a simple question that I am confused about. I am writing a program that takes characters as input (the length may be 100,000) and builds an 8x8 matrix for each character. That requires 64 x 4 x 100,000 bytes ≈ 25 MB of memory, and the same amount of data is transferred from host to device. If I instead transfer only the character sequence to the GPU, build the matrices of the desired size there, and perform the calculation on them, I need to transfer much less data. But I am confused about where to store the matrices created in the kernel, because I cannot allocate dynamic memory inside it. Any suggestions?

Best Regards,

Since you know the total size beforehand, just allocate the output buffer from the host side using cudaMalloc() and pass the pointer to the kernel. Each thread then writes into its own 64-element slice of that buffer.
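A minimal sketch of that idea, using the sizes from the question. The kernel name and the rule for filling each 8x8 matrix are made up for illustration; the point is only the allocation pattern: the 100 KB character array is copied across the bus, while the ~25 MB matrix buffer is allocated directly on the device and never transferred.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread expands one input character into an
// 8x8 int matrix stored in a preallocated device buffer.
__global__ void expand_chars(const char *chars, int *matrices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int *m = matrices + i * 64;            // this character's 8x8 block
    unsigned char c = chars[i];
    for (int k = 0; k < 64; ++k)           // placeholder fill rule
        m[k] = (c >> (k % 8)) & 1;
}

int main()
{
    const int n = 100000;                  // input length from the question
    char *d_chars;
    int  *d_matrices;

    // Only the raw characters (~100 KB) need to cross the bus...
    cudaMalloc(&d_chars, n * sizeof(char));
    // ...while the ~25 MB of matrices lives only on the device.
    cudaMalloc(&d_matrices, n * 64 * sizeof(int));

    // (fill a host array and copy it with cudaMemcpy here)

    expand_chars<<<(n + 255) / 256, 256>>>(d_chars, d_matrices, n);
    cudaDeviceSynchronize();

    cudaFree(d_chars);
    cudaFree(d_matrices);
    return 0;
}
```

Because every matrix has a fixed 64-element footprint, thread i can compute its own offset (i * 64) without any in-kernel allocation.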