The fastest copies are those that are avoided altogether. In my experience, any time the question of “fastest bulk copy” comes up in the context of performance tuning, it is a red flag.
Physically, from a hardware perspective, the fastest copies are those that use vector loads and stores. The widest of these are currently 128 bits (16 bytes), which corresponds to CUDA’s uint4 type, for example. Note that on GPUs, all loads and stores must be naturally aligned, otherwise their behavior is undefined. That is, a 16-byte access must be to an address that is evenly divisible by 16.
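To make that concrete, here is a minimal sketch of a copy kernel that moves data in 16-byte chunks. It assumes both pointers are 16-byte aligned and that the copy size is an exact multiple of sizeof(uint4); the kernel name and the grid-stride loop are illustrative, not taken from your code.

```
__global__ void copy_uint4(uint4 *dst, const uint4 *src, size_t nelem)
{
    // grid-stride loop: each thread copies one uint4 (128 bits) per iteration
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < nelem; i += stride) {
        dst[i] = src[i];   // compiles to one 128-bit load and one 128-bit store
    }
}
```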
Such alignment does not fall out of your current structure definition, and the alignment requirement means you cannot simply cast pointers to a type with stricter alignment. Since no context was provided, you will have to figure out the best way to ensure alignment yourself. FWIW, conventional wisdom in performance tuning suggests that it is usually best to sort structure members in order of decreasing element type size, whereas the opposite was done here (assuming that WORD is a wider type than BYTE, which seems reasonable).
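Since the actual structure definition was not posted, the following is purely illustrative: it assumes WORD is a 2-byte type and BYTE a 1-byte type (as in the common Windows typedefs), shows members sorted by decreasing size, and uses CUDA’s __align__(16) specifier to force an alignment that permits 16-byte accesses.

```
struct __align__(16) item_t {   // hypothetical layout; member names are made up
    unsigned int   id;          // widest member first (4 bytes)
    unsigned short w0, w1;      // WORD-sized members next (2 bytes each)
    unsigned char  b0, b1;      // BYTE-sized members last (1 byte each)
    // padded to 16 bytes by the alignment specifier, so an array of item_t
    // can be read and written with 128-bit accesses
};
```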
As for loop unrolling, that is something the CUDA compiler pursues aggressively by itself. You can intervene manually with the help of #pragma unroll if need be.
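For example (the kernel and the unroll factor of 4 are arbitrary, just to show the syntax):

```
__global__ void scale4(float *data, int n, float factor)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    #pragma unroll 4                      // request full unrolling of this 4-iteration loop
    for (int k = 0; k < 4; k++) {
        int idx = base + k;
        if (idx < n) data[idx] *= factor;
    }
}
```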
If you continue with your current approach, and the data in both cases resides in global memory, the suggestion is to make sure your loads and stores are coalesced across threads. This affects your indexing and data storage patterns. There are numerous questions about coalescing on various forums; if you search, you will find some. In a nutshell, you want adjacent threads in a warp to read (or write) adjacent locations in global memory, as in the sketch below.
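Both kernels below are made up for illustration and copy plain float elements. In the first, adjacent threads of a warp touch adjacent 4-byte locations, so the warp’s accesses combine into a small number of memory transactions; in the second, a stride greater than 1 leaves gaps between threads and coalescing degrades.

```
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread t accesses element t
    if (i < n) dst[i] = src[i];                      // fully coalesced
}

__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];                      // poorly coalesced for stride > 1
}
```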