Latency is introduced not at the file level but at the I/O operation level. When you request to read (and decrypt) some data, the first block (usually 512 bytes) is read and then decrypted. Decryption can be overlapped with reading the next block, but the time to serve the first block to the client still increases by the time needed to decrypt it. If you read a sufficiently large run of data, this is not an issue. In reality, though, such continuous reading is rare. Most files are small or fragmented, meaning that you need more than one read to get their content, and you pay the same latency penalty on each read.
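To put rough numbers on this, here is a minimal sketch of a two-stage read-then-decrypt pipeline. It assumes constant per-block times and perfect overlap; the function names and the timings you feed in are hypothetical, not measurements of any real system.

```python
def plain_read_time(num_blocks, t_read):
    """Time to read num_blocks blocks with no decryption at all."""
    return num_blocks * t_read


def pipelined_decrypt_time(num_blocks, t_read, t_decrypt):
    """Time to read and decrypt num_blocks blocks when decrypting block i
    overlaps with reading block i+1 (simple two-stage pipeline)."""
    if num_blocks == 0:
        return 0.0
    # The first block must be read, the slower stage paces the middle,
    # and the last block still has to be decrypted after its read ends.
    return t_read + (num_blocks - 1) * max(t_read, t_decrypt) + t_decrypt


def relative_overhead(num_blocks, t_read, t_decrypt):
    """Extra time caused by decryption, as a fraction of the plain read time."""
    plain = plain_read_time(num_blocks, t_read)
    return (pipelined_decrypt_time(num_blocks, t_read, t_decrypt) - plain) / plain


if __name__ == "__main__":
    # Assumed timings in arbitrary units: decrypting a block takes half a read.
    for n in (1, 8, 1000):
        print(n, relative_overhead(n, t_read=1.0, t_decrypt=0.5))
```

With these assumed timings, a single-block read pays the full decryption time on top of the read, while for a long contiguous read the same fixed cost is spread over many blocks.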
Try measuring the host <-> CUDA device bandwidth when transferring 512-byte chunks (typical for storage devices); I suspect it will be far from optimal.
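Before measuring, you can get a feel for why small chunks hurt with a simple model: each transfer pays a fixed setup cost plus the time to move the bytes at peak rate. The sketch below uses that model; the overhead and peak-bandwidth constants are placeholders I made up for illustration, not measured values for any particular GPU or bus, so substitute your own numbers.

```python
# Hypothetical placeholder values, NOT measurements:
ASSUMED_OVERHEAD_S = 10e-6        # fixed cost per transfer, in seconds
ASSUMED_PEAK_BYTES_PER_S = 5e9    # bandwidth reached by very large transfers


def effective_bandwidth(chunk_bytes, fixed_overhead_s, peak_bytes_per_s):
    """Bytes per second actually achieved when every chunk pays a fixed
    per-transfer overhead before moving at the peak rate."""
    transfer_time = fixed_overhead_s + chunk_bytes / peak_bytes_per_s
    return chunk_bytes / transfer_time


if __name__ == "__main__":
    for size in (512, 4096, 64 * 1024, 1024 * 1024, 64 * 1024 * 1024):
        bw = effective_bandwidth(size, ASSUMED_OVERHEAD_S, ASSUMED_PEAK_BYTES_PER_S)
        print(f"{size:>10} bytes: {bw / 1e6:10.1f} MB/s "
              f"({100 * bw / ASSUMED_PEAK_BYTES_PER_S:5.1f}% of assumed peak)")
```

Under this model the fixed overhead dominates for 512-byte chunks, so the achieved bandwidth is a small fraction of the peak; only a real measurement on your hardware will tell you how small.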
The solution might be a combined CPU+GPU approach, where the CPU serves small I/O requests (which are the majority) and the GPU serves only large reads/writes…
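A minimal sketch of that kind of split, assuming a simple size threshold decides where a request goes. The threshold value and the names `choose_backend` and `split_requests` are hypothetical; a real cutoff would have to come from benchmarking on the target machine.

```python
# Hypothetical cutoff: requests at or above this size go to the GPU.
GPU_THRESHOLD_BYTES = 1024 * 1024


def choose_backend(request_bytes, threshold=GPU_THRESHOLD_BYTES):
    """Return "gpu" for large requests and "cpu" for everything else."""
    return "gpu" if request_bytes >= threshold else "cpu"


def split_requests(request_sizes, threshold=GPU_THRESHOLD_BYTES):
    """Group request sizes by the backend that would serve them."""
    routed = {"cpu": [], "gpu": []}
    for size in request_sizes:
        routed[choose_backend(size, threshold)].append(size)
    return routed


if __name__ == "__main__":
    sizes = [512, 4096, 512, 8 * 1024 * 1024, 1024, 32 * 1024 * 1024]
    print(split_requests(sizes))
```

The point of the threshold is that small requests never pay the transfer and launch overhead of going to the GPU, while large ones can amortize it.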
This only shows that that particular implementation is poor; just google for AES benchmarks and you’ll see that 50+ MB/sec per core is not a problem today.
BTW, I’ve been using whole-disk encryption for years and haven’t noticed any significant performance penalties.
I’ll just remind you that AES-256 is approved for TOP SECRET data. Cascading ciphers may seem like a good idea, but cascading does not add practical security (it adds overhead and protects only against the case where the outer cipher is broken completely, which is very unlikely…).
Filesystem encryption using CUDA is certainly possible, but I’m just trying to tell you that it is very likely that it will not be better (= faster) than existing CPU-based solutions because of the problems mentioned above (also note that you can’t use CUDA from kernel space, which adds another layer of overhead).
IMO CUDA is perfect for things like the initial encryption of an HDD, but it is not very good for typical daily work with a filesystem.