Hello,
I’m not an engineer by training. I’m just someone who got obsessed with a question and couldn’t let it go. This post is the result of three months of research, a lot of dead ends, and something that might actually work. I’m posting it here because if anyone can tell me where I’m wrong — or where I’ve stumbled onto something real — it’s the people on this forum.
---
Where This Started
I have a Samsung J7 Prime. It’s a budget phone from 2016. Mali-T830 GPU, 3GB RAM, Exynos 7870. By any reasonable measure, this phone has no business running PlayStation 4 games. A PS4 pushes 1.84 teraflops. The J7 Prime pushes maybe 20 gigaflops on a good day. That’s not a gap. That’s a different universe.
But I kept thinking about it. Everyone says mobile emulation is impossible because the hardware can’t keep up. The CPU has to translate x86 instructions to ARM in real time. The GPU has to somehow handle PS4-quality assets. It’s a non-starter.
Then one night I asked myself something really simple:
“What if the phone doesn’t have to do any of that? What if all the rendering already happened somewhere else, and the phone just plays a video?”
That question sent me down a rabbit hole I haven’t climbed out of yet.
---
What I Found Along The Way
First Idea: Just Record Everything
My first thought was straightforward. Games are deterministic, right? Same inputs always produce the same outputs. So theoretically, you could record every possible playthrough of a game as video, store it somewhere, and stream the right one based on what the player does.
I was pretty excited about this for about two days.
Then I did the math. Even a simple game with 8 buttons over 10 minutes produces more possible input sequences than there are atoms in the universe. For GTA V, my conservative estimate came out to 181 exabytes of storage. That is orders of magnitude more than any streaming service could justify storing for a single game. The whole idea collapsed immediately.
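To make the scale concrete, here is the arithmetic as a minimal Python sketch. The 60 Hz polling rate and the 10^80 atom count are my assumptions (the usual rough figures), not measurements.

```python
import math

BUTTONS = 8             # each button is either pressed or not
POLLS_PER_SECOND = 60   # assumption: inputs are sampled once per frame at 60 Hz
SECONDS = 10 * 60       # a 10-minute session
LOG10_ATOMS = 80        # assumption: roughly 10^80 atoms in the observable universe


def log10_input_sequences(buttons: int, polls_per_second: int, seconds: int) -> float:
    """Return log10 of the number of distinct input sequences."""
    states_per_poll = 2 ** buttons           # 256 button combinations per poll
    polls = polls_per_second * seconds       # 36,000 polls in 10 minutes
    return polls * math.log10(states_per_poll)


digits = log10_input_sequences(BUTTONS, POLLS_PER_SECOND, SECONDS)
print(f"input sequences ~ 10^{digits:.0f}, atoms ~ 10^{LOG10_ATOMS}")
```

The count comes out around 10^86,697, so no amount of pruning by brute force gets anywhere near a storable number.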
So I went back to the drawing board.
---
Second Idea: Hybrid Video + Real-Time Rendering
Okay, I thought. What if we only record the most common paths? When the player does something unexpected, the phone’s GPU kicks in and renders that moment natively. Just for a second or two until it can jump back to the video.
I was proud of this one for about a week.
Then I actually looked at what “renders that moment natively” means on a J7 Prime. The Mali-T830 can’t even load PS4-quality textures into memory. It has 3GB of slow LPDDR3 RAM. A PS4 has 8GB of GDDR5 running at 176 GB/s. Asking this phone to render even one frame of Devil May Cry 5 is like asking a bicycle to tow a cruise ship. It doesn’t matter that it’s only for a second. It’s impossible.
That was a rough night. I almost gave up.
---
Third Idea: Stop Recording Everything. Start Predicting.
This is where things finally started making sense.
I realized I was thinking about the problem wrong. I didn’t need to record every possible path through a game. I just needed to record enough paths that a prediction engine could guess where the player is going next.
Here’s the key insight that changed everything:
Games are state machines. At any given moment, a game is in exactly one state. Your position, the enemies’ positions, the camera angle, your health, your current animation — all of it can be represented as numbers. And given enough data, you can predict with real accuracy what a player will do next based on their current state.
Speedrunners do this constantly. They know exactly what frame to press what button because they’ve memorized the game’s state machine. If a speedrunner can predict the next input, so can a model.
So here’s what I designed:
1. Capture the game once, strategically. Instead of recording every possible input, record the game state at every frame and map which inputs lead to which next states.
2. Store it as independent video segments. Each segment is a few seconds of gameplay at 150 FPS, encoded in H.266/VVC with no dependencies on other segments. This means you can jump to any segment instantly.
3. Use a prediction engine on cloud servers. When a player is in a certain state, the engine looks at what real players (and speedrunners) do in that state and pre-loads the most likely next segments.
4. The phone does nothing but decode video and send inputs. No rendering. No prediction. No heavy lifting. Just playing a video stream and telling the server what buttons are being pressed.
---
The Architecture (What I Actually Built On Paper)
I’m going to walk through this piece by piece. Please tear it apart where I’m wrong.
Capture Pipeline
The first problem is getting the game data. You need:
· A PS4 with exploitable firmware (for memory access)
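· A capture card that can handle 150 FPS (Elgato 4K X or similar). One catch I haven’t solved: the PS4 outputs at most 60 Hz over HDMI, so the captured video itself tops out at 60 FPS and the 150 FPS target would need frame interpolation or some other source.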
· A PC to run the encoding and state logging
· Custom software to automate the whole thing
The PS4 runs the game. A payload injected through the kernel exploit reads the game’s memory — player position, enemy positions, camera angle, health values, everything. This gets sampled 150 times per second and sent to the capture PC.
Meanwhile, an automation bot systematically plays through the game, trying different input combinations and recording what happens. It’s not trying every single possibility. It’s exploring the state space strategically, prioritizing paths that real players actually take.
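One way to read “exploring the state space strategically” is best-first search: always expand the state that real players are most likely to reach. This is only a sketch under that assumption. The `explore` function, the `successors` callback and the toy state names are all hypothetical, and a real bot would get its probabilities from player data.

```python
import heapq


def explore(start, successors, budget):
    """Best-first exploration of a state graph.

    successors(state) yields (input_name, next_state, probability).
    Returns up to `budget` (state, input, next_state) transitions to record.
    """
    # heapq is a min-heap, so path probability is stored negated.
    frontier = [(-1.0, 0, start)]
    tie = 1                      # tie-breaker so states are never compared
    seen = {start}
    recorded = []
    while frontier and len(recorded) < budget:
        neg_prob, _, state = heapq.heappop(frontier)
        for inp, nxt, p in successors(state):
            if len(recorded) >= budget:
                break
            recorded.append((state, inp, nxt))
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (neg_prob * p, tie, nxt))
                tie += 1
    return recorded


# Toy graph with made-up states and probabilities.
TOY = {
    "corridor": [("forward", "arena", 0.8), ("back", "start_room", 0.2)],
    "arena": [("attack", "boss_phase2", 0.7), ("dodge", "arena_dodge", 0.3)],
    "start_room": [("forward", "corridor", 1.0)],
}


def toy_successors(state):
    return TOY.get(state, [])


segments = explore("corridor", toy_successors, budget=3)
print(segments)
```

With a budget of 3 the bot records both corridor exits and then spends its last recording on the arena, because the arena is the more likely branch.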
Each recording becomes a video segment: 2 to 5 seconds of gameplay. The segment gets encoded in H.266/VVC with one key constraint: every segment starts with an IDR frame and uses a closed GOP, so no frame references a frame in another segment. That makes every segment fully self-contained. You can jump to any segment and start decoding from its first frame without waiting for earlier data.
The .thew File
This is the container format I designed. It’s basically a database of video segments indexed by game state.
The structure:
· Header: Game ID, version, frame rate, resolution, codec info
· State Table: Maps state hashes to segment IDs. O(1) lookup.
· Branch Table: For each segment, lists all possible next segments with probability weights
· Segment Data: The actual encoded video payloads with byte offsets and durations
When the prediction engine needs a segment, it queries the state table, gets the segment ID, and pulls the video data. The index lookup itself takes microseconds on server hardware; reading the segment payload takes longer unless it is already cached in memory.
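Here is a rough in-memory sketch of that index in Python. The class and field names (`ThewIndex`, `Segment`, `hash_state`) are placeholders of mine, not a finished format. One assumption matters a lot: the state values are quantized before hashing, because exact positions will almost never repeat between players.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class Segment:
    offset: int        # byte offset of the payload inside the file
    length: int        # payload size in bytes
    duration_ms: int


@dataclass
class ThewIndex:
    game_id: str
    fps: int
    state_table: dict[str, int] = field(default_factory=dict)    # state hash -> segment id
    branch_table: dict[int, list[tuple[int, float]]] = field(default_factory=dict)
    segments: dict[int, Segment] = field(default_factory=dict)

    def lookup(self, state_hash: str) -> Segment | None:
        """Average O(1) lookup: state hash -> segment metadata."""
        seg_id = self.state_table.get(state_hash)
        return None if seg_id is None else self.segments[seg_id]

    def likely_next(self, seg_id: int, k: int = 2) -> list[int]:
        """Return the k most probable follow-up segment ids."""
        branches = sorted(self.branch_table.get(seg_id, []),
                          key=lambda b: b[1], reverse=True)
        return [sid for sid, _ in branches[:k]]


def hash_state(values: tuple[int, ...]) -> str:
    """Hypothetical state hash over already-quantized integer values."""
    return hashlib.sha256(repr(values).encode()).hexdigest()[:16]


index = ThewIndex(game_id="demo", fps=150)
h = hash_state((10, 20, 90, 100))          # e.g. x, y, camera angle, health
index.state_table[h] = 7
index.segments[7] = Segment(offset=0, length=4096, duration_ms=3000)
index.branch_table[7] = [(8, 0.2), (9, 0.7), (10, 0.1)]
print(index.lookup(h), index.likely_next(7))
```

How coarse the quantization is decides everything: too fine and the state table never gets a hit, too coarse and the video visibly jumps when a segment starts.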
The Prediction Engine
This is the brain of the whole system. It runs on edge servers close to the player.
The engine uses a Markov model — it predicts the next input based on the current state. But I added something that I think makes it work better for this specific use case.
I pre-seeded the probability tables with speedrun data.
Here’s why. In Devil May Cry 5, when you enter a boss fight, the “average” player might dodge randomly or spam basic attacks. But a speedrunner executes specific frame-perfect combos. If the prediction engine only knows average player behavior, it’ll constantly guess wrong during boss fights — exactly when latency matters most.
So I watched hours of world record runs. I mapped out exactly what inputs speedrunners use at exactly what moments. I fed that into the probability table as a prior. Now when a player enters a boss arena, the engine shifts toward speedrunner-informed predictions. It’s not perfect, but it’s dramatically better than guessing blind.
When the engine predicts correctly, the next segment is already buffered on the player’s phone. Zero delay.
When it predicts wrong — the player does something unexpected — the server has to fetch the correct segment on the fly. On a 5G connection with edge computing, this takes about 15-40 milliseconds. The phone freezes on the last good frame for a tiny moment, then resumes. It’s noticeable but not game-breaking.
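A minimal sketch of how the speedrun prior could be mixed with live player counts, assuming a simple pseudo-count scheme. The `InputPredictor` class, the `prior_weight` value and the input names are hypothetical and only there to show the idea.

```python
from collections import Counter, defaultdict


class InputPredictor:
    """First-order model: P(next input | state), seeded with speedrun data."""

    def __init__(self, prior_weight: float = 5.0):
        # prior_weight: how many live observations the whole prior is worth.
        self.prior_weight = prior_weight
        self.prior = defaultdict(Counter)      # state -> input -> speedrun count
        self.observed = defaultdict(Counter)   # state -> input -> live count

    def seed(self, state, inp, count=1):
        self.prior[state][inp] += count

    def observe(self, state, inp):
        self.observed[state][inp] += 1

    def probabilities(self, state):
        prior, obs = self.prior[state], self.observed[state]
        prior_total = sum(prior.values())
        scores = {}
        for inp in set(prior) | set(obs):
            prior_share = prior[inp] / prior_total if prior_total else 0.0
            scores[inp] = obs[inp] + self.prior_weight * prior_share
        total = sum(scores.values())
        return {i: s / total for i, s in scores.items()} if total else {}

    def top_k(self, state, k=2):
        """Inputs whose segments should be pre-loaded, most likely first."""
        probs = self.probabilities(state)
        return sorted(probs, key=probs.get, reverse=True)[:k]


predictor = InputPredictor()
predictor.seed("boss_intro", "stinger", 9)   # made-up speedrun counts
predictor.seed("boss_intro", "dodge", 1)
before = predictor.top_k("boss_intro")

for _ in range(20):                           # live players mostly dodge here
    predictor.observe("boss_intro", "dodge")
after = predictor.top_k("boss_intro", k=1)
print(before, after)
```

The prior dominates while there is little live data and fades as real observations pile up, which is the behaviour I want: speedrun knowledge as a starting guess, not a permanent bias.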
The Streaming Protocol
I went back and forth on this a lot. Custom UDP? Raw TCP? WebRTC?
I ended up on QUIC (the transport that HTTP/3 runs on). Here’s why:
· Multiplexing without head-of-line blocking. Input and video travel on separate streams. A dropped video packet doesn’t delay the input stream.
· 0-RTT resumption. A client that has connected to the server before can resume and send data in its first packet, so reconnecting after a drop skips the full handshake.
· Connection migration. The session survives network changes without restarting.
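· Loss recovery. Standard QUIC (RFC 9000) recovers lost packets by retransmission; it does not include forward error correction. If retransmission delay turns out to hurt too much for low-latency video, FEC would have to be added at the application layer.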
I looked into building a custom UDP protocol with selective ACKs and my own congestion control. Then I realized QUIC already does all of this and has been battle-tested at Google scale. No need to be clever.
The Android Client
This is the part that runs on the J7 Prime. And it’s deceptively simple.
Kotlin handles the UI — game library, settings, the virtual controller overlay. Nothing fancy. Jetpack Compose because it’s clean and reactive.
But the real work happens in C++ through the NDK:
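· Video decoding uses MediaCodec, Android’s hardware codec API. The J7 Prime’s chipset has a hardware HEVC decoder but, as far as I can tell, no VVC decoder, and an HEVC block can’t decode a VVC stream even though the pipeline is similar in principle. On this phone VVC would fall back to software decoding on the CPU, so HEVC is the only path where the CPU stays mostly idle.
· Input polling uses Linux evdev directly, bypassing Android’s input pipeline. The goal is to get the software side of input latency under 1 millisecond. Two caveats: reading /dev/input devices normally requires root on stock Android, and the touchscreen’s own sampling rate still sets a floor that software can’t remove.
· The frame buffer is a lock-free circular queue. Segments arrive and wait in the buffer in encoded form, then get decoded just before the frame they’re needed. The buffer holds about 15 seconds of encoded video, which gives plenty of margin for network hiccups. It has to stay encoded: 15 seconds of decoded 1080p frames at 150 FPS would be roughly 7 GB (at 12 bits per pixel), more than twice the phone’s RAM.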
· Adaptive bitrate monitors buffer health and network conditions. If the connection weakens, it silently switches to a lower bitrate tier. The player might drop from 150 FPS to 120 FPS or 90 FPS temporarily, but the game never freezes entirely.
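The adaptive bitrate logic could look something like this sketch. The tier names, resolutions and frame rates match the quality tiers listed later in this post. Only the Tier 1 and Tier 3 bitrates (45 and 18 Mbps) are my actual numbers; the other three are assumptions I got by scaling storage size the same way. The `headroom` and `low_buffer_s` thresholds are also guesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    resolution: str
    fps: int
    mbps: float


# Ordered from best to worst quality.
TIERS = [
    Tier("Tier 1", "1080p", 150, 45.0),   # bitrate from the post
    Tier("Tier 2", "1080p", 120, 30.0),   # assumed bitrate
    Tier("Tier 3", "1080p", 90, 18.0),    # bitrate from the post
    Tier("Tier 4", "720p", 60, 8.0),      # assumed bitrate
    Tier("Tier 5", "480p", 60, 4.0),      # assumed bitrate
]


def pick_tier(throughput_mbps: float, buffer_seconds: float,
              headroom: float = 0.8, low_buffer_s: float = 3.0) -> Tier:
    """Pick the best tier that fits the measured throughput.

    headroom keeps a safety margin; a nearly empty buffer halves the budget
    so the client drops quality before it stalls.
    """
    budget = throughput_mbps * headroom
    if buffer_seconds < low_buffer_s:
        budget *= 0.5
    for tier in TIERS:
        if tier.mbps <= budget:
            return tier
    return TIERS[-1]   # nothing fits: stay on the lowest tier


print(pick_tier(60, 10).name)   # healthy connection
print(pick_tier(25, 1).name)    # same link class, but the buffer is draining
```

Switching tiers is cheap here only because every segment is self-contained, so the client can change quality at any segment boundary.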
---
Numbers That Matter (For DMC5 Specifically)
I ran the estimates for Devil May Cry 5’s Any% speedrun route. About 25 minutes of linear gameplay with combat branching.
Quality Tier                    | Resolution | FPS | Storage
Tier 1 (5G mmWave / Wi-Fi 6E)   | 1080p      | 150 | 155 GB
Tier 2 (5G sub-6 / Fast Wi-Fi)  | 1080p      | 120 | 103 GB
Tier 3 (4G LTE)                 | 1080p      | 90  | 62 GB
Tier 4 (Weak 4G)                | 720p       | 60  | 27 GB
Tier 5 (3G Fallback)            | 480p       | 60  | 14 GB
Total (all tiers)               |            |     | ~361 GB
That’s for one game. The full speedrun route with all predicted branches. 361 GB total, hosted on a CDN.
Streaming at Tier 3 (the most common for mobile users), the player needs about 18 Mbps sustained. Most 4G connections can handle that. At Tier 1, you need 45 Mbps — but at that point you’re getting 150 FPS on a 1080p screen with visual quality indistinguishable from local play.
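As a sanity check on the storage figures, this small sketch compares each tier’s stored size with the size of a single uninterrupted 25-minute recording at that tier’s bitrate. It assumes constant bitrate and decimal gigabytes; the function name is mine.

```python
def linear_path_gb(mbps: float, minutes: float) -> float:
    """Size in GB (10^9 bytes) of one uninterrupted recording at a constant bitrate."""
    bytes_per_second = mbps * 1e6 / 8
    return bytes_per_second * minutes * 60 / 1e9


ROUTE_MINUTES = 25

for name, mbps, stored_gb in [("Tier 1", 45, 155), ("Tier 3", 18, 62)]:
    single = linear_path_gb(mbps, ROUTE_MINUTES)
    print(f"{name}: one path = {single:.1f} GB, stored = {stored_gb} GB, "
          f"branch multiple = {stored_gb / single:.1f}x")
```

Both tiers come out at about 18.4x, meaning the stored branches add up to roughly 18 times the footage of one straight run through the route. That multiple is the real budget for how much branching the system can cover.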
For 10,000 concurrent players at Tier 3, monthly CDN egress would be around 58 petabytes. At standard volume pricing, that’s roughly $290,000 per month in bandwidth costs. Not cheap, but viable for a subscription service.
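The egress figure can be reproduced with a few lines. This sketch assumes 10,000 players streaming around the clock for a 30-day month. The `prefetch_multiplier` parameter is something I added to show what happens when pre-loaded segments that never get played are counted too; its value of 2.0 in the second call is purely illustrative.

```python
def monthly_egress_pb(players: int, mbps: float, days: int = 30,
                      prefetch_multiplier: float = 1.0) -> float:
    """Monthly CDN egress in petabytes (10^15 bytes).

    prefetch_multiplier > 1 models extra segments that are sent ahead of time
    and then thrown away when the prediction turns out wrong.
    """
    bytes_per_second = players * mbps * 1e6 / 8 * prefetch_multiplier
    return bytes_per_second * 86_400 * days / 1e15


egress = monthly_egress_pb(10_000, 18)
implied_price_per_gb = 290_000 / (egress * 1e6)   # 1 PB = 10^6 GB
with_prefetch = monthly_egress_pb(10_000, 18, prefetch_multiplier=2.0)

print(f"egress = {egress:.1f} PB/month, "
      f"implied price = ${implied_price_per_gb:.4f}/GB, "
      f"with 2x prefetch = {with_prefetch:.1f} PB/month")
```

The base case gives 58.3 PB and an implied price of about half a cent per GB. The part I have not priced in yet is speculative prefetch: every wrong guess is still bandwidth that was paid for.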
---
Where I Need Help
This is why I’m posting here. I’ve been working on this alone for three months. I’ve read everything I can find. But I don’t have access to real cloud infrastructure or NVIDIA’s encoding expertise. Here’s what I genuinely need feedback on:
1. The encoding pipeline. H.266/VVC encoding at 150 FPS in real time is demanding. Are current hardware encoders up to this? Or does this need to be done offline and stored, which is what I’m assuming?
2. The prediction model. Markov models work but they’re simple. Would a lightweight transformer model running at the edge produce meaningfully better predictions? The latency budget for prediction is under 5 milliseconds — can transformers even operate in that window?
3. NVIDIA’s edge solutions. Does NVIDIA have anything in the EGX or Jetson lineup that could serve as prediction servers at CDN edge nodes? Something that can do fast state lookups and segment routing with sub-millisecond latency?
4. The VVC hardware decoding gap. How far away are we from VVC hardware decoding being standard on mid-range mobile chipsets? H.265 hardware decoding is everywhere. VVC is still rolling out. Should I target H.265 instead and take the storage hit?
5. The obvious question I’m missing. I’ve been staring at this for three months. There’s probably something fundamentally wrong that I can’t see anymore. If you spot it, please tell me. I’d rather know now than waste more time.
---
Final Thoughts
I know this is ambitious. I know I’m not a cloud gaming engineer. I’m just someone who couldn’t stop thinking about a problem and followed it wherever it led.
Maybe the whole thing falls apart under real scrutiny. Maybe there’s a reason nobody has built this yet. But I had to find out.
If you read this far, genuinely thank you. Any feedback — positive or negative — is more than I’ve gotten in three months of working on this alone.
---
P.S. — I have a more detailed whitepaper with state hashing pseudocode, capture bot logic, and buffer management diagrams. Happy to share if anyone wants to dig deeper.
Also, all I need now is the footage and more time. So here it is, boom!