Accelerating AI Inference for 3D Creation on Roblox
- Roblox implemented CUDA Graphs and KV caching to speed 3D mesh generation for more responsive iteration.
- At launch, the Cube 3D model could generate tokens in 7.8 milliseconds (down from 60.5 milliseconds) and full objects in 4 seconds (down from 31 seconds).
Earlier this year, Roblox shared the first capability of our Cube 3D foundation model. With Cube 3D, creators can generate 3D models and environments directly from text prompts. From the outset, we prioritized latency optimization, recognizing that slow generation times disrupt what’s inherently an iterative process. Before Cube 3D launched in March, we’d already made the inference step for Cube 3D 87% faster and more responsive for both developers and users.
Since launch, more than 578,000 objects have been generated across several notable experiences. Developers have also expressed interest in letting users generate 3D objects within experiences from text prompts such as “cats” or “burgers.” Most notably, Mic Up, a popular hangout game built around voice chat, used Cube 3D to give players a fun, interactive way to generate objects: players open a left-hand menu of additional capabilities, click an AI icon, and enter a text prompt to generate a 3D object. For users, longer generation times create friction, depriving them of the magic of seeing their ideas transformed into 3D in real time.
We wanted to transform the 3D generation experience from a stop-and-wait interaction to something that feels responsive and natural, enabling rapid experimentation. The ability to quickly add objects to a scene is critical for developers. To accelerate Cube 3D, we first profiled the inference pipeline to identify performance bottlenecks. Despite using powerful GPUs, we found significant idle time between operations.
Solving the CPU-GPU Scheduling Bottleneck
Modern deep learning frameworks rely on the CPU to schedule and launch operations (or kernels) on the GPU. The CPU prepares each operation, sends it to the GPU, and awaits confirmation before preparing the next operation. This waiting creates a scheduling bottleneck where the GPU could sit idle while the CPU prepares the next batch of work. Ideally, we want the CPU to run ahead of the GPU, preparing and queuing operations so that the GPU always has work to do.
This is especially problematic for autoregressive decoders in transformer-based models like Cube 3D, which process input and generate tokens sequentially. A single generation requires thousands of individual operations, and the scheduling overhead accumulates with every step in the sequence.
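As a rough illustration of why the per-step cost adds up, here is a minimal PyTorch sketch of an autoregressive decode loop; the `decoder` callable, tensor shapes, and greedy sampling are illustrative stand-ins rather than Cube 3D's actual code.

```python
import torch

@torch.no_grad()
def generate(decoder, prompt_tokens, num_new_tokens):
    # Illustrative decode loop: one full forward pass per generated token.
    tokens = prompt_tokens  # (batch, seq_len) of token ids
    for _ in range(num_new_tokens):
        # Each forward pass launches hundreds of small GPU kernels
        # (projections, attention, MLPs, norms), and the CPU has to
        # prepare and enqueue every one of them, every step.
        logits = decoder(tokens)                      # (batch, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1)  # greedy pick, for simplicity
        tokens = torch.cat([tokens, next_token[:, None]], dim=1)
    return tokens
```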
“We wanted to build something that enables four-dimensional interaction,” said Vice President of Engineering Anupam Singh, explaining why Roblox selected an autoregressive approach. “We don’t just want to build the car; we also want to be able to open the door of the car and get inside it.”
Each operation incurred:
- CPU time to prepare each kernel
- Overhead from launching the kernel
- GPU execution time (the actual computation)
- Synchronization overhead when checking for completion
For small operations that execute quickly on the GPU, this overhead can dominate the inference time, leaving the GPU actively computing for only a small fraction of it.
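The imbalance is easy to see with a toy benchmark. The sketch below times a chain of small matrix multiplications with CUDA events; the sizes are made up and far smaller than a real decoder layer, but at this scale launch and scheduling overhead typically rivals or exceeds the actual compute.

```python
import torch

assert torch.cuda.is_available()
x = torch.randn(64, 64, device="cuda")
weights = [torch.randn(64, 64, device="cuda") for _ in range(1000)]

# Warm up so one-time initialization costs don't skew the timing.
for w in weights:
    x = x @ w
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for w in weights:
    x = x @ w  # every tiny matmul is a separate kernel the CPU must launch
end.record()
torch.cuda.synchronize()
print(f"1,000 small matmuls: {start.elapsed_time(end):.2f} ms")
```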
Implementing CUDA Graphs: Eliminating the Middleman
To address this bottleneck, we leveraged CUDA Graphs—a feature that allows for recording and replaying sequences of GPU operations without CPU intervention. The autoregressive decoder component of Cube 3D’s architecture processes text prompts and generates shape tokens through a fixed-length vector.
While functionally similar to a traditional large language model (LLM), our dual-stream decoder architecture has an important difference—it uses two parallel attention streams. One stream is dedicated to condition tokens and the other to shape tokens. Off-the-shelf LLM inference engines weren’t suitable for our needs, and we needed a custom implementation tailored to our specific architecture.
Think of CUDA Graphs like recording a macro for the GPU. Instead of the CPU issuing each command individually, it records an entire sequence of GPU operations (the graph) and launches the entire graph with a single CPU instruction. This approach dramatically reduces kernel launch overhead by eliminating the need for the CPU to individually schedule each operation during inference. Once the graph is launched, the GPU executes the entire sequence autonomously, without waiting for further instructions.
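Here is a minimal sketch of that capture-and-replay pattern using PyTorch's `torch.cuda.CUDAGraph` API; the stand-in model, tensor shapes, and warmup count are illustrative, not our production setup.

```python
import torch

# Stand-in for a decoder block; layers and shapes are illustrative only.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda().eval()
static_input = torch.randn(1, 512, 1024, device="cuda")  # fixed-shape buffer

# Warm up on a side stream so lazy initialization isn't captured into the graph.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    with torch.no_grad():
        for _ in range(3):
            static_output = model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Record the whole sequence of kernels once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# ...then replay it with a single launch: copy new data into the static input
# buffer, replay the graph, and read the result out of the static output buffer.
def run(new_input):
    static_input.copy_(new_input)
    graph.replay()
    return static_output.clone()
```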
CUDA Graphs do come with some limitations. Because the graph structure needs to be determined in advance, they require a fixed batch size and input dimensions. This means separate graphs need to be created for each batch size or input shape. For our use case with Cube 3D, this limitation was acceptable, as we could standardize the inference process around common input dimensions.
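A common way to live with that constraint, sketched below under the assumption of a small set of supported shapes (not necessarily how Cube 3D organizes it), is to capture one graph per shape bucket and look it up at inference time:

```python
# Illustrative: memoize one captured graph (plus its static buffers) per shape bucket.
graphs = {}  # (batch_size, seq_len) -> (graph, static_input, static_output)

def graph_for(batch_size, seq_len, capture_fn):
    key = (batch_size, seq_len)
    if key not in graphs:
        # capture_fn is a hypothetical helper that warms up, captures, and
        # returns the graph and its static buffers for this exact shape.
        graphs[key] = capture_fn(batch_size, seq_len)
    return graphs[key]
```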
We had to adapt our approach to implement CUDA Graphs for our Cube 3D model. In traditional LLMs, attention operations are always performed over the same sequence length, providing a static shape to work with. In our custom dual-stream architecture, however, some attention layers operate over the sequence alone, while others operate over the sequence and condition tokens combined.
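As a simplified illustration of why the shapes differ, the sketch below (single-head attention built on `torch.nn.functional.scaled_dot_product_attention`; not our actual layer code) shows the two cases: attending over the shape-token stream alone versus over the condition and shape tokens together.

```python
import torch
import torch.nn.functional as F

def shape_only_attention(q_shape, k_shape, v_shape):
    # Attention length = number of shape tokens.
    return F.scaled_dot_product_attention(q_shape, k_shape, v_shape)

def dual_stream_attention(q_shape, k_cond, v_cond, k_shape, v_shape):
    # Attention length = number of condition tokens + number of shape tokens,
    # so this layer's kernels see a different static shape than the one above.
    k = torch.cat([k_cond, k_shape], dim=-2)
    v = torch.cat([v_cond, v_shape], dim=-2)
    return F.scaled_dot_product_attention(q_shape, k, v)
```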
Despite these challenges, we saw remarkable results after implementing CUDA Graphs. We use time per output token (TPOT) to measure generation time for each token during inference. After implementing CUDA Graphs, our TPOT improved from 60.5 milliseconds to 20.5 milliseconds, a 2.9x improvement. Overall generation time dropped 66%, from 31 seconds to 10.5 seconds.
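For reference, TPOT is simply wall-clock generation time divided by the number of tokens produced. A rough way to measure it (a sketch that folds prompt processing into the average; `generate_fn` is a hypothetical callable) looks like this:

```python
import time
import torch

def measure_tpot_ms(generate_fn, prompt, num_tokens):
    # Rough wall-clock time per generated token, in milliseconds.
    torch.cuda.synchronize()
    start = time.perf_counter()
    generate_fn(prompt, num_tokens)  # hypothetical: generates exactly num_tokens tokens
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / num_tokens
```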
KV Caching: Building on Our Success
To improve latency further, we implemented KV caching, a standard practice in LLM inference that has proved highly effective across the industry.
In transformer-based models like Cube 3D, each token generation requires computing key (K) and value (V) matrices based on all previously generated tokens. As the sequence grows longer, recomputing these matrices for every token becomes increasingly inefficient.
KV caching solves this by:
- Storing the K and V matrices for all previously generated tokens
- Computing the K and V matrices only for new tokens
- Appending these new matrices to the cached values
This approach eliminates redundant computation, reducing the work required for each new token, a saving that becomes especially impactful as the generated sequence grows longer.
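A minimal sketch of the idea for a single attention layer (simplified single-head projections and a growing cache; not Cube 3D's actual implementation):

```python
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):
    # Simplified single-head attention with a key/value cache.
    def __init__(self, dim):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.k_cache = None
        self.v_cache = None

    def forward(self, x_new):
        # x_new holds only the newly generated token(s): (batch, new_len, dim).
        q = self.q_proj(x_new)
        k_new = self.k_proj(x_new)  # K and V are computed for the new tokens only...
        v_new = self.v_proj(x_new)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k_new, v_new
        else:
            # ...and appended to the K and V cached from all previous steps.
            self.k_cache = torch.cat([self.k_cache, k_new], dim=1)
            self.v_cache = torch.cat([self.v_cache, v_new], dim=1)
        return F.scaled_dot_product_attention(q, self.k_cache, self.v_cache)
```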
Our approach to integrating KV caching with our CUDA Graphs implementation was similar to traditional LLM inference. The addition of KV caching reduced our TPOT to just 7.8 milliseconds, and overall generation time decreased 87%, from the original 31 seconds down to just 4 seconds. This reduction makes the tool far more responsive for creators.
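One wrinkle worth noting: a captured graph can't allocate memory or change tensor shapes between replays, so a cache that grows by concatenation (as in the sketch above) won't work inside the graph. A common pattern, sketched below under those assumptions rather than as our exact implementation, is to pre-allocate a fixed-size cache and write each step's K and V into it in place:

```python
import torch

class StaticKVCache:
    # Fixed-size, pre-allocated cache: every decode step sees identical shapes
    # and does only in-place writes, which keeps it CUDA Graph-friendly.
    def __init__(self, batch, max_len, dim, device="cuda"):
        self.k = torch.zeros(batch, max_len, dim, device=device)
        self.v = torch.zeros(batch, max_len, dim, device=device)
        self.pos = torch.zeros(1, dtype=torch.long, device=device)  # on-device write index

    def update(self, k_new, v_new):
        # Scatter the new K/V at the current position; no allocation, no reshaping.
        idx = self.pos + torch.arange(k_new.shape[1], device=k_new.device)
        self.k.index_copy_(1, idx, k_new)
        self.v.index_copy_(1, idx, v_new)
        self.pos += k_new.shape[1]
        # Attention over the full buffer then needs a mask for unfilled positions.
        return self.k, self.v
```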
Assessing Real-World Impact on Developers and Users
These improvements directly translate into tangible benefits for developers and users. Even with mesh post-processing, our final end-to-end (E2E) latency is seven seconds. Developers can now work in faster iteration cycles, and users experience more responsive 3D generation.
We are exploring techniques that further reduce latency and improve the user experience, including optimized kernels, model quantization for even faster inference, hardware-specific optimizations, and parallel token generation.
This work becomes more complex when we expand to full scene generation and understanding, where many 3D elements need to work together in context with one another within a layout. We also want the 3D objects and worlds we create to be fully functional, so doors open and close, wheels turn, etc. To get there, we need rapid generation and iteration to scale to entire scenes, fully functional objects, and avatars. We’re excited to share further improvements and new functionality as we expand our Cube 3D foundation model—and to see the immersive worlds our creator community builds with it.