Ziteng Sun (Google Research)

Feb 5, 2026 (at 3pm)
Soda Hall 510

Title and Abstract

Advances in Accelerating LLM Inference and Test-Time Compute Through Speculative Drafts

Test-time compute is the driving force behind recent advances in large language model (LLM) capabilities. Increasing compute often comes at the expense of higher user-facing latency, directly impacting user experience. One bottleneck in LLM decoding is that autoregressive sampling generates tokens one at a time. Speculative decoding breaks this sequential dependency by using a small model to sample a draft (a block of tokens) and then having the large model score all tokens in the draft in parallel, reducing latency. A statistical verification procedure accepts a subset of the draft tokens and rejects the rest, guaranteeing that the final output follows the distribution of the large model and thus providing a lossless speedup. In this talk, we provide a principled understanding of speculative decoding through the lens of optimal transport. This new formulation enables us to design improved speculative decoding algorithms, including block verification and multi-draft speculative decoding, that maintain the strong lossless guarantee, and to establish their optimality.

Further, for test-time compute applications such as reward-guided multi-step reasoning, we relax the lossless acceleration requirement and design a reward-guided soft verification procedure. Empirical results show that our algorithm matches the accuracy of SOTA test-time scaling methods while reducing latency by up to 18%.
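The lossless verification step described in the abstract can be illustrated with a minimal sketch of standard token-level speculative sampling: each draft token x is accepted with probability min(1, p(x)/q(x)), where q is the small model's distribution and p the large model's; on the first rejection, a replacement token is drawn from the normalized residual max(0, p - q). This is the well-known baseline that the talk builds on, not the block verification or multi-draft algorithms presented in the talk. The toy "models" (hypothetical callables returning probability vectors over a tiny vocabulary) and the names `speculative_step` and `sample` are assumptions for illustration; a real system would score all draft positions with a single parallel forward pass of the large model.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical toy "model": maps a token context to a probability vector.
Model = Callable[[List[int]], Sequence[float]]


def sample(dist: Sequence[float], rng: random.Random) -> int:
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(prefix: List[int], draft_model: Model, target_model: Model,
                     gamma: int, rng: random.Random) -> List[int]:
    """One round of token-level speculative sampling (sketch, not the talk's method)."""
    # 1. The small (draft) model proposes gamma tokens autoregressively.
    draft, q_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_model(ctx)
        tok = sample(q, rng)
        draft.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The large (target) model scores all gamma + 1 positions.
    #    Written as a loop here; in practice this is one parallel forward pass.
    p_dists = [target_model(list(prefix) + draft[:i]) for i in range(gamma + 1)]

    # 3. Verify draft tokens left to right.
    out: List[int] = []
    for i, tok in enumerate(draft):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accept the draft token
        else:
            # Reject: resample from the normalized residual max(0, p - q),
            # which makes the emitted token exactly distributed as p.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            out.append(sample([r / total for r in residual], rng))
            return out

    # 4. All draft tokens accepted: emit one bonus token from the target model.
    out.append(sample(p_dists[gamma], rng))
    return out
```

Each call emits between 1 and gamma + 1 tokens, and the accept/resample rule keeps the output distributed exactly as the target model's samples would be. A quick empirical check with a fixed target p = [0.6, 0.3, 0.1] and a mismatched draft q = [0.2, 0.5, 0.3] should show the first emitted token following p rather than q.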

Bio

Ziteng Sun is a research scientist at Google Research focused on developing efficient and responsible algorithms for foundation models. His work includes topics such as efficient methods for large language models, language model alignment, and privacy-preserving machine learning. His research interests lie broadly in machine learning, algorithmic statistics, and information theory. He received his BS from Tsinghua University and his PhD from Cornell University, where he was advised by Professor Jayadev Acharya.