Hello, welcome


I’m a Ph.D. student in Computer Science at ETH Zurich, advised by Prof. Niao He. Before that, I was a Ph.D. student in Computer Science at the University of Illinois Urbana–Champaign (UIUC), advised by Prof. Nan Jiang. I completed my B.E. in Computer Science at Beihang University.

My research primarily focuses on reinforcement learning (RL) and, more broadly, sequential decision-making under uncertainty. I work on understanding the fundamental mathematical principles underlying these problems and leveraging theoretical insights to develop efficient and practical algorithms. I’m particularly interested in bridging the gap between theory and practice: designing algorithms that come with theoretical guarantees and demonstrate strong empirical performance.

My previous research spans a broad spectrum of topics, including:

  • Reinforcement Learning from Human Feedback (RLHF): Developing methods to align AI with human preferences.
  • Multi-Agent Reinforcement Learning (MARL): Understanding learning efficiency in multi-agent systems.
  • Offline Reinforcement Learning: Advancing learning algorithms in the offline setting.

Contacts: Google Scholar   |   LinkedIn   |   GitHub   |   jiawei.huang [at] inf [dot] ethz [dot] ch



Research Highlights


Reinforcement Learning from Human Feedback

Sample efficiency is crucial in online RLHF. While previous works focus on strategic exploration for sample-efficient learning, we study the benefits of transfer learning from imperfect reward models.
  1. Preprint
    Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
    Jiawei Huang, Bingcong Li, Christoph Dann, and Niao He
    Preprint 2025

MARL and Game Theory

Learning equilibrium policies in large-population systems is challenging in general. Our ICML 2024 paper studies a class of large-population games called Mean-Field Games (MFGs). Thanks to their special symmetric structure, we show that learning in MFGs is actually not much harder than single-agent RL.
  1. ICML 2024
    Model-Based RL for Mean-Field Games is not Statistically Harder than Single-Agent RL
    Jiawei Huang, Niao He, and Andreas Krause
    International Conference on Machine Learning, 2024
Agents' learning dynamics do not always lead to desirable outcomes. Our ICLR 2025 paper studies the steering setup, where agents' learning dynamics can be influenced by external steering rewards (e.g., financial subsidies from a government). We explore how to design these rewards to efficiently guide agents toward desired policies.
  1. ICLR 2025
    Learning to Steer Markovian Agents under Model Uncertainty
    Jiawei Huang, Vinzenz Thoma, Zebang Shen, Heinrich H. Nax, and Niao He
    International Conference on Learning Representations, 2025

Others

Early in my Ph.D., I explored various topics in single-agent online/offline RL. Motivated by practical constraints on policy switching, our ICLR 2022 paper introduces the deployment-efficient setup and develops efficient algorithms that match our established lower bounds.
  1. ICLR 2022 (Spotlight)
    Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality
    Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, and Tie-Yan Liu
    International Conference on Learning Representations, 2022