DS190: Improving Ego Vehicle Performance with GRPO

In this project, I addressed the limitations of standard imitation learning in autonomous driving by applying Group Relative Policy Optimization (GRPO) to the Wayformer architecture. While data-driven planners excel at mimicking human behavior, they often struggle with “long-tail” scenarios (rare, high-stakes events) and do not strictly enforce safety constraints such as collision avoidance.

This project bridges that gap by integrating realistic imitation priors with rigorous reinforcement learning safety checks:

Wayformer with Temporal Gaussian Decoder: I used the Wayformer architecture to encode heterogeneous scene context (road graph, traffic lights, agent interactions). To improve temporal coherence, I replaced the standard linear regression head with a Temporal Gaussian Decoder (TGD). The TGD explicitly models dependencies between future time steps, so that generated trajectories are more dynamically realistic.
Group Relative Policy Optimization (GRPO): Standard RL often requires a computationally expensive critic network. I adapted GRPO, a critic-free algorithm, to fine-tune the model’s decision-making. GRPO samples a group of trajectory proposals, scores each against a hard-constraint reward function that penalizes collisions, and updates the model to prefer the proposals that score above the group average. This pushes the model to prioritize safety even when expert demonstrations are ambiguous or risky.
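The group-relative scoring step can be illustrated with a minimal, self-contained sketch. It is not the project's actual implementation: the reward shape (negative average displacement from the expert plus a fixed collision penalty), the `COLLISION_PENALTY` value, the `safety_radius`, and the use of static obstacle points instead of time-indexed neighbor trajectories are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

# Hypothetical constant: the real penalty magnitude is not given in this write-up.
COLLISION_PENALTY = -10.0


def reward(traj, expert, obstacles, safety_radius=1.0):
    """Imitation term (negative average displacement error) plus a hard
    collision penalty. Obstacles are static 2D points here for simplicity."""
    ade = mean(math.dist(p, q) for p, q in zip(traj, expert))
    collided = any(
        math.dist(p, o) < safety_radius for p in traj for o in obstacles
    )
    return -ade + (COLLISION_PENALTY if collided else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Critic-free advantage: normalize each reward by the mean and
    standard deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy scene: the expert path passes within the safety radius of a neighbor.
expert = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
obstacles = [(2.0, 0.5)]

# A "group" of three sampled trajectory proposals for the same scene.
group = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],    # copies the expert, collides
    [(0.0, 0.0), (1.0, -1.0), (2.0, -2.0)],  # deviates, stays clear
    [(0.0, 0.0), (1.0, -0.2), (2.0, -0.4)],  # deviates slightly, still collides
]

rewards = [reward(t, expert, obstacles) for t in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:7.2f}  advantage={a:6.2f}")
```

In this toy scene only the proposal that stays clear of the neighbor receives a positive advantage, even though it is the furthest from the expert path. In GRPO as described by Shao et al., these advantages then weight a clipped probability-ratio objective with a KL penalty toward a reference policy; that update step is omitted here.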

This approach successfully sharpens decision confidence and reduces collision rates in complex scenarios where pure imitation learning fails.

Example Results

Decision Sharpening: The baseline model (top) exhibits high uncertainty, splitting probability across conflicting maneuvers. The GRPO-tuned model (bottom) produces a "peaked" distribution, confidently selecting the optimal path.

Safety vs. Imitation: In this scenario, the baseline (top) mimics an expert path that dangerously heads toward a neighbor. GRPO (bottom) suppresses this dangerous mode, deviating from the imitation target to enforce a hard safety constraint.

Resources

📂 GitHub Repository
📑 Full Report (PDF)

References

Nayakanti, N., et al. (2023). Wayformer: Motion forecasting via simple efficient attention networks. IEEE International Conference on Robotics and Automation (ICRA).
Shao, Z., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
Ettinger, S., et al. (2021). Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. IEEE/CVF International Conference on Computer Vision (ICCV).