Hao (Peter) Yu
McGill, Mila, Tencent, SmartPrep | Montreal, QC, Canada | hao.yu2[at]mail.mcgill.ca

New York City, 2023.12
Hello, I’m Peter Yu, an MSc (Thesis) student at McGill University and Mila, supervised by Prof. David Ifeoluwa Adelani and working on multilingual language processing and low-resource languages. I also collaborate with Shiwei Tong at Tencent on RAG and diffusion models for time series. During my undergraduate studies, I was supervised by Prof. Reihaneh Rabbany on misinformation detection with RAG, and I continue to work on that line of research as a collaborator.
Currently, my focus is on advancing retrieval systems that adapt to human feedback. This research addresses critical challenges in current AI systems, specifically model staleness and knowledge conflicts, through unified knowledge embedding and preference-optimized knowledge distillation. Looking ahead, I aspire to develop AI systems that continuously learn and evolve by integrating human preferences and expertise, drawing inspiration from systems like Google Search, which leverages user engagement as a quality signal.
Furthermore, I aim to expand beyond textual knowledge to encompass action spaces and emotional speech, transitioning from learning from humans to augmenting human capabilities. Ultimately, my goal is to develop meaningful, useful, and industry-ready products that create lasting impact.
Actively seeking Ph.D./industry opportunities in AI/NLP/ML.
Poster and Slide
- RAG
- Evaluation of Retrieval-Augmented Generation: A Survey [Poster] CCF BigData 2024
- Web Retrieval Agents for Evidence-Based Misinformation Detection [Poster] COLM 2024
- Double Decomposition with Web-Augmented Verification for Misinformation Detection [Poster]
- Multilingual
- INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages Under Review, 2025
- Ensembling Enhances Effectiveness of Multilingual Small LMs [Oral] EMNLP 2024 MRL, Winning NER
- Weak Supervision
- SWEET - Weakly Supervised Person Name Extraction for Fighting Human Trafficking [Poster] EMNLP 2023
- Data Science
- How to Unlock Time Series Editing? A Diffusion-Driven Approach with Multi-Grained Control, Under Review, 2025
- Paper Share
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages [Slide, 2025 Winter]
- GlotLID: Language Identification for Low-Resource Languages [Slide, 2025 Winter]
- Constitutional AI and Collective Constitutional AI: Aligning a Language Model with Public Input (CCAI) [Slide, 2024 Fall]
🏸🏓⛰️📷
Resume: PDF
Motto: 脚踏实地 行稳致远 (Stay grounded and steady, and you will go far)