AI summary · 1 source · 3 days ago

Researchers tackle RLHF's problems with DPO, bandit learning, and multi-agent systems in high-stakes tasks

A new group of arXiv papers focuses on making reinforcement learning safer and more controllable in high-risk settings such as ventilator decision-making. The main problems are that RLHF and DPO are not always equivalent, that mode collapse makes agents stop exploring alternatives, and that the type of uncertainty involved (volatility vs. stochasticity) affects decision-making. The papers propose new approaches: contextual bandits for personalization, distribution matching to preserve diversity, and uncertainty-aware expert advice to balance learning against safety.
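For readers unfamiliar with the DPO objective mentioned above, here is a minimal sketch of the standard per-pair DPO loss. It is not code from any of the summarized papers; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The point is that the loss is defined relative to a fixed reference policy, which is one place where assumptions can enter.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All arguments are sequence log-probabilities (floats). The names and the
    default beta are hypothetical choices made for this illustration.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value only when the policy raises the chosen response relative to the rejected one, measured against the reference.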


Updates

DPO is not fully equivalent to RLHF: the equivalence depends on assumptions that often fail in practice
Mode collapse in on-policy RL makes agents stop exploring; DMPO uses distribution matching instead of reverse KL
Multi-agent systems with a human in the loop outperform end-to-end LLMs on high-stakes tasks such as medicine
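The link between reverse KL and mode collapse in the second update can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the DMPO objective: the four-action distributions are made up for the example, and forward KL stands in here for a generic mass-covering objective, since the summary does not describe how DMPO's distribution matching is defined.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive entries."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))

# Hypothetical bimodal target over four actions (illustrative numbers only).
target = [0.49, 0.01, 0.01, 0.49]
# Candidate policy that has collapsed onto one mode of the target.
collapsed = [0.97, 0.01, 0.01, 0.01]
# Candidate policy that spreads its mass across all actions.
spread = [0.25, 0.25, 0.25, 0.25]

# Reverse KL, KL(policy || target): the direction named in the update.
reverse_collapsed = kl(collapsed, target)
reverse_spread = kl(spread, target)

# Forward KL, KL(target || policy): used only as a mass-covering stand-in.
forward_collapsed = kl(target, collapsed)
forward_spread = kl(target, spread)

print("reverse KL prefers collapsed:", reverse_collapsed < reverse_spread)
print("forward KL prefers spread:", forward_spread < forward_collapsed)
```

In this toy case reverse KL scores the collapsed policy as the better fit even though it ignores one of the two modes, while the mass-covering direction penalizes it for doing so.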

Original sources · 7

All original source links are provided so you can read the full text and check the information yourself.

arXiv · cs.AI · 3 days ago

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

arXiv · cs.AI · May 25

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

arXiv · cs.AI · May 23

Implicit Safety Alignment from Crowd Preferences

arXiv · cs.AI · May 22

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

arXiv · cs.AI · May 20

Not all uncertainty is alike: volatility, stochasticity, and exploration