Foundations of Multimodal AI

by Human AI Research Division

Abstract

The emergence of general-purpose artificial intelligence systems capable of perceiving, reasoning, and acting across diverse real-world environments depends critically on three interlocking pillars: the collection of high-quality, multimodal data at scale; the disciplined application of supervised fine-tuning (SFT) to align model behavior with expert knowledge; and the iterative refinement of policy through reinforcement learning (RL). This paper examines each pillar in depth, analyzing how text corpora, audio recordings, image datasets, and video demonstrations collectively shape model representations, and how SFT and RL training regimes transform raw pretrained networks into capable, aligned agents.

We place particular emphasis on the emerging class of Vision-Language-Action (VLA) models, which must integrate perception, language understanding, and continuous motor control, and argue that the involvement of domain experts and human researchers at every stage of the data and training pipeline is not merely beneficial but constitutive of robust AI development. We further contend that organizations committed to rigorous data infrastructure and human-in-the-loop training methodologies are uniquely positioned to drive the next generation of embodied AI breakthroughs.

Keywords: multimodal data collection, supervised fine-tuning, reinforcement learning from human feedback, vision-language-action models, embodied AI, human expert annotation, data pipelines, large language models

1. Introduction

The trajectory of modern artificial intelligence research has been shaped by a consistent empirical finding: model capability scales with data quality, diversity, and volume [1]. From early convolutional networks trained on ImageNet [2] to contemporary transformer-based large language models (LLMs) pretrained on trillions of tokens [3], the data pipeline has remained the foundational determinant of generalization performance. Yet as AI systems are increasingly deployed in open-ended, physically grounded, and safety-critical settings, the challenge of data collection has grown substantially more complex.

Modern AI development is no longer confined to a single modality. Advances in multimodal learning have demonstrated that models trained jointly on text, images, audio, and video acquire richer, more transferable representations than those trained on any single modality alone [4]. Simultaneously, the rise of Vision-Language-Action (VLA) models—systems that must ground language instructions in visual perception and translate them into executable motor commands—has introduced new requirements for the kinds of behavioral data that must be collected and the training methodologies required to exploit it [5].

This paper provides a structured treatment of these interconnected challenges. Section 2 surveys the landscape of multimodal data collection across text, audio, image, and video modalities. Section 3 analyzes the role of supervised fine-tuning in shaping expert-aligned model behavior. Section 4 examines reinforcement learning frameworks, with emphasis on reinforcement learning from human feedback (RLHF) and its extensions. Section 5 addresses VLA models as a case study in the convergence of these methodologies. Section 6 argues for the indispensability of human expertise throughout the training pipeline. Section 7 discusses future directions, and Section 8 concludes.

2. Multimodal Data Collection: Scope, Scale, and Quality

2.1 Text Data

Text remains the most extensively studied modality in modern AI. The development of LLMs such as GPT-4 [6], LLaMA [7], and Claude [8] has been enabled by pretraining on diverse web-scale corpora including Common Crawl, Books3, Wikipedia, and curated domain-specific datasets. The quality filtering, deduplication, and toxicity mitigation applied during text curation are known to have significant downstream effects on model behavior [9]. Beyond raw pretraining data, instruction-following datasets—such as FLAN [10], OpenAssistant [11], and Dolly [12]—have demonstrated that relatively small amounts of carefully annotated task-oriented data can produce large improvements in zero-shot and few-shot performance.
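
To make the curation steps concrete, the sketch below shows a minimal exact-deduplication and heuristic quality filter for a text corpus. The thresholds, the SHA-256 hashing of normalized text, and the symbol-ratio heuristic are illustrative assumptions, not the settings of any published pipeline.

    import hashlib

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so trivial variants hash identically.
        return " ".join(text.lower().split())

    def quality_ok(text: str, min_words: int = 20, max_symbol_ratio: float = 0.3) -> bool:
        # Crude quality heuristics: enough words, not dominated by punctuation or markup.
        words = text.split()
        if len(words) < min_words:
            return False
        symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
        return symbols / max(len(text), 1) <= max_symbol_ratio

    def dedup_and_filter(docs):
        # Yield documents that pass the quality check and have not been seen before.
        seen = set()
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
            if digest in seen or not quality_ok(doc):
                continue
            seen.add(digest)
            yield doc

Production pipelines typically layer fuzzy (MinHash) deduplication and learned quality classifiers on top of simple heuristics like these.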

Critically, domain-specific text data—encompassing scientific literature, legal corpora, medical records, and engineering documentation—has become essential for deploying AI in specialized professional contexts. The curation of such data typically requires collaboration with subject-matter experts who can verify factual accuracy, flag outdated information, and identify distributional gaps that general-purpose corpora cannot address [13].

2.2 Audio Data

Audio data spans a rich spectrum of signals: natural speech, environmental acoustics, music, and physiological recordings. In the context of AI model development, speech data is particularly consequential. Large-scale speech corpora such as LibriSpeech [14], VoxPopuli [15], and Common Voice [16] have underpinned significant advances in automatic speech recognition (ASR), speaker diarization, and end-to-end spoken language understanding. Models such as Whisper [17] have demonstrated that training on diverse, multilingual audio datasets with weak supervision enables robust cross-lingual transfer.
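
As a concrete illustration of how such models enter data pipelines, the following sketch transcribes a recording with the open-source openai-whisper package; the model size and the file name "interview.wav" are assumptions for the example.

    import whisper

    model = whisper.load_model("base")           # small multilingual checkpoint
    result = model.transcribe("interview.wav")   # ASR with built-in language detection

    print(result["language"])                    # detected language code
    for seg in result["segments"]:               # time-aligned transcript segments
        print(f'[{seg["start"]:6.2f}-{seg["end"]:6.2f}] {seg["text"]}')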

For embodied AI systems, audio is not merely a communication channel but an environmental sensing modality. Acoustic cues can signal the state of physical processes, the presence of objects, and the actions of agents in a shared environment. Collecting temporally aligned audio-visual data in naturalistic settings—a technically demanding enterprise—is therefore increasingly prioritized in robotics research [18].

2.3 Image Data

Visual perception remains central to most real-world AI applications. ImageNet [2], COCO [19], and OpenImages [20] established the paradigm of large-scale supervised visual pretraining. The contrastive learning approach pioneered by CLIP [21] extended this paradigm to joint image-text representation spaces, enabling zero-shot visual classification and cross-modal retrieval at unprecedented scale.
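
The core of CLIP-style pretraining is a symmetric contrastive (InfoNCE) objective over a batch of matched image-text pairs. The PyTorch sketch below shows only that loss; the encoders that produce the embeddings, and the fixed temperature value, are placeholders.

    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, text_emb, temperature=0.07):
        # image_emb, text_emb: (N, d) embeddings for N matched image-text pairs.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
        targets = torch.arange(len(logits))               # pair i matches pair i
        loss_i = F.cross_entropy(logits, targets)         # image -> text direction
        loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
        return (loss_i + loss_t) / 2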

For VLA systems and robotics applications, image data must capture not only object appearances but spatial relationships, depth cues, part articulations, and action affordances. Egocentric datasets such as EPIC-Kitchens [22] and Ego4D [23], which capture human activities from a first-person perspective, have become particularly valuable for training robot manipulation policies, as the viewpoint geometry approximates that of a robot-mounted camera. High-quality annotation of semantic segmentation masks, 3D bounding boxes, and contact points requires specialized expert annotators with domain knowledge [24].

2.4 Video Data

Video data integrates temporal dynamics with visual perception, making it among the richest and most information-dense modalities. Large-scale video datasets such as Kinetics [25], Something-Something [26], and HowTo100M [27] have enabled significant progress in action recognition, temporal reasoning, and procedural understanding. For robot learning, demonstration videos provide implicit supervisory signals: a robot can learn manipulation skills by watching humans perform tasks, using techniques such as inverse dynamics modeling [28] or video-conditioned imitation learning [29].

The collection of task-specific robot demonstration data—teleoperated trajectories, kinesthetic teaching, or motion-captured human demonstrations—remains one of the primary bottlenecks in VLA development. Initiatives such as the Open X-Embodiment dataset [30], which aggregates robot demonstrations across 22 embodiments and over one million trajectories, represent a community-scale response to this challenge. Ensuring temporal consistency, annotation fidelity, and representational diversity across such datasets demands sustained human expert involvement at every stage of collection and curation.

Integrated AI Training Pipeline

Raw data (text / audio / image / video) → Expert annotation (curation & filtering) → Foundation model (pretraining) → Aligned behavior (SFT) → Policy refinement (RL / RLHF)

Figure 1. Schematic of the integrated AI training pipeline, from raw multimodal data collection through pretraining, supervised fine-tuning, and reinforcement learning-based alignment.

3. Supervised Fine-Tuning: Instilling Expert Knowledge

3.1 Principles and Practice

Supervised fine-tuning (SFT) refers to the process of updating a pretrained model's parameters on a curated dataset of input-output pairs that demonstrate desired behavior. Whereas pretraining exposes a model to broad distributional knowledge, SFT focuses the model's representations and generation tendencies on specific tasks, formats, or behavioral norms [31]. The InstructGPT work of Ouyang et al. [32] provided an influential demonstration that SFT on a relatively small set of human-written demonstrations could substantially improve instruction-following ability, even in comparison to much larger models that had not been fine-tuned.
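
Mechanically, SFT continues next-token training but computes the loss only on the demonstrated response. A minimal single-step sketch with Hugging Face Transformers follows; the gpt2 checkpoint, the toy example, and the learning rate are illustrative assumptions.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    prompt, response = "Translate to French: cheese =>", " fromage"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

    # Mask prompt positions with -100 so the cross-entropy loss is computed
    # only on response tokens, a common SFT convention.
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()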

The effectiveness of SFT is highly sensitive to data quality. Models fine-tuned on noisy or inconsistent demonstrations tend to inherit those inconsistencies, a phenomenon sometimes called "garbage in, garbage out" at the alignment level [33]. Conversely, datasets annotated by domain experts with clear rubrics and calibration procedures have been shown to produce significantly more reliable fine-tuned models, particularly on tasks requiring technical accuracy, nuanced judgment, or safety-critical decision-making [34].

3.2 SFT for Multimodal and Embodied Systems

In the context of multimodal models, SFT must operate across heterogeneous input spaces. Flamingo [35], LLaVA [36], and InstructBLIP [37] all employ variants of visual instruction tuning, wherein image-text pairs are assembled by human annotators or generated through semi-automated pipelines and used to fine-tune a pretrained multimodal backbone on instruction-following tasks. The construction of such datasets involves careful decisions about task diversity, prompt formatting, and the balance between generative and discriminative objectives.

For robotics and VLA systems, SFT takes the form of imitation learning (IL) or behavior cloning (BC), wherein a policy network is trained to reproduce expert demonstrations. The quality of the behavioral dataset directly determines the competence ceiling of the imitation-learned policy [38]. Limitations of standard BC—particularly its susceptibility to distributional shift and compounding errors—have motivated the development of augmented imitation learning methods such as DAgger [39] and its successors, which interleave expert correction with autonomous policy rollouts to produce more robust learned behaviors.
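
The sketch below outlines the DAgger loop [39]: the learner's own rollouts determine which states are visited, and the expert relabels those states with corrective actions. The env, expert, and train_bc interfaces are assumptions standing in for a concrete robotics stack.

    def dagger(env, expert, train_bc, policy, n_iters=10, horizon=200):
        dataset = []                                    # aggregated (state, action) pairs
        for _ in range(n_iters):
            state = env.reset()
            for _ in range(horizon):
                action = policy(state)                  # learner acts, inducing its own
                dataset.append((state, expert(state)))  # state distribution; expert labels it
                state, done = env.step(action)          # assumed (state, done) return
                if done:
                    break
            policy = train_bc(dataset)                  # behavior cloning on all data so far
        return policy

Because the training states come from the learner's distribution rather than the expert's, the compounding-error problem of vanilla behavior cloning is substantially mitigated.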

4. Reinforcement Learning: From Static Imitation to Dynamic Optimization

4.1 Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as one of the most impactful methodological innovations in post-pretraining alignment of LLMs [32, 40]. The canonical RLHF pipeline proceeds in three stages: (1) supervised fine-tuning of the base model on demonstration data; (2) training a reward model on human preference comparisons between model outputs; and (3) optimizing the SFT policy against the reward model using a policy gradient algorithm such as PPO [41], subject to a KL-divergence penalty to prevent excessive deviation from the SFT baseline.
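
Two small loss functions capture the heart of stages (2) and (3). The sketch below shows a Bradley-Terry reward-model loss over preference pairs and the KL-shaped reward handed to PPO; tensor shapes and the beta coefficient are illustrative assumptions.

    import torch.nn.functional as F

    def reward_model_loss(r_chosen, r_rejected):
        # Bradley-Terry objective: the human-preferred completion should
        # receive the higher scalar reward.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    def shaped_reward(reward, logp_policy, logp_sft, beta=0.02):
        # Reward optimized by PPO: reward-model score minus a KL penalty
        # that keeps the policy close to the SFT baseline.
        return reward - beta * (logp_policy - logp_sft)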

The reward model is the critical mediator between human judgment and model behavior. Its quality depends on the reliability and calibration of the human raters who supply preference labels. Research by Stiennon et al. [40] and Bai et al. [42] has demonstrated that RLHF-trained models consistently outperform SFT-only models on open-ended generation tasks when evaluated by independent human judges, with improvements that scale with the quality of the preference data.

4.2 Constitutional AI and Process-Based Reward Modeling

Constitutional AI (CAI), introduced by Anthropic [43], extends the RLHF framework by incorporating a set of principles into the reward signal, allowing the model to critique and revise its own outputs with reduced reliance on human feedback at each iteration. This approach has important implications for scalable oversight: as AI systems become capable enough to generate outputs that human raters cannot easily evaluate, process-based reward models that assess the reasoning chain rather than just the final output may be necessary to maintain alignment [44].

Direct Preference Optimization (DPO) [45] offers an alternative to explicit reward model training, reformulating the RLHF objective as a binary cross-entropy loss over preference pairs. DPO has been shown to be computationally efficient and achieves competitive performance on a range of alignment benchmarks, though it remains an active research question whether DPO or PPO-based RLHF offers superior generalization in complex, open-ended settings.
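
A minimal implementation of the DPO loss makes its simplicity apparent: no reward model or rollouts, only sequence-level log-probabilities under the policy and the frozen reference (SFT) model. The beta value below is an illustrative hyperparameter.

    import torch.nn.functional as F

    def dpo_loss(logp_pi_chosen, logp_pi_rejected,
                 logp_ref_chosen, logp_ref_rejected, beta=0.1):
        # The implicit reward of a response is beta * log(pi / ref); DPO applies
        # binary cross-entropy to the chosen-vs-rejected reward margin.
        chosen_margin = logp_pi_chosen - logp_ref_chosen
        rejected_margin = logp_pi_rejected - logp_ref_rejected
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()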

4.3 RL for Embodied Agents and Robotics

In robotics and embodied AI, RL has a longer history, with foundational contributions from model-free methods such as DDPG [46], SAC [47], and TD3 [48], as well as model-based approaches including MBPO [49] and Dreamer [50]. The primary challenge in applying RL to real-world robotics is sample efficiency: physical systems cannot undergo the millions of environment interactions that simulation-based training permits. Sim-to-real transfer techniques—domain randomization [51], system identification, and adaptive dynamics models—partially address this limitation but remain an active area of research.
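
Domain randomization is often implemented as a thin wrapper that resamples simulator parameters at every episode, as in the sketch below; the parameter names, ranges, and env interface are assumptions, and real systems randomize far more (textures, lighting, camera pose, sensor latency).

    import random

    class DomainRandomizedEnv:
        # Wraps a simulated environment and perturbs its physics on reset so
        # the learned policy cannot overfit to a single simulated world.
        def __init__(self, env):
            self.env = env

        def reset(self):
            self.env.set_friction(random.uniform(0.5, 1.5))      # assumed setters
            self.env.set_object_mass(random.uniform(0.8, 1.2))
            self.env.set_actuator_gain(random.uniform(0.9, 1.1))
            return self.env.reset()

        def step(self, action):
            return self.env.step(action)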

Recent work has explored hybrid frameworks that combine imitation learning with RL: human demonstrations initialize the policy, reducing the exploration burden, and RL fine-tuning then pushes performance beyond that of the demonstrators, a paradigm sometimes called "learning from demonstrations plus RL" [52]. This combination is particularly relevant for VLA systems, where diverse demonstration data can bootstrap capable initial policies that RL can then refine on harder, long-horizon tasks.

5. Vision-Language-Action Models: The Convergence Frontier

5.1 Architecture and Capabilities

Vision-Language-Action (VLA) models represent the frontier of embodied AI: systems that accept visual observations and natural language instructions as inputs and produce executable motor commands as outputs. The seminal RT-2 model [5], developed by Google DeepMind, demonstrated that a vision-language model pretrained on web-scale image-text data could be fine-tuned to perform robot manipulation tasks by tokenizing robot actions and treating action prediction as a next-token prediction problem. Subsequent models including π₀ (pi-zero) [53] and OpenVLA [54] have extended this paradigm, exploring different tokenization schemes, action head architectures, and training data mixtures.
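
The action-tokenization idea at the core of RT-2 can be sketched in a few lines: discretize each continuous action dimension into uniform bins so that a language model can emit motor commands as ordinary tokens. The bin count of 256 follows the published recipe; the action ranges and the example command are assumptions.

    import numpy as np

    N_BINS = 256

    def tokenize_action(action, low, high):
        # Map a continuous action vector to integer bin indices in [0, N_BINS).
        action = np.clip(action, low, high)
        frac = (action - low) / (high - low)
        return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

    def detokenize_action(tokens, low, high):
        # Recover bin centers as the executable continuous action.
        return low + (tokens + 0.5) / N_BINS * (high - low)

    # Example: a 7-DoF command (6D end-effector delta + gripper open/close).
    low, high = np.full(7, -1.0), np.full(7, 1.0)
    tokens = tokenize_action(np.array([0.1, -0.3, 0.0, 0.5, 0.2, -0.9, 1.0]), low, high)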

The architectural design of VLA models involves tradeoffs between representational capacity, inference latency, and action precision. Autoregressive action prediction, while benefiting from the full expressive power of transformer decoders, may introduce latency that is problematic for high-frequency control tasks. Diffusion-based action heads [55] offer an alternative that can produce smooth, continuous action distributions but require careful integration with the language backbone. These architectural choices have significant implications for the types of training data and fine-tuning procedures that are effective.

5.2 Data Requirements for VLA Training

VLA models impose uniquely demanding data requirements. Effective training requires: (a) large-scale internet pretraining data to develop broad visual-semantic representations; (b) robot demonstration datasets spanning diverse manipulation tasks, embodiments, and environments; (c) language-annotated trajectories that align instruction semantics with action sequences; and (d) preference or success/failure labels for reward model training or RL fine-tuning [30, 56]. No single existing dataset satisfies all of these requirements simultaneously, motivating the development of heterogeneous data mixtures and cross-embodiment training strategies.
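
In practice, these heterogeneous sources are combined through weighted mixture sampling, as in the sketch below; the source names and weights are illustrative placeholders keyed to requirements (a) through (d), not a published recipe.

    import random

    MIXTURE = {
        "web_image_text": 0.40,      # (a) broad visual-semantic pretraining data
        "robot_demos": 0.35,         # (b) cross-embodiment demonstrations
        "language_annotated": 0.20,  # (c) instruction-aligned trajectories
        "preference_labels": 0.05,   # (d) reward-model / RL fine-tuning data
    }

    def sample_source():
        # Draw the dataset from which to pull the next training batch.
        names, weights = zip(*MIXTURE.items())
        return random.choices(names, weights=weights, k=1)[0]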

Dataset             Modalities                        Scale              Primary use
Open X-Embodiment   Video, proprioception, language   1M+ trajectories   Cross-embodiment robot pretraining
LAION-5B            Image, text                       5.85B pairs        Vision-language pretraining
Ego4D               Video (egocentric), audio, text   3,670 hours        Embodied perception, action anticipation
BridgeData V2       Video, language                   60k trajectories   Language-conditioned manipulation
DROID               Video, language, proprioception   76k trajectories   Diverse robot manipulation

Table 1. Selected large-scale multimodal datasets relevant to VLA model training, illustrating the diversity of modalities, scales, and intended applications.

5.3 Training Methodology for VLAs

State-of-the-art VLA training pipelines typically follow a multi-stage procedure. In the first stage, a vision-language backbone is pretrained on web-scale image-text data using contrastive or generative objectives. In the second stage, the model undergoes robot-specific SFT on language-annotated demonstration datasets, learning to map visual observations and language instructions to action sequences. In the third stage, RL fine-tuning—using either real-world task completion signals, simulated reward functions, or preference labels from human evaluators—further refines the policy to handle distribution shifts and long-horizon task structures [53, 57].

The fidelity and coverage of the demonstration data used in stage two are the primary determinants of out-of-distribution generalization in VLA systems. Human experts—roboticists, occupational therapists, industrial technicians, and domain specialists—provide demonstrations that encode tacit knowledge about task structure, error recovery strategies, and safe manipulation practices that cannot easily be specified programmatically. The collection of such demonstrations is therefore a fundamentally human-centric enterprise.

6. The Indispensable Role of Human Experts and Researchers

6.1 Annotation, Curation, and Quality Control

At every stage of the AI training pipeline, human judgment is required to define evaluation criteria, identify data quality failures, and adjudicate ambiguous cases. The construction of preference datasets for RLHF, the design of rubrics for SFT annotation, and the specification of success criteria for RL reward functions all demand domain expertise that cannot be automated away [58]. Studies on annotator agreement have consistently found that annotation quality improves when annotators have relevant domain knowledge, are provided with detailed guidelines, and receive calibration training [59].

Beyond annotation, human experts contribute to data curation through active learning—selectively labeling the examples most likely to improve model performance—and through adversarial data collection, in which researchers deliberately probe model failures to construct targeted training examples that address identified weaknesses [60]. This human-in-the-loop approach to data collection is particularly valuable in safety-critical domains, where the cost of model failure is high and distributional coverage must be carefully controlled.
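
A common form of active learning is uncertainty sampling: score each unlabeled example by the model's predictive uncertainty and route the most uncertain cases to expert annotators. The sketch below uses predictive entropy as the acquisition score; the model.predict_proba interface and the labeling budget are assumptions.

    import numpy as np

    def entropy(probs):
        # Predictive entropy of a class-probability vector.
        probs = np.clip(probs, 1e-12, 1.0)
        return -np.sum(probs * np.log(probs))

    def select_for_labeling(model, unlabeled, budget=100):
        # Rank the pool by uncertainty and return the top `budget` examples.
        scores = [entropy(model.predict_proba(x)) for x in unlabeled]
        ranked = np.argsort(scores)[::-1]            # most uncertain first
        return [unlabeled[i] for i in ranked[:budget]]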

6.2 Researchers as Drivers of Methodological Innovation

The advances described in this paper—from transformer architectures to RLHF to VLA models—have been driven by sustained research investment and creative methodological innovation. The field of AI is characterized by rapid empirical progress in which architectural choices, training procedures, and data curation strategies interact in complex, often counterintuitive ways. Human researchers provide the hypothesis generation, experimental design, and interpretive frameworks necessary to navigate this complexity and to make principled progress rather than proceed by purely empirical exploration [61].

Interdisciplinary collaboration is increasingly essential. Progress in VLA systems has required integration of insights from natural language processing, computer vision, reinforcement learning, robotics, cognitive science, and human-computer interaction. Research teams that bring together diverse expertise—and that maintain close collaboration with practitioners who will ultimately deploy these systems—are better positioned to identify the right problems and to develop solutions that generalize beyond academic benchmarks [62].

6.3 Human Oversight and AI Safety

As AI systems become more capable and more autonomously deployed, human oversight takes on heightened importance. The scalable oversight research program [44] addresses the challenge of maintaining meaningful human control over AI systems whose outputs are increasingly difficult for individual humans to evaluate. Techniques such as debate [63], recursive reward modeling [64], and interpretability-based oversight [65] aim to keep human judgment in the loop even as AI capabilities advance. Human AI's commitment to expert-driven data collection and human-in-the-loop training is directly aligned with this broader safety research agenda.

7. Future Directions

Several research directions are poised to significantly advance the state of the art in multimodal AI and VLA systems. First, the development of more efficient data collection methodologies—including synthetic data generation via generative models [66], world models for robot learning [67], and semi-automated annotation pipelines—will be essential to scale training data without proportionally scaling human annotation costs. Second, improvements in sim-to-real transfer will reduce dependence on costly real-world demonstration collection, though the fundamental role of human expertise in defining task objectives and evaluating policy quality will remain [51].

Third, the integration of explicit world models into VLA architectures offers the prospect of systems capable of counterfactual reasoning, long-horizon planning, and graceful failure recovery—capabilities that current imitation-based approaches struggle to deliver reliably [68]. Fourth, the development of standardized evaluation benchmarks for VLA generalization—spanning diverse tasks, environments, embodiments, and language instruction styles—will be critical for measuring progress and identifying remaining capability gaps [69]. Human researchers will continue to play a central role in designing these benchmarks, ensuring they reflect the challenges of real-world deployment rather than the idiosyncrasies of laboratory settings.

8. Conclusion

The development of capable, safe, and generalizable AI systems—whether large language models, multimodal foundation models, or Vision-Language-Action systems—rests on a triad of foundational commitments: the collection of diverse, high-quality multimodal data; the application of principled supervised fine-tuning to align model behavior with expert knowledge; and the iterative optimization of model policies through reinforcement learning grounded in human preference and environmental feedback. None of these pillars stands alone. Data without fine-tuning is inert; fine-tuning without reinforcement learning is brittle; reinforcement learning without data is blind.

Human AI's approach to AI development reflects a deep understanding of this interdependence. By investing in expert-driven data collection infrastructure, rigorous annotation pipelines, and state-of-the-art training methodologies, and by building teams of researchers who combine technical depth with domain breadth, we are positioned to contribute meaningfully to the frontier of AI capability and to do so in a manner that keeps human expertise, human values, and human oversight at the center of the enterprise. We invite researchers, domain experts, and institutional partners who share this vision to engage with us in building the next generation of AI systems.

References

[1] Kaplan, J., et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.

[2] Deng, J., et al. (2009). ImageNet: A large-scale hierarchical image database. CVPR 2009.

[3] Touvron, H., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971.

[4] Zhu, D., et al. (2023). MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592.

[5] Brohan, A., et al. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv:2307.15818.

[6] OpenAI. (2023). GPT-4 technical report. arXiv:2303.08774.

[7] Touvron, H., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.

[8] Anthropic. (2024). The Claude 3 model family: Opus, Sonnet, Haiku. Anthropic Technical Report.

[9] Longpre, S., et al. (2023). A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, and toxicity. arXiv:2305.13169.

[10] Wei, J., et al. (2022). Finetuned language models are zero-shot learners. ICLR 2022.

[11] Köpf, A., et al. (2023). OpenAssistant conversations: Democratizing large language model alignment. NeurIPS 2023.

[12] Conover, M., et al. (2023). Free Dolly: Introducing the world's first truly open instruction-tuned LLM. Databricks Blog.

[13] Singhal, K., et al. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180.

[14] Panayotov, V., et al. (2015). LibriSpeech: An ASR corpus based on public domain audio books. ICASSP 2015.

[15] Wang, C., et al. (2021). VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. ACL 2021.

[16] Ardila, R., et al. (2020). Common Voice: A massively-multilingual speech corpus. LREC 2020.

[17] Radford, A., et al. (2023). Robust speech recognition via large-scale weak supervision. ICML 2023.

[18] Gao, R., et al. (2023). Physically grounded vision-language models for robotic manipulation. ICRA 2023.

[19] Lin, T.-Y., et al. (2014). Microsoft COCO: Common objects in context. ECCV 2014.

[20] Kuznetsova, A., et al. (2020). The Open Images Dataset V4. IJCV, 128(7), 1956–1981.

[21] Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021.

[22] Damen, D., et al. (2022). Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-Kitchens-100. IJCV, 130(1), 33–55.

[23] Grauman, K., et al. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. CVPR 2022.

[24] Shridhar, M., et al. (2020). ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. CVPR 2020.

[25] Carreira, J., & Zisserman, A. (2017). Quo Vadis, action recognition? A new model and the Kinetics dataset. CVPR 2017.

[26] Goyal, R., et al. (2017). The "Something Something" video database for learning and evaluating visual common sense. ICCV 2017.

[27] Miech, A., et al. (2019). HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. ICCV 2019.

[28] Baker, B., et al. (2022). Video pretraining (VPT): Learning to act by watching unlabeled online videos. NeurIPS 2022.

[29] Schmeckpeper, K., et al. (2020). Learning predictive models from observation and interaction. ECCV 2020.

[30] Open X-Embodiment Collaboration. (2023). Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv:2310.08864.

[31] Rafailov, R., et al. (2024). From r to Q*: Your language model is secretly a Q-function. arXiv:2404.12358.

[32] Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.

[33] Zhou, C., et al. (2023). LIMA: Less is more for alignment. NeurIPS 2023.

[34] Guo, D., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.

[35] Alayrac, J.-B., et al. (2022). Flamingo: A visual language model for few-shot learning. NeurIPS 2022.

[36] Liu, H., et al. (2024). LLaVA-1.5: Improved baselines with visual instruction tuning. CVPR 2024.

[37] Dai, W., et al. (2023). InstructBLIP: Towards general-purpose vision-language models with instruction tuning. NeurIPS 2023.

[38] Pomerleau, D. (1989). ALVINN: An autonomous land vehicle in a neural network. NeurIPS 1989.

[39] Ross, S., Gordon, G., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS 2011.

[40] Stiennon, N., et al. (2020). Learning to summarize with human feedback. NeurIPS 2020.

[41] Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.

[42] Bai, Y., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862.

[43] Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.

[44] Bowman, S., et al. (2022). Measuring progress on scalable oversight for large language models. arXiv:2211.03540.

[45] Rafailov, R., et al. (2023). Direct preference optimization: Your language model is secretly a reward model. NeurIPS 2023.

[46] Lillicrap, T., et al. (2016). Continuous control with deep reinforcement learning. ICLR 2016.

[47] Haarnoja, T., et al. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML 2018.

[48] Fujimoto, S., et al. (2018). Addressing function approximation error in actor-critic methods. ICML 2018.

[49] Janner, M., et al. (2019). When to trust your model: Model-based policy optimization. NeurIPS 2019.

[50] Hafner, D., et al. (2023). Mastering diverse domains through world models. arXiv:2301.04104.

[51] Tobin, J., et al. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. IROS 2017.

[52] Rajeswaran, A., et al. (2018). Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. RSS 2018.

[53] Black, K., et al. (2024). π₀: A vision-language-action flow model for general robot control. arXiv:2410.24164.

[54] Kim, M., et al. (2024). OpenVLA: An open-source vision-language-action model. arXiv:2406.09246.

[55] Chi, C., et al. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. RSS 2023.

[56] Walke, H., et al. (2023). BridgeData V2: A dataset for robot learning at scale. CoRL 2023.

[57] Hejna, J., et al. (2023). Inverse preference learning: Preference-based RL without a reward function. NeurIPS 2023.

[58] Ziegler, D., et al. (2019). Fine-tuning language models from human preferences. arXiv:1909.08593.

[59] Gordon, M., et al. (2021). The disagreement deconvolved: Bringing machine learning performance metrics in line with reality. ACM FAccT 2021.

[60] Zhu, Y., et al. (2020). Robosuite: A modular simulation framework and benchmark for robot learning. arXiv:2009.12293.

[61] Sutton, R., & Barto, A. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

[62] Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258.

[63] Irving, G., et al. (2018). AI safety via debate. arXiv:1805.00899.

[64] Leike, J., et al. (2018). Scalable agent alignment via reward modeling. arXiv:1811.07871.

[65] Conmy, A., et al. (2023). Towards automated circuit discovery for mechanistic interpretability. NeurIPS 2023.

[66] Yu, T., et al. (2023). Scaling robot learning with semantically imagined experience. RSS 2023.

[67] Micheli, V., et al. (2023). Transformers are sample-efficient world models. ICLR 2023.

[68] Ha, D., & Schmidhuber, J. (2018). World models. arXiv:1803.10122.

[69] Huang, W., et al. (2022). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. ICML 2022.