
How AI Training (Inputs) Works and Its Legality
Training an AI model requires massive amounts of data: texts, images, audio, and more. To train the model, copies of this data must be made at some point. Some of that data is free to use (public domain or otherwise not protected), but much of it is covered by copyright. Because reproduction is one of the exclusive rights of copyright holders, this raises legal concerns. For example, using a novel or scraping news articles to train an AI model may involve copyrighted material, and this has prompted numerous lawsuits. At face value, the argument is simple: AI companies collect online content, use it for training, and that copying could infringe copyright.
However, this oversimplifies how AI training actually works. Copying does not mean storing full works in their original form. First, companies assemble huge datasets from multiple sources (texts, images, audio recordings), in which each individual work represents only a tiny fraction of the whole. What matters is the collective data, not any single work. The data is then cleaned and processed: duplicates and errors are removed, and the remaining content is converted into a machine-readable format, often rendering the original works unrecognizable. During training, the AI learns statistical patterns (e.g., predicting “mat” after “The cat sat on the…”) by repeatedly adjusting its internal parameters. Importantly, the trained model does not store copies of the original data; it only internalizes patterns.
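To make the “patterns, not copies” idea concrete, here is a deliberately simplified sketch. It is not how modern neural models work (they adjust billions of numeric parameters rather than counting words), but the principle is analogous: the toy model below tallies which word tends to follow a given two-word context, then uses those tallies to predict the next word. The corpus sentences and function names are illustrative assumptions, not taken from any real training set.

```python
from collections import Counter, defaultdict

# Toy "training data" (illustrative only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat sat on the mat today",
]

# "Training": count how often each word follows each two-word context.
# Note that only aggregate counts are kept, not the sentences themselves.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b, c in zip(words, words[1:], words[2:]):
        follow_counts[(a, b)][c] += 1

def predict_next(context: str) -> str:
    """Return the continuation seen most often after the last two words."""
    a, b = context.split()[-2:]
    return follow_counts[(a, b)].most_common(1)[0][0]

# "mat" followed "on the" more often than any alternative in the corpus.
print(predict_next("the cat sat on the"))  # prints "mat"
```

The point of the sketch is that the prediction comes from statistical regularities, not from retrieving a stored copy of any one sentence; that distinction is central to the legal debate, even though copies are still made while the counts are being built.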
Even so, some copies of protected works are inevitably made during training, which forms the core legal question: is that copying allowed? In the United States, courts have offered mixed guidance. Some decisions suggest certain AI training may qualify as fair use, but using illegally obtained content makes a fair use argument difficult to sustain. In the European Union, regulations are still developing. The EU has introduced a voluntary Code of Practice for general-purpose AI models (those capable of performing many different tasks). Companies like Google, Anthropic, and OpenAI have signaled willingness to comply, while others, such as Meta, have not.
Two key points stand out in the EU approach: (1) an opt-out system, allowing creators to forbid the use of their works for AI training, and (2) a rule against circumventing technical protection measures. Both US and EU trends point toward a common principle: AI training may be allowed, but using pirated or illegally obtained content is not acceptable. The law is still evolving, lawsuits continue, and governments may consider additional measures, such as requiring AI companies to pay for using copyrighted data.
AI training sits at the crossroads of innovation and legality. While enormous datasets are essential for creating capable models, the act of copying even small portions of protected works raises real copyright questions. In practice, AI models learn patterns rather than store full copies, which complicates the legal picture, but does not remove the responsibility of respecting creators’ rights. Across the US and EU, courts and regulators are beginning to clarify the rules: some training may be permissible, but unlawful or pirated content is off-limits. As the legal landscape evolves, AI developers, providers, and users alike must navigate these complexities carefully, balancing technological progress with respect for intellectual property. Staying informed, compliant, and proactive will be key for the responsible development and deployment of AI in the years to come.
Inspired by “Revisiting Copyright Infringement in AI Inputs and Outputs” by Andres Guadamuz (July 30, 2025), this article reflects my goal of sharing insights on AI governance, responsible innovation, and the evolving tech landscape to help professionals stay informed and thoughtful about emerging AI challenges.
Article published on 30 March 2026 by Dilmurod Erkinov (Edu-LegalTech)