Introduction
The rise of generative artificial intelligence (AI) has sparked a pivotal legal debate that now stands on the edge of a major turning point. At the heart of the issue is a fundamental question: can the use of copyrighted works to train generative AI models be considered fair use, or does it constitute infringement? For years, developers have trained these models on vast datasets that include copyrighted materials, largely under the assumption that such use falls within fair use doctrine. Moreover, many have argued that impeding free access to these materials would be catastrophic to technological progress and place the U.S. at a disadvantage on the global stage. Copyright holders and creators, on the other hand, have voiced their concern that their own works are being used without their permission to produce content with which they must compete, placing their livelihoods at risk.
In May 2025, the United States Copyright Office (USCO) released Part III of its three-part report ("the Report") on copyright and AI, offering a detailed analysis of this contested issue. The Report outlines arguments presented by commenters on both sides, addresses the current state of the law, and provides the USCO's official recommendations to guide lawmakers, law practitioners, and businesses. In this newsletter, we unpack the key takeaways from the Report and discuss their potential impact on copyright holders, AI developers, and the legal landscape at large.
Basics of Generative AI
Generative AI refers to a category of machine learning (ML) models that are trained to generate media (e.g., text, images, audio) based on user requests. These models attempt to map an input (such as a user's prompt or text string) to an output, the requested media. For text-based models, the standard technique is "next token prediction," which involves predicting the next word, word fragment, or other data element (i.e., a "token") based on the preceding context. To "learn" this mapping effectively, generative AI models are trained to perform next token prediction on curated media that is representative of the expected inputs and the desired outputs.
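For readers who want a concrete picture of next token prediction, the following is a deliberately simplified sketch. It assumes a toy model that only counts which word follows which in a three-sentence training corpus; production systems instead use large neural networks and subword tokens. The function names and the corpus are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def predict_next_token(counts, context):
    """Return the most frequent follower of the last context token, or None."""
    tokens = context.lower().split()
    if not tokens or tokens[-1] not in counts:
        return None
    return counts[tokens[-1]].most_common(1)[0][0]


def generate(counts, prompt, max_new_tokens=5):
    """Extend the prompt one predicted token at a time."""
    tokens = prompt.lower().split()
    for _ in range(max_new_tokens):
        next_token = predict_next_token(counts, " ".join(tokens))
        if next_token is None:
            break  # the model never saw this token followed by anything
        tokens.append(next_token)
    return " ".join(tokens)


# Hypothetical training data: the model can only echo patterns found here.
corpus = [
    "courts weigh the four factors",
    "courts weigh the four factors together",
    "judges weigh the evidence",
]
model = train_bigram_model(corpus)

print(predict_next_token(model, "courts weigh"))   # the
print(generate(model, "courts", max_new_tokens=4))  # courts weigh the four factors
```

Even at this scale, the generated text is assembled entirely from patterns in the training sentences, which is why the composition of a training dataset matters to the copyright questions discussed below.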
State-of-the-art generative AI models are trained on enormous amounts of data in order to provide accurate responses to diverse inputs. For example, many of the most popular generative AI models (e.g., OpenAI's ChatGPT, Meta AI's LLaMA, Anthropic's Claude) are trained on datasets that include Common Crawl, a corpus of over 250 billion web pages collected from the publicly accessible internet. However, the quality of the training data may be even more important than the volume. Consequently, copyrighted materials also provide an essential contribution to training datasets because they are among the highest quality, human-created content available. Yet by including copyrighted materials in the training data, generative AI models may learn to summarize their content, emulate the styles of popular authors, and even directly reproduce certain elements from the original sources.
Copyright & Fair Use
Under 17 U.S.C. § 106, copyright owners are afforded the exclusive rights to reproduce, distribute, publicly perform, publicly display, and prepare derivative works based on their original creations. Generative AI models may implicate these rights, both in the creation of training datasets (which may involve reproduction and distribution of copyrighted materials) and in the deployment of the models (when the generated outputs may reflect elements of the original works). However, a use that would otherwise infringe these exclusive rights does not require the copyright holder's permission if it qualifies as "fair use" under Section 107 of the Copyright Act.
Determining whether a particular use constitutes fair use involves consideration of the four fair-use factors: (1) the purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and (4) the effect of the use upon the potential market for or value of the copyrighted work. Thus, the legality of using copyrighted material to train and deploy generative AI models depends on how the specific uses are viewed in the context of each of the above four factors.
First factor: the purpose and character of the use
The first factor considers the purpose and character of the use and examines whether the allegedly infringing use serves a different function from that of the original work. In Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023), the Supreme Court emphasized that the first factor "asks 'whether and to what extent' the use at issue has a purpose or character different from the original" (citing Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994)). A use that is transformative, meaning that it adds something new, tends to be indicative of fair use, while uses that are purely commercial or closely mimic the original may not be. Whether the copyrighted work was accessed lawfully may influence the analysis of the first factor as well.
In the Report, the USCO presents diverging viewpoints from the public regarding the first factor. Some argue that the use of copyrighted materials by generative AI models is "highly-transformative" because AI models extract statistical patterns and computational abstractions from the training data that are far removed from the original works. Furthermore, they argue that the use of such materials is necessary for the development of AI technology. Opponents, on the other hand, characterize the training of AI models as functionally equivalent to other forms of data compression, which simply reformat the same information without adding new meaning or purpose. In addition, they suggest that the deployment of the AI model should be considered alongside its training; in that light, output generated by AI models may serve the same purpose as the original work.
For its part, the USCO acknowledges that training a generative AI model on a large and diverse dataset "will often be transformative." However, it cautions that generative AI models may also be used to generate outputs that closely resemble the original works on which they were trained. The USCO further notes that certain techniques used by generative AI models, such as "retrieval-augmented generation," pull information directly from select sources rather than from the entire body of training data. These techniques are unlikely to be transformative, as they generally share the same purpose as the original work. As a policy response, the USCO recommends that developers implement "deployment guardrails" to restrict outputs that replicate elements of copyrighted works.
Second factor: the nature of the copyrighted work
The second factor considers the nature of the copyrighted work, recognizing that not all works represent the same type of information. For example, some works are expressive and creative, such as cinematic films and novels, while others are factual or functional, such as news reports and computer code. In addition, factual information can only be represented in a limited number of ways. The USCO cites Harper & Row, Publishers, Inc. v. Nation Enters., 471 U.S. 539, 563 (1985), where it was stated that "[t]he law generally recognizes a greater need to disseminate factual works than works of fiction or fantasy." Thus, using copyrighted works that are factual or functional is more likely to favor a finding of fair use, whereas using expressive works weighs against it.
The USCO applies this principle to the context of generative AI, noting that AI training datasets include a wide range of sources, including both expressive and non-expressive content. Accordingly, a determination of fair use under the second factor will vary depending on the facts at hand and the exact contents of the training data.
Third factor: the amount and substantiality of the portion used in relation to the copyrighted work as a whole
The third factor examines "the amount and substantiality of the portion used in relation to the copyrighted work as a whole" and asks whether the quantity taken is "reasonable in relation to the purpose of the copying" (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 586 (1994)). The third factor is naturally related to the first factor because the amount deemed reasonable to copy often depends on the purpose of the use. It also informs the fourth factor because copying a larger fraction of the work may increase the likelihood of harming the market for the original, in particular when the use makes the copyrighted work available to the public, in whole or in part.
Training generative AI models typically involves copying works in their entirety to expose the model to all the expressive elements and the structure or sequence in which those elements appear. Proponents of fair use argue that this practice is justified because exposure to full works enables the model to capture artistic patterns essential for effective learning. On this view, the use of the full work is reasonable in light of the purpose. Critics contend that this reasoning risks inverting well-established fair use principles, where copying less typically weighs in favor of fair use. They argue that accepting the necessity of copying entire works for AI training may create an incentive that is adverse to creators and encourages developers to take more by claiming that doing so better serves a functional purpose.
Currently, there is disagreement regarding the extent to which copyrighted material used during training can be reproduced and made available to the public by generative AI models. Some rightsholders contend that these models can memorize and regurgitate protected content (as alleged by Disney and Universal in a recent lawsuit against AI developer Midjourney). However, the degree of effort required to elicit such "memorized" content remains unclear, raising questions about whether such reproduction is incidental, systematic, or readily accessible by average users.
In its analysis, the USCO acknowledges that using complete copies may be necessary to train generative AI models in some specialized, domain-specific applications. More broadly, a substantially transformative use may weigh in favor of fair use, regardless of the application, especially when the model's outputs do not make protected aspects of the original works publicly available. With respect to training general-purpose models, the USCO questions whether the copying of full works is justified, given that any individual work forms only a negligible fraction of the entire set of training data. Again, to mitigate the risk of unauthorized reproduction, the USCO encourages developers to introduce preventative "guardrails" such as input filters to block prompts likely to elicit protected content and output restrictions that prevent the dissemination of material resembling copyrighted works.
Fourth factor: the effect of the use upon the potential market for or value of the copyrighted work
The fourth factor considers the "effect of the use upon the potential market for or value of the copyrighted work." In ongoing and future cases involving generative AI, this factor is likely to be a focal point of judicial analysis. In its Report, the USCO highlights that the Supreme Court has described the fourth factor as "undoubtedly the single most important element of fair use" (Harper & Row, Publishers, Inc. v. Nation Enters., 471 U.S. 539, 566 (1985)). Its central importance is reflected in broader case law: in an empirical analysis of 579 fair use opinions from 453 cases spanning 1978-2019, the fourth factor was found to "dominate the test."
The USCO identifies several categories where outputs by generative AI models could negatively impact the market for copyrighted works, including lost sales, market dilution, and lost licensing opportunities. At the same time, some proponents and developers of generative AI argue that any potential threats to the market should be weighed against the broader public benefits of innovation and creation that may result from the use of copyrighted material in AI training.
The first potential harm is the loss of sales to AI-generated content that serves as a substitute in the marketplace. Critics argue that generative AI's ability to produce "word-for-word" copies of original content poses a serious threat to the market, in particular when models use retrieval-augmented generation to draw from a narrow set of source materials (e.g., a handful of news articles, or a single book) to produce an output. As the USCO explains, quoting Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), "[a] user for whom the augmented response 'satisfies the ... need' for the original work will not pay to obtain it in the marketplace." However, a key point of contention is whether such AI outputs truly replace individual copyrighted works or simply introduce new competition.
More specifically, proponents of generative AI argue that the majority of outputs are not reproductions, but rather new works that are merely "of the same type" as the originals. Because these outputs do not directly replicate protected expression, they contend that such outputs do not function as market substitutes and therefore fall outside the scope of the fourth factor. Nevertheless, the USCO explains that even if outputs from generative AI are not direct substitutes, they may still adversely affect the value of the original by diluting the market.
While acknowledging that market competition is a legitimate value, the USCO emphasizes that market dilution, identified as the second potential harm to the market, is still a valid harm under the fourth factor analysis. The Report cautions that "[t]he speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data... If thousands of AI-generated romance novels are put on the market, fewer of the human-authored romance novels that the AI was trained on are likely to be sold." This not only reduces the visibility of the human-created works but also dilutes royalty pools. The Report cites several examples of this trend already occurring, including in fiction literature and music.
The third potential harm to the market identified by the USCO is the loss of actual or potential licensing opportunities due to the unlicensed use of copyrighted materials. For such a harm to occur, a licensing market must already exist or be reasonably likely to develop. In recent years, these opportunities have begun to emerge in certain sectors. Several multibillion-dollar companies have entered licensing agreements specifically to secure training data for generative AI models. These include OpenAI's deals with the Associated Press and Shutterstock, Getty Images' partnerships with Nvidia and Bria, and vAIsual's licensing arrangement with Rightsify. Most recently, the New York Times signed a licensing deal with Amazon to provide editorial content as training data, a move that underscores the core issue in its ongoing infringement lawsuit against OpenAI. Despite these high-profile deals, the Report notes that many content sectors still lack viable licensing infrastructure, and that even where licensing exists, it is often accessible to only the most well-resourced companies. Moreover, much of the training data is scraped from publicly accessible sources and authored by individuals who are difficult to identify or contact. Ultimately, the USCO concludes that the fourth factor is flexible enough to account for this spectrum of circumstances: where licensing is viable but ignored, the fourth factor will weigh against fair use; where no meaningful licensing market exists, "there is no functioning market to harm," and the fourth factor may tilt in favor of fair use.
Despite the potential harms to the market outlined above, some view generative AI as providing unique public benefits that may outweigh those harms. OpenAI has suggested that generative AI will "augment human capabilities, thereby fostering human creativity," while Meta has argued that its models "bring innovative and, in some cases, potentially life-saving services and technologies to market." Whether such benefits can be realized using only datasets composed of licensed or public-domain material remains an open question, although several examples of such models exist (e.g., Adobe's Firefly image generator, Boomy's music generator, Getty Images' image generator, and Stability AI's Stable Audio music generator). The USCO does not dispute that many public benefits are made available by generative AI. Nonetheless, it concludes that these benefits do not change the balance of the fair use analysis outside of the four traditional fair use factors.
Weighing the factors
The USCO explains that the relative weight of each of the four fair use factors "will depend on the facts and circumstances of the particular case." There is no formula or surefire method for determining how the factors will apply in court. Nevertheless, the first and fourth factors are emerging as particularly influential in the context of generative AI.
With respect to the first factor, certain uses of generative AI appear plainly transformative. For example, U.S. District Judge William Alsup of San Francisco recently ruled in favor of fair use, describing Anthropic's use of copyrighted materials to train its generative AI model as "quintessentially transformative" (at the same time, ruling that the case must proceed to trial to determine whether it was fair to use pirated materials during training). As for the fourth factor, the USCO suggests that generative AI may have an "unprecedented" impact on markets for copyrighted material. District Judge Vince Chhabria, in a recent ruling in favor of Meta, likewise recognized the transformative nature of the model but cautioned that a stronger showing on market harm would have tipped the analysis. "It seems likely," he wrote, "that market dilution will often cause plaintiffs to decisively win the fourth factor—and thus win the fair use question overall—in cases like this." The plaintiffs, however, "barely give this issue [of market dilution] lip service," and thus the first factor was decisive in the outcome.
As more fair use cases involving generative AI proceed through the courts, a clearer picture will likely materialize regarding how the factors interact and which ones tend to carry the greatest weight in particular contexts.
Conclusion
While the doctrine of fair use as a statutory but flexible judicial framework is unique to the U.S., the country is not alone in navigating legal challenges posed by generative AI. Several jurisdictions, including European Union member states, Japan, and Singapore, have other copyright exceptions that predate generative AI but may yet be applicable, such as those involving "text and data mining." Still, tension persists as case law abroad remains sparse and interpretations diverge. The United Kingdom has a similar exception but has yet to formally clarify its applicability to generative AI. The USCO's Report surveys additional international perspectives, including Israel's Ministry of Justice, which suggested that most instances of training AI models will be fair use, and ongoing debates within Korea and China. Other nations, such as Brazil and Spain, are considering statutory frameworks that would permit AI training in exchange for various forms of compensation to rightsholders. Taken together, these international efforts reveal no consensus on whether training AI models on copyrighted materials should be considered lawful. In any case, as global decisions emerge, they may inform and influence how U.S. copyright law evolves.
Stepping back, it is clear that the fair use analysis for training and deploying generative AI models involves a range of complex and unsettled issues. Yet it is essential that courts, creators, and businesses remain well-informed, as the outcomes of these legal determinations will carry significant implications for innovation, artistic markets, and commercial operations. Where infringement is not found, creators and rightsholders may need to explore new strategies to avoid the potential market harms described above. Conversely, where infringement is found, businesses will be forced to respond by seeking licensing arrangements with rightsholders in order to operate lawfully within the relevant jurisdictions. Some commentators have warned that this could lead to anti-competitive outcomes and "entrench power in the largest and best-resourced companies and content owners." The USCO disagrees, noting that "[l]icensing will always be easier for those with deeper pockets." Moreover, there are existing antitrust laws that may be relied on where anticompetitive behavior is in question. Accordingly, the Report argues, the mere difficulty of obtaining a licensing arrangement where one is available should not alter the fair use analysis.
The Report advises that American leadership will be best furthered by supporting both innovation in generative AI and the protection of human-authored works. To conclude, the USCO observes the following:
Various uses of copyrighted works in AI training are likely to be transformative. The extent to which they are fair, however, will depend on what works were used, from what source, for what purpose, and with what controls on the outputs—all of which can affect the market. When a model is deployed for purposes such as analysis or research—the types of uses that are critical to international competitiveness—the outputs are unlikely to substitute for expressive works used in training.
But making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.
At present, the Office continues, government intervention to mandate licensing schemes or statutorily authorize training AI models with copyrighted materials is unnecessary. Instead, it maintains that the existing fair use framework is sufficient to address these issues and that courts should continue to evaluate cases under the traditional four factors. Licensing markets, it suggests, should be free to develop organically in response to legal and commercial pressures.
It is important to emphasize that the Report offers guidance and opinion to the public but does not carry the force of law or bind the courts. However, case law in this area is poised to develop rapidly, with many fair use cases unfolding in real time. Consequently, we will continue to monitor developments closely and communicate significant outcomes as they take form.
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.