Generative artificial intelligence applications have become increasingly popular, raising critical questions about the permissibility of text and data mining for training large language models. AI training requires vast amounts of data, often including copyright-protected works from various rightholders, which raises concerns about unauthorised reproduction and potential copyright infringement.
In 2023, Finland implemented Section 13 b of the Copyright Act (404/1961), transposing the text and data mining provisions of the DSM Directive ((EU) 2019/790). The exception permits text and data mining when (i) the miner has lawful access to the work and (ii) the right of reproduction has not been expressly and appropriately reserved by the rightholder. "Lawful access" refers to access granted through an open access policy, contractual arrangements, or content freely available online.
Rightholders can implement opt-out mechanisms through machine-readable means, such as metadata or website terms and conditions. The exception applies only where copies are made for text and data mining purposes. Text and data mining is defined in the DSM Directive as "any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations".
This post examines recent developments in EU case law addressing text and data mining for AI training.
CJEU Preliminary Ruling Pending
The most significant recent development is the request for a preliminary ruling lodged on 3 April 2025 with the Court of Justice of the European Union in Case C-250/25, Like Company v Google Ireland Limited. This case directly addresses key legal questions surrounding AI training and copyright.
A Hungarian court asks whether displaying, in the responses of an LLM-based chatbot, text that is partially identical to the content of press publishers' web pages constitutes communication to the public under Article 15 of the DSM Directive, where the text is long enough to attract protection. The court further queries whether it is legally relevant that the responses result from word prediction based on observed patterns.
The court also seeks clarification on whether training an LLM-based chatbot constitutes reproduction under Article 15(1) of the DSM Directive and Article 2 of Directive 2001/29, where the LLM is built on pattern observation and matching that enable the model to recognise linguistic patterns. If training constitutes reproduction, the court asks whether this reproduction of lawfully accessible works falls within the text and data mining exception in Article 4 of the DSM Directive.
Regarding AI outputs, the court asks whether Article 15(1) of the DSM Directive and Article 2 of Directive 2001/29 mean that, where a user instructs an LLM-based chatbot in a manner that matches or refers to text contained in a press publication, and the chatbot generates a response displaying part or all of that content, this constitutes reproduction on the part of the chatbot service provider.
These questions are crucial because they address separate liability issues: even if training is permitted under the text and data mining exception, generating outputs that reproduce or communicate copyrighted material may constitute infringement. The Court must also determine whether the technical nature of LLM response generation through word prediction has legal significance.
This case represents the first time the CJEU will directly address whether LLM training falls within the text and data mining exception. The Court's answers will be binding across all EU Member States, including Finland, and will clarify the legal landscape that has remained uncertain since the DSM Directive's implementation.
National Court Insights
Whilst awaiting the CJEU's ruling, recent Dutch and German decisions provide preliminary guidance, though with limited jurisdictional scope.
The Dutch Approach: Strict Opt-Out Requirements
In DPG Media et al v HowardsHome, decided by the Amsterdam District Court (30 October 2024), the text and data mining exception was successfully invoked. The court ruled that the plaintiffs had not expressly and appropriately reserved their rights against text and data mining on their websites. The reservation targeted only "big AI bots", so it did not cover the defendant's bot and the opt-out was ineffective against it.
The judgment establishes that opt-out mechanisms must explicitly identify their targets. Implied or generic reservations are insufficient; a clear and express reservation is required.
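The targeting problem can be illustrated with a minimal sketch, assuming the reservation is expressed in a robots.txt file. That is an assumption made for illustration only: this post does not describe the mechanism actually used in the case, and whether robots.txt amounts to an express and appropriate reservation is a legal question. The bot names and the URL are hypothetical. The sketch uses Python's standard `urllib.robotparser` module to show that a rule naming one crawler does not reach a crawler it does not name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: it names one specific crawler and has no
# wildcard ("*") entry. All bot names here are invented for illustration.
ROBOTS_TXT = """\
User-agent: ExampleBigAIBot
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt does not disallow user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://publisher.example/news/article-1"
    for bot in ("ExampleBigAIBot", "ExampleSmallBot"):
        # The named bot is blocked; the unnamed bot is not covered by any rule.
        print(bot, is_crawl_allowed(ROBOTS_TXT, bot, page))
```

In this sketch the unnamed bot is treated as allowed because no rule applies to it. A wildcard entry (`User-agent: *`) would cover every crawler that honours the file.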
The German Approach: Machine-Readability and AI Training Boundaries
In Kneschke v LAION, the Higher Regional Court of Hamburg (10 December 2025) dismissed a photographer's appeal against a first-instance judgment of the Hamburg District Court (27 September 2024). The appeal court clarified the first-instance decision on several key points.
The Higher Regional Court confirmed that downloading images to compare them with pre-existing descriptions constitutes reproduction for text and data mining purposes, as it involves automated analysis to obtain information about patterns, trends, and correlations. Both courts agreed the use qualified under the scientific research exception, as LAION is a non-profit organisation conducting research for future knowledge gain.
Regarding the opt-out mechanisms, the first-instance court suggested that natural language reservations could be considered "machine-readable" if technologies capable of detecting them existed. The Higher Regional Court rejected this approach, holding that an opt-out must be capable of being "interpreted by machines", not merely detected. The claimant failed to demonstrate that the reservation met this standard in 2021.
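The difference between a reservation that a machine can interpret and one it can merely encounter as text can be sketched as follows. The sketch assumes a structured flag in the HTTP response headers, using the header name `tdm-reservation` from the TDM Reservation Protocol (TDMRep) as an illustrative convention. Nothing in this post suggests that the court endorsed this or any other specific protocol, and the function name and sample headers are hypothetical.

```python
from typing import Mapping


def has_structured_tdm_reservation(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers carry a "tdm-reservation: 1" flag.

    The header name follows the TDMRep convention and is an assumption of
    this sketch, not a requirement stated by any court.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    return normalised.get("tdm-reservation") == "1"


if __name__ == "__main__":
    structured = {"TDM-Reservation": "1"}
    # Hypothetical natural-language notice: present in the response, but this
    # check has no way to interpret what it means.
    prose_only = {"X-Usage-Notice": "Please do not use our articles for AI training."}
    print(has_structured_tdm_reservation(structured))
    print(has_structured_tdm_reservation(prose_only))
```

A crawler running this check can act on the structured flag without human review. The prose notice is present in the response, but the check cannot interpret it.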
Critically, the Higher Regional Court expressly limited its reasoning to preparatory measures prior to actual AI training, deliberately avoiding the question of whether subsequent training of generative AI models falls within the text and data mining exception. This fundamental issue is now before the CJEU.
The Munich Court Decision: Memorisation Beyond Text and Data Mining
In GEMA v OpenAI (11 November 2025), the Munich Regional Court upheld a challenge by GEMA, Germany's music collecting society, to OpenAI's use of protected song lyrics in training ChatGPT.
GEMA alleged that OpenAI used lyrics from nine well-known German songs when training the GPT-4 and GPT-4o models without a licence. The court found that simple prompts such as "How is the text of [song title]" led ChatGPT to reproduce substantial parts of the lyrics almost verbatim.
The court held that AI training constitutes "reproduction" under German copyright law. Relying on IT research, it accepted that training data can become embedded in model weights and remain retrievable through "memorisation". The court rejected OpenAI's argument that identifying specific, definable data within the model was necessary. Instead, it held that the model's ability to generate statistically probable sequences that recognisably reproduce the lyrics was sufficient to constitute fixation under EU law.
Whilst the court confirmed that training large language models generally falls within the text and data mining exceptions, it found that memorising the disputed song lyrics exceeded this scope. The court distinguished between evaluating abstract information such as syntactic rules and semantic relationships (which constitutes text and data mining) and memorising specific protected works (which does not).
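One rough way to picture near-verbatim output is to measure the longest run of consecutive words that a model response shares with a reference text. The sketch below is illustrative only: it is not the test applied by the court, the function name is hypothetical, and the sample strings are placeholders rather than protected lyrics. It uses `difflib.SequenceMatcher` from Python's standard library.

```python
import re
from difflib import SequenceMatcher


def longest_shared_run(reference: str, output: str) -> int:
    """Return the length, in words, of the longest run of consecutive words
    that appears in both texts (case-insensitive, punctuation ignored)."""
    ref_words = re.findall(r"\w+", reference.lower())
    out_words = re.findall(r"\w+", output.lower())
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # tokens in long sequences, so common words are still compared.
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_words), 0, len(out_words))
    return match.size


if __name__ == "__main__":
    reference = "The quick brown fox jumps over the lazy dog"
    output = "He said the quick brown fox jumps high"
    print(longest_shared_run(reference, output))  # shared run: "the quick brown fox jumps"
```

A long shared run suggests that specific wording has been carried over, whereas output that reflects only general patterns of language would typically share short runs. Where the legal line falls between the two is for the courts to decide.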
The court placed responsibility on OpenAI, which selected the training data and operated the system. OpenAI has announced plans to appeal, meaning the judgment is not yet final.
Conclusion
The legal uncertainty surrounding text and data mining for AI training is an EU-wide challenge rather than one confined to any single Member State. The pending CJEU case is expected to provide much-needed clarity on whether LLM training constitutes reproduction and falls within the Article 4 exception. National courts will be bound by the CJEU's interpretation, which should resolve whether the text and data mining exception permits AI training or whether such activities require rightholder authorisation.
This forthcoming ruling will fundamentally shape AI development and determine which materials can legally be used for training purposes across the European Union.