Gen-AI tools are trained using materials (text, images, audio, video, code) which are protected by copyright. Permission is usually required to copy third-party owned works unless the work is out of copyright or a licence or exception applies. Use of third-party owned copyright works to train Gen-AI tools without permission from copyright owners could infringe copyright.
There are several ongoing legal actions in which rightsholders are suing technology companies for using copyright works to train Gen-AI tools. For example, Getty Images (a provider of stock photographs) sued Stability AI for copyright infringement, database right infringement, trade mark infringement, and passing off, alleging that Stability AI scraped millions of images from the Getty website without consent and used these images to train and develop Stable Diffusion. [Getty Images v Stability AI [2023] EWHC 3090 (Ch)] The New York Times has also sued OpenAI and Microsoft [New York Times v Microsoft and OpenAI].
The other point to consider is the potential for the Gen-AI tool you are using to generate output which is similar to a protected work by a real person and the likelihood of this giving rise to a claim for copyright infringement. Getty Images’ case against Stability AI claims that images generated by Stable Diffusion infringe upon Getty Images’ copyrighted works and bear their trade marks.
An exception in UK copyright law permits text and data mining (TDM) for a non-commercial purpose. Section 29A of the Copyright, Designs and Patents Act 1988 (CDPA) provides that the making of a copy of a work by a person who has lawful access to the work does not infringe copyright, provided that the copy is made to carry out a computational analysis of anything recorded in the work for the sole purpose of research for a non-commercial purpose, and that the copy is accompanied by a sufficient acknowledgement (unless this is impossible for reasons of practicality or otherwise). The exception includes a no-contractual-override clause, which means that where a term of a contract or licence prevents use of the exception, that term is unenforceable. UK law does not currently permit text and data mining for commercial purposes.
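The conditions summarised above can be read as a checklist. The Python sketch below is a simplified illustration of that reading, not a statement of the law: the class `TdmCopy`, its field names, and the function `unmet_conditions` are all hypothetical, and each field reduces a question of legal judgement to a yes/no answer.

```python
from dataclasses import dataclass


@dataclass
class TdmCopy:
    """Facts about one proposed copy. All field names are hypothetical."""
    lawful_access: bool                        # the person copying has lawful access to the work
    for_computational_analysis: bool           # the copy is made to carry out computational analysis
    sole_purpose_noncommercial_research: bool  # research for a non-commercial purpose only
    acknowledgement_given: bool                # a sufficient acknowledgement accompanies the copy
    acknowledgement_impossible: bool = False   # impossible for reasons of practicality or otherwise


def unmet_conditions(facts: TdmCopy) -> list[str]:
    """Return the conditions from the summary above that are not met."""
    unmet = []
    if not facts.lawful_access:
        unmet.append("lawful access")
    if not facts.for_computational_analysis:
        unmet.append("computational analysis")
    if not facts.sole_purpose_noncommercial_research:
        unmet.append("non-commercial research purpose")
    # The acknowledgement condition falls away only where acknowledgement is impossible.
    if not (facts.acknowledgement_given or facts.acknowledgement_impossible):
        unmet.append("sufficient acknowledgement")
    return unmet


# Example: a copy made for a commercial purpose with no acknowledgement.
print(unmet_conditions(TdmCopy(True, True, False, False)))
```

An empty list means only that none of the summarised conditions has been flagged as unmet. It does not settle whether the exception applies in a particular case.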
The exception could apply to training Gen-AI models where this is carried out for non-commercial research purposes, and so is helpful to researchers developing Gen-AI models within the University. Difficulty arises where the Gen-AI model is commercialised. Permission or a licence is likely to be required to train Gen-AI models for a commercial purpose.
It is worth noting that technical protection measures (TPMs) applied by rightsholders to protect their content can present obstacles to using the exception. For example, it may not be possible to download the volume of published research articles required to train a Gen-AI model because TPMs restrict copying to a specified amount. Restrictions in publishers' licence terms, enforced by systems that flag excessive downloading, can result in users being blocked.
The TDM exception does not expressly permit circumventing TPMs to take advantage of the exception. The Intellectual Property Office website states that publishers and content providers will be able to apply reasonable measures to maintain their network security or stability, but these measures should not prevent or unreasonably restrict a researcher's ability to text and data mine. However, it does not explain what you can do if these measures restrict your ability to data mine other than filing a request to the Secretary of State to remove a TPM as provided in the CDPA.
Some commercial publishers facilitate text and data mining via licensing; whether there is a cost involved depends on whether the TDM is being done for commercial or non-commercial purposes. Elsevier have adopted a licence-based approach which automatically enables researchers at subscribing institutions to text mine for non-commercial research purposes and to gain access to full-text content in XML for this purpose. However, obtaining licence keys can be complex and time-consuming, and it may be preferable to use an open access source such as CORE for training data. It is also possible to use the CORE datasets for a commercial purpose, subject to CORE’s terms and conditions.