By Andrew Pery, AI Ethics Evangelist, ABBYY

Since ChatGPT and other large language models (LLMs) gained traction, concerns about copyright and privacy risk have become increasingly urgent. This risk is compounded by the fact that many employees admit to using generative AI tools at work without formal approval or governance, creating hidden exposure for organizations.
One of the most significant areas of risk lies in documents.
Across industries, documents remain the backbone of business operations. They contain contracts, compliance records, medical information, financial histories, personal identifiers, trade secrets, and copyrighted works. As organizations modernize document workflows, they are increasingly looking to augment traditional automation with the reasoning capabilities of LLMs, the contextual intelligence of retrieval-augmented generation (RAG), and even agentic AI. The challenge is to integrate these capabilities without introducing unacceptable copyright or privacy liabilities.
LLMs are trained by ingesting vast amounts of text from books, articles, and websites, and converting that information into numerical representations known as embeddings. These embeddings can introduce legal and privacy risk in two ways: they may encode copyrighted expression, and they may encode personal information.
Recent legal and policy work emphasizes that embeddings are not just harmless math. In some cases, they may be reverse-engineered or exploited to infer underlying content, exposing organizations to copyright infringement or privacy violations.
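To make the risk concrete, here is a minimal sketch of an embedding-inference scenario. It uses the open-source sentence-transformers library purely for illustration; the model name, texts, and threat scenario are assumptions, not drawn from any specific system or incident. The point is simply that an attacker who obtains a stored embedding can rank candidate guesses against it and learn what the underlying text was about.

```python
# Minimal sketch: an embedding is just a vector, and anyone who can
# compare vectors can often infer what text produced one.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# A "leaked" embedding of a sensitive sentence (the attacker never sees the text).
secret = "Patient John Doe was diagnosed with Type 2 diabetes on 2024-03-01."
leaked_vector = model.encode(secret)

# The attacker embeds candidate guesses and ranks them by cosine similarity.
candidates = [
    "John Doe has Type 2 diabetes.",
    "The quarterly invoice totals $12,400.",
    "The contract renews automatically each January.",
]
candidate_vectors = model.encode(candidates)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for text, vec in zip(candidates, candidate_vectors):
    print(f"{cosine(leaked_vector, vec):.3f}  {text}")
# The medically sensitive guess scores far higher than the unrelated ones,
# revealing information about the underlying document.
```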
U.S. copyright law allows certain uses of copyrighted material under the doctrine of ‘fair use’. In Authors Guild v. Google, Inc., the Second Circuit held that Google’s digitization of books to make them searchable constituted fair use because it was transformative and did not substitute for the originals.
However, the Supreme Court narrowed this reasoning in Andy Warhol Foundation v. Goldsmith, emphasizing that courts must evaluate each specific use and consider whether it competes with or harms the market for the original work. Transformation alone is not sufficient. These distinctions are increasingly relevant for AI.
Courts are now testing how fair use applies to AI training and outputs. In Bartz v. Anthropic PBC, a federal judge held that training models on copyrighted books could qualify as fair use in principle, but allowed the case to proceed over the company's acquisition and retention of pirated copies of those books.
The subsequent $1.5 billion settlement—the largest copyright settlement in U.S. history—highlighted how quickly the legal landscape is shifting.
In December 2025, a new, non-class-action lawsuit led by Bad Blood author John Carreyrou targeted six major AI companies, challenging the foundational premise that fair use should extend to training, especially where companies knowingly ingested infringing datasets and generated enormous commercial value from their models without adequate compensation.
At the center of these cases is the fourth fair-use factor: market harm. Courts are increasingly acknowledging that AI systems are not merely indexes or search tools, but generative systems whose outputs may contain copyrighted text, replicate proprietary styles, or substitute for original works.
So where does this leave companies? With so many open legal questions, business leaders must ask: Is training on copyrighted data more akin to copying in order to build a derivative artifact, which may be infringing? Are embeddings "copies"? Do AI outputs cause market harm when they substitute for original journalism or books?
General-purpose AI tools such as LLMs were not designed with these risks in mind. But purpose-built Document AI systems differ both philosophically and technically from generative models.
First, they avoid unlicensed, scraped training data. While foundation models rely on massive datasets gathered from the open internet (some of it licensed, much of it not), purpose-built systems take the opposite approach: they build models using controlled datasets, synthetic data, or licensed corpora. There is no mystery about where the training material came from, and no risk that the system learned from pirated books or proprietary documents.
Second, they don’t retain customer documents for model improvement. In general-purpose AI, anything a customer uploads may become part of the next model. With purpose-built Document AI, the workflow is explicitly segmented: data goes in, fields come out, and the underlying documents are not absorbed into a global model (a minimal sketch of this workflow follows the fifth point below).
Third, they support on-premise and private-cloud deployment. Many enterprises handle data that cannot legally leave their infrastructure. Purpose-built Document AI solutions let them keep full control, avoiding the security gaps and compliance risks that come with sending documents to third-party servers.
Fourth, they minimize or eliminate persistent embeddings. Embeddings are one of the biggest drivers of legal uncertainty because they can encode personal or expressive information. Document AI often bypasses or tightly controls embedding creation. When embeddings are used for classification or semantic matching, they’re isolated to the customer’s own environment and not commingled across tenants.
Fifth, they don’t generate new copyrighted output. Generative AI models can accidentally reproduce training content word-for-word. Document AI doesn’t generate text—it extracts. That eliminates one of the biggest sources of infringement risk.
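As a rough illustration of the second and fifth points, the sketch below shows an extraction workflow that touches the document transiently and emits only structured fields. Everything here is hypothetical: the function, field names, and document bytes are stand-ins, not any vendor's actual API.

```python
# Hypothetical sketch of a "data in, fields out" extraction workflow.

def extract_invoice_fields(document_bytes: bytes) -> dict:
    """Stand-in for a purpose-built extraction call.

    The document is processed transiently; nothing here is retained or
    fed into a global training corpus.
    """
    # ... OCR and field extraction would happen here ...
    return {
        "invoice_number": "INV-10023",
        "total": "1250.00",
        "currency": "USD",
        "due_date": "2025-07-01",
    }

document = b"%PDF-1.7 ..."  # stand-in for a scanned invoice
fields = extract_invoice_fields(document)

# Only the structured fields move downstream; the raw document does not
# become training data, so there is nothing for a future model to "learn".
print(fields)
```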
While copyright gets more attention in headlines, the privacy issues are just as urgent. In many industries, document processing directly exposes AI systems to some of the most sensitive information a company handles.
Purpose-built Document AI reduces this risk through design choices that make privacy protection easier and more reliable.
Data minimization is built into the workflow. Instead of keeping full documents, these systems can retain only the fields that matter—amounts, dates, IDs, addresses—discarding the wider document context. That drastically reduces the exposure if a breach occurs.
Field-level redaction and pseudonymization mean that names, account numbers, birthdates, and other personal identifiers can be redacted or hashed automatically before the data moves into downstream systems.
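A minimal sketch of how minimization and pseudonymization might look in code follows; the field names, retention rules, and salt handling are illustrative assumptions, not a specific product's behavior.

```python
# Illustrative sketch: keep only the fields the workflow needs, then
# pseudonymize direct identifiers before anything moves downstream.
import hashlib

extracted = {
    "name": "Jane Q. Public",
    "account_number": "123456789",
    "birthdate": "1980-04-12",
    "amount": "2500.00",
    "due_date": "2025-08-15",
    "full_text": "...entire OCR'd document...",  # never leaves this step
}

NEEDED = {"name", "account_number", "birthdate", "amount", "due_date"}
IDENTIFIERS = {"name", "account_number", "birthdate"}

def pseudonymize(value: str, salt: str = "per-tenant-secret") -> str:
    # One-way hash: downstream systems can still join records on the
    # pseudonym without ever seeing the raw identifier. A real system
    # would manage the salt as a protected per-tenant secret.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

downstream = {
    k: (pseudonymize(v) if k in IDENTIFIERS else v)
    for k, v in extracted.items()
    if k in NEEDED  # data minimization: drop everything else, incl. full_text
}
print(downstream)
```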
If embeddings are generated at all, they remain locked inside customer-specific environments. Attackers cannot probe the model to discover whether an individual’s data was used.
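One plausible way to enforce that isolation, sketched with an in-memory store: the class and its API are hypothetical, and a production system would use a real vector database with per-tenant indexes, but the design principle is the same.

```python
# Sketch of tenant-isolated embedding stores: each tenant gets its own
# index, and a query can never search across tenants.
from collections import defaultdict
import numpy as np

class TenantVectorStore:
    def __init__(self):
        self._stores = defaultdict(list)  # tenant_id -> [(vector, doc_id)]

    def add(self, tenant_id: str, vector: np.ndarray, doc_id: str) -> None:
        self._stores[tenant_id].append((vector, doc_id))

    def search(self, tenant_id: str, query: np.ndarray, top_k: int = 3):
        # tenant_id is mandatory and scopes every lookup; there is no
        # API path that compares a query against another tenant's vectors.
        scored = [
            (float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))), d)
            for v, d in self._stores[tenant_id]
        ]
        return sorted(scored, reverse=True)[:top_k]
```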
Document AI systems tend to include full tracking of who accessed a document, when, and for what purpose. This satisfies regulators’ expectations for accountability and helps organizations demonstrate compliance.
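A simple append-only access log captures the who, when, and why that regulators look for. This sketch assumes a flat JSON-lines file for clarity; real deployments would use tamper-evident, centrally managed storage.

```python
# Sketch of an append-only access log (the schema is illustrative):
# every document access records who, when, and for what purpose.
import datetime
import json

def log_access(user: str, document_id: str, purpose: str,
               path: str = "audit.log") -> None:
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "document_id": document_id,
        "purpose": purpose,
    }
    # Append-only: entries are only ever added, supporting later review.
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_access("a.nalyst@example.com", "doc-4821", "invoice-approval-review")
```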
Even with a safer toolset, responsible implementation matters. Organizations adopting Document AI should still embrace practical safeguards: minimize the data retained, redact or pseudonymize identifiers before they move downstream, keep embeddings isolated, and maintain complete audit trails.
As generative AI reshapes work, it’s tempting to try to solve every problem with the same model. But the legal landscape is making one thing clear: when dealing with sensitive or copyrighted documents, a different kind of intelligence is needed.
Purpose-built Document AI avoids the pitfalls of general-purpose models by design. It processes documents without absorbing them. It extracts information without learning more than it needs. It keeps data isolated rather than blending it into global models. And it equips organizations with the guardrails required to meet evolving copyright and privacy standards.
In a world where the rules of AI are still taking shape, organizations cannot afford guesswork. They need tools designed for compliance, not just performance. They need systems that treat documents with the care the law requires and the caution reality demands.
By combining rights-safe training, privacy-by-design features, controlled embeddings, and robust security frameworks, such systems significantly reduce the exposure to both copyright infringement and data-privacy violations while retaining operational efficiency.

Andrew Pery is an AI Ethics Evangelist at global intelligent automation company ABBYY and is a certified Data Privacy Professional and a certified AI Auditor. Andrew has more than 25 years of experience spearheading tech management programs for leading global technology companies. His expertise is in intelligent automation, with a particular focus on AI governance, data privacy, and AI ethics. He holds a Master of Laws degree with Distinction from Northwestern University Pritzker School of Law and is a member of the American Bar Association and the International Association of Privacy Professionals.