By Andrew Pery, AI Ethics Evangelist, ABBYY

Since ChatGPT and other large language models (LLMs) gained traction, concerns about copyright and privacy risk have become increasingly urgent. This risk is compounded by the fact that many employees admit to using generative AI tools at work without formal approval or governance, creating hidden exposure for organizations.
One of the most significant areas of risk lies in documents.
Across industries, documents remain the backbone of business operations. They contain contracts, compliance records, medical information, financial histories, personal identifiers, trade secrets, and copyrighted works. As organizations modernize document workflows, they are increasingly looking to augment traditional automation with the reasoning capabilities of LLMs, the contextual intelligence of retrieval-augmented generation (RAG), and even agentic AI. The challenge is to integrate these capabilities without introducing unacceptable copyright or privacy liabilities.
LLMs are trained by ingesting vast amounts of text from books, articles, and websites, and converting that information into numerical representations known as embeddings. These embeddings can introduce legal and privacy risk in two ways: they may encode copyrighted expression, and they may encode personal information.
Recent legal and policy work emphasizes that embeddings are not just harmless math. In some cases, they may be reverse-engineered or exploited to infer underlying content, exposing organizations to copyright infringement or privacy violations.
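To make the risk concrete, here is a minimal sketch of an embedding-inference scenario. It uses the open-source sentence-transformers library purely for illustration; the model name, texts, and threat scenario are assumptions, not drawn from any specific system or incident. The point is simply that an attacker who obtains a stored embedding can rank candidate guesses against it and learn what the underlying text was about.

```python
# Minimal sketch: an embedding is just a vector, and anyone who can
# compare vectors can often infer what text produced one.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# A "leaked" embedding of a sensitive sentence (the attacker never sees the text).
secret = "Patient John Doe was diagnosed with Type 2 diabetes on 2024-03-01."
leaked_vector = model.encode(secret)

# The attacker embeds candidate guesses and ranks them by cosine similarity.
candidates = [
    "John Doe has Type 2 diabetes.",
    "The quarterly invoice totals $12,400.",
    "The contract renews automatically each January.",
]
candidate_vectors = model.encode(candidates)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for text, vec in zip(candidates, candidate_vectors):
    print(f"{cosine(leaked_vector, vec):.3f}  {text}")
# The medically sensitive guess scores far higher than the unrelated ones,
# revealing information about the underlying document.
```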
U.S. copyright law allows certain uses of copyrighted material under the doctrine of ‘fair use’. In Authors Guild v. Google, Inc., the Second Circuit held that Google’s digitization of books to make them searchable constituted fair use because it was transformative and did not substitute for the originals.
However, the Supreme Court narrowed this reasoning in Andy Warhol Foundation v. Goldsmith, emphasizing that courts must evaluate each specific use and consider whether it competes with or harms the market for the original work. Transformation alone is not sufficient. These distinctions are increasingly relevant for AI.
Courts are now testing how fair use applies to AI training and outputs. In Bartz v. Anthropic PBC, a federal judge held that training models on copyrighted books could qualify as fair use in principle, but allowed the case to proceed over the company's acquisition and retention of pirated copies of those books.
The subsequent $1.5 billion settlement—the largest copyright settlement in U.S. history—highlighted how quickly the legal landscape is shifting.
In December 2025, a new, non-class-action lawsuit led by Bad Blood author John Carreyrou targeted six major AI companies, challenging the foundational premise that fair use should extend to training, especially where companies knowingly ingested infringing datasets and generated enormous commercial value from their models without adequate compensation.
At the center of these cases is the fourth fair-use factor: market harm. Courts are increasingly acknowledging that AI systems are not merely indexes or search tools, but generative systems whose outputs may contain copyrighted text, replicate proprietary styles, or substitute for original works.
So where does this leave companies? With so many open legal questions, business leaders must ask: Is training on copyrighted data more akin to copying in order to build a derivative artifact, which may be infringing? Are embeddings "copies"? Do AI outputs cause market harm when they substitute for original journalism or books?
General-purpose AI tools such as LLMs were not designed with these risks in mind. But purpose-built Document AI systems differ both philosophically and technically from generative models.
First, they avoid unlicensed, scraped training data. While foundation models rely on massive datasets gathered from the open internet (some of it licensed, much of it not), purpose-built systems take the opposite approach: they build models using controlled datasets, synthetic data, or licensed corpora. There is no mystery about where the training material came from, and no risk that the system learned from pirated books or proprietary documents.
Second, they don’t retain customer documents for model improvement. In general-purpose AI, anything a customer uploads may become part of the next model. With purpose-built Document AI, the workflow is explicitly segmented: data goes in, fields come out, and the underlying documents are not absorbed into a global model (a minimal sketch of this workflow follows the fifth point below).
Third, they support on-premise and private-cloud deployment. Many enterprises handle data that cannot legally leave their infrastructure. Purpose-built Document AI solutions let them keep full control, avoiding the security gaps and compliance risks that come with sending documents to third-party servers.
Fourth, they minimize or eliminate persistent embeddings. Embeddings are one of the biggest drivers of legal uncertainty because they can encode personal or expressive information. Document AI often bypasses or tightly controls embedding creation. When embeddings are used for classification or semantic matching, they’re isolated to the customer’s own environment and not commingled across tenants.
Fifth, they don’t generate new copyrighted output. Generative AI models can accidentally reproduce training content word-for-word. Document AI doesn’t generate text—it extracts. That eliminates one of the biggest sources of infringement risk.
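As a rough illustration of the second and fifth points, the sketch below shows an extraction workflow that touches the document transiently and emits only structured fields. Everything here is hypothetical: the function, field names, and document bytes are stand-ins, not any vendor's actual API.

```python
# Hypothetical sketch of a "data in, fields out" extraction workflow.

def extract_invoice_fields(document_bytes: bytes) -> dict:
    """Stand-in for a purpose-built extraction call.

    The document is processed transiently; nothing here is retained or
    fed into a global training corpus.
    """
    # ... OCR and field extraction would happen here ...
    return {
        "invoice_number": "INV-10023",
        "total": "1250.00",
        "currency": "USD",
        "due_date": "2025-07-01",
    }

document = b"%PDF-1.7 ..."  # stand-in for a scanned invoice
fields = extract_invoice_fields(document)

# Only the structured fields move downstream; the raw document does not
# become training data, so there is nothing for a future model to "learn".
print(fields)
```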
While copyright gets more attention in headlines, the privacy issues are just as urgent. In many industries, document processing directly exposes AI systems to some of the most sensitive information a company handles.
Purpose-built Document AI reduces this risk through design choices that make privacy protection easier and more reliable.
Data minimization is built into the workflow. Instead of keeping full documents, these systems can retain only the fields that matter—amounts, dates, IDs, addresses—discarding the wider document context. That drastically reduces the exposure if a breach occurs.
Field-level redaction and pseudonymization mean that names, account numbers, birthdates, and other personal identifiers can be redacted or hashed automatically before the data moves into downstream systems.
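A minimal sketch of how minimization and pseudonymization might look in code follows; the field names, retention rules, and salt handling are illustrative assumptions, not a specific product's behavior.

```python
# Illustrative sketch: keep only the fields the workflow needs, then
# pseudonymize direct identifiers before anything moves downstream.
import hashlib

extracted = {
    "name": "Jane Q. Public",
    "account_number": "123456789",
    "birthdate": "1980-04-12",
    "amount": "2500.00",
    "due_date": "2025-08-15",
    "full_text": "...entire OCR'd document...",  # never leaves this step
}

NEEDED = {"name", "account_number", "birthdate", "amount", "due_date"}
IDENTIFIERS = {"name", "account_number", "birthdate"}

def pseudonymize(value: str, salt: str = "per-tenant-secret") -> str:
    # One-way hash: downstream systems can still join records on the
    # pseudonym without ever seeing the raw identifier. A real system
    # would manage the salt as a protected per-tenant secret.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

downstream = {
    k: (pseudonymize(v) if k in IDENTIFIERS else v)
    for k, v in extracted.items()
    if k in NEEDED  # data minimization: drop everything else, incl. full_text
}
print(downstream)
```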
If embeddings are generated at all, they remain locked inside customer-specific environments. Attackers cannot probe the model to discover whether an individual’s data was used.
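One plausible way to enforce that isolation, sketched with an in-memory store: the class and its API are hypothetical, and a production system would use a real vector database with per-tenant indexes, but the design principle is the same.

```python
# Sketch of tenant-isolated embedding stores: each tenant gets its own
# index, and a query can never search across tenants.
from collections import defaultdict
import numpy as np

class TenantVectorStore:
    def __init__(self):
        self._stores = defaultdict(list)  # tenant_id -> [(vector, doc_id)]

    def add(self, tenant_id: str, vector: np.ndarray, doc_id: str) -> None:
        self._stores[tenant_id].append((vector, doc_id))

    def search(self, tenant_id: str, query: np.ndarray, top_k: int = 3):
        # tenant_id is mandatory and scopes every lookup; there is no
        # API path that compares a query against another tenant's vectors.
        scored = [
            (float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))), d)
            for v, d in self._stores[tenant_id]
        ]
        return sorted(scored, reverse=True)[:top_k]
```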
Document AI systems tend to include full tracking of who accessed a document, when, and for what purpose. This satisfies regulators’ expectations for accountability and helps organizations demonstrate compliance.
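A simple append-only access log captures the who, when, and why that regulators look for. This sketch assumes a flat JSON-lines file for clarity; real deployments would use tamper-evident, centrally managed storage.

```python
# Sketch of an append-only access log (the schema is illustrative):
# every document access records who, when, and for what purpose.
import datetime
import json

def log_access(user: str, document_id: str, purpose: str,
               path: str = "audit.log") -> None:
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "document_id": document_id,
        "purpose": purpose,
    }
    # Append-only: entries are only ever added, supporting later review.
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_access("a.nalyst@example.com", "doc-4821", "invoice-approval-review")
```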
Even with a safer toolset, responsible implementation matters. Organizations adopting Document AI should still embrace practical safeguards: minimize the data retained, redact or pseudonymize identifiers before they move downstream, keep embeddings isolated, and maintain complete audit trails.
As generative AI reshapes work, it’s tempting to try to solve every problem with the same model. But the legal landscape is making one thing clear: when dealing with sensitive or copyrighted documents, a different kind of intelligence is needed.
Purpose-built Document AI avoids the pitfalls of general-purpose models by design. It processes documents without absorbing them. It extracts information without learning more than it needs. It keeps data isolated rather than blending it into global models. And it equips organizations with the guardrails required to meet evolving copyright and privacy standards.
In a world where the rules of AI are still taking shape, organizations cannot afford guesswork. They need tools designed for compliance, not just performance. They need systems that treat documents with the care the law requires and the caution reality demands.
By combining rights-safe training, privacy-by-design features, controlled embeddings, and robust security frameworks, such systems significantly reduce the exposure to both copyright infringement and data-privacy violations while retaining operational efficiency.

Andrew Pery is an AI Ethics Evangelist at global intelligent automation company ABBYY and is a certified Data Privacy Professional and a certified AI Auditor. Andrew has more than 25 years of experience spearheading tech management programs for leading global technology companies. His expertise is in intelligent automation, with a particular focus on AI governance, data privacy, and AI ethics. He holds a Master of Laws degree with Distinction from Northwestern University Pritzker School of Law and is a member of the American Bar Association and the International Association of Privacy Professionals.