Our Pragmatic Auto-Translator challenges the dominant approach to both human and machine translation, in which content is divided up and translated sentence by sentence using opaque automatic translation systems. We're also working from a fundamental question: how do people actually write?
Writing is a non-linear process of drafting and revising. Writers don't create a table and write each sentence in its own cell, yet this is exactly how most translation technology works today. We believe this mismatch is at the heart of many translation quality problems.
Pragmatics in translation serves as a measure of cohesion on two crucial levels: the document and the culture it addresses.
Achieving pragmatically appropriate translation requires refashioning ideas so they align with both document-level expectations (how this type of text typically works) and cultural expectations (how this audience thinks and communicates). While good writing will sometimes break the rules intentionally, literal sentence-by-sentence translation breaks them unintentionally, and far too often to achieve true pragmatic appropriateness.
Our automatic translation system embraces two major paradigm shifts.
We deliberately exclude translations from our corpora because of the inherent characteristics that Baker's research identifies in translated text: tendencies toward simplification, explicitation, and normalization that make translations read differently from originally authored writing.
Most publicly available translations were also produced sentence by sentence, imposing source-language structures onto target languages. This creates writing that doesn't reflect how content is naturally produced in the destination language.
We curate specialized corpora that reflect how people actually write in specific contexts. From this knowledge, we create vectors at three levels of granularity.
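A minimal sketch of what multi-level vectorization can look like before embedding. The level names here (document, section, sentence) and the splitting heuristics are illustrative assumptions, not the project's actual schema:

```python
# Sketch: splitting a corpus text into three levels of granularity so each
# level can be vectorized separately. Level names and splitting rules are
# assumptions for illustration only.

def chunk_three_levels(text: str) -> dict[str, list[str]]:
    """Return text chunks at three granularities for separate vectorization."""
    # Treat blank-line-separated blocks as sections (a simplifying assumption).
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    # Naive sentence split; a real system would use a proper segmenter.
    sentences = [
        sent.strip()
        for section in sections
        for sent in section.split(". ")
        if sent.strip()
    ]
    return {
        "document": [text.strip()],  # one vector for the whole text
        "section": sections,         # one vector per section
        "sentence": sentences,       # one vector per sentence
    }
```

Embedding each level separately lets retrieval match a query against whole-document style, section-level conventions, or sentence-level phrasing as needed.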
Once we have those vectors, we integrate them into the automatic translation process through RAG-based querying. The source text is also vectorized and compared against the corpus, and the most relevant content is passed to DeepSeek to inform the LLM-generated translation.
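The retrieval step above can be sketched as follows. This is a toy under stated assumptions: a bag-of-words count stands in for a real embedding model, and the prompt template is invented for illustration rather than taken from the system:

```python
import math
from collections import Counter

# Sketch of RAG retrieval: "embed" the query and corpus passages, rank by
# cosine similarity, and assemble a grounded prompt for the LLM. The toy
# bag-of-words embedding and the prompt wording are assumptions.

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(source_text: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the source text."""
    q = embed(source_text)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(source_text: str, passages: list[str]) -> str:
    """Assemble a translation prompt grounded in retrieved target-language text."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Reference passages from target-language originals:\n{context}\n\n"
        f"Translate the following text, matching the style above:\n{source_text}"
    )
```

In production the toy `embed` would be replaced by the corpus embedding model, and the ranked passages would be drawn from whichever granularity level best matches the query.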
We believe that curation is key. Following Bender and Gebru's research, we recognize that the opacity of training data in large language models amplifies problematic biases, misinformation, and English-centric perspectives. Carefully selecting and transparently citing our corpus sources is therefore fundamental to our approach.
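One way to make that transparency concrete is to attach citation metadata to every corpus passage. The field names below are hypothetical; the point is that each retrieved passage remains traceable to a citable source:

```python
from dataclasses import dataclass

# Sketch of transparent corpus provenance. Field names are assumptions;
# the design goal is that every passage carries an auditable citation.

@dataclass(frozen=True)
class CorpusEntry:
    text: str          # the passage that gets vectorized and retrieved
    source_title: str  # title of the original publication
    author: str
    year: int
    url: str

    def citation(self) -> str:
        """Human-readable citation string for audit and attribution."""
        return f"{self.author} ({self.year}). {self.source_title}. {self.url}"
```

Because entries are immutable and self-describing, any passage surfaced by retrieval can be cited alongside the translation it informed.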
We're building toward our vision iteratively, starting with this RAG-based approach. Our R&D roadmap includes incorporating terminological knowledge graphs that ground LLMs in how humans organize ideas conceptually, adding a translation quality evaluation system, and fine-tuning language models.
We're starting small to build thoughtfully and move away from harmful industry trends while developing our pragmatic approach.
We'll be presenting our findings on translation quality gains at the HCI International 2026 conference in July. Keep an eye out for the research publications that result from our participation.