Our Pragmatic Auto-Translator challenges the dominant approach to both human and machine translation, in which content is divided up and translated sentence by sentence using opaque automatic translation systems. We're also working from a fundamental question: how do people actually write?
Writing is a non-linear process of drafting and revising. Writers don't create a table and write each sentence in its own cell, yet this is exactly how most translation technology works today. We believe this mismatch is at the heart of many translation quality problems.
Pragmatics in translation serves as a measure of cohesion on two crucial levels: the document and the culture it addresses.
Achieving pragmatically appropriate translation requires refashioning ideas so they align with both document-level expectations (how this type of text typically works) and cultural expectations (how this audience thinks and communicates). While good writing will sometimes break the rules intentionally, literal sentence-by-sentence translation breaks them unintentionally, and far too often to achieve true pragmatic appropriateness.
Our automatic translation system embraces two major paradigm shifts.
We deliberately exclude translations from our corpora because of the inherent characteristics that Baker's research identifies in translated text: tendencies toward simplification, explicitation, and normalization that make translations read differently from originally authored writing.
Most publicly available translations were also produced sentence by sentence, imposing source-language structures onto target languages. This creates writing that doesn't reflect how content is naturally produced in the destination language.
We curate specialized corpora that reflect how people actually write in specific contexts. From this knowledge, we create vectors at three levels of granularity.
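A minimal sketch of what multi-level vectorization can look like before embedding. The level names here (document, section, sentence) and the splitting heuristics are illustrative assumptions, not the project's actual schema:

```python
# Sketch: splitting a corpus text into three levels of granularity so each
# level can be vectorized separately. Level names and splitting rules are
# assumptions for illustration only.

def chunk_three_levels(text: str) -> dict[str, list[str]]:
    """Return text chunks at three granularities for separate vectorization."""
    # Treat blank-line-separated blocks as sections (a simplifying assumption).
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    # Naive sentence split; a real system would use a proper segmenter.
    sentences = [
        sent.strip()
        for section in sections
        for sent in section.split(". ")
        if sent.strip()
    ]
    return {
        "document": [text.strip()],  # one vector for the whole text
        "section": sections,         # one vector per section
        "sentence": sentences,       # one vector per sentence
    }
```

Embedding each level separately lets retrieval match a query against whole-document style, section-level conventions, or sentence-level phrasing as needed.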
Once we have those vectors, we integrate them into the automatic translation process through RAG-based querying. The source text is also vectorized and compared against the corpus, and the most relevant content is passed to DeepSeek to inform the LLM-generated translation.
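The retrieval step above can be sketched as follows. This is a toy under stated assumptions: a bag-of-words count stands in for a real embedding model, and the prompt template is invented for illustration rather than taken from the system:

```python
import math
from collections import Counter

# Sketch of RAG retrieval: "embed" the query and corpus passages, rank by
# cosine similarity, and assemble a grounded prompt for the LLM. The toy
# bag-of-words embedding and the prompt wording are assumptions.

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(source_text: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the source text."""
    q = embed(source_text)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(source_text: str, passages: list[str]) -> str:
    """Assemble a translation prompt grounded in retrieved target-language text."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Reference passages from target-language originals:\n{context}\n\n"
        f"Translate the following text, matching the style above:\n{source_text}"
    )
```

In production the toy `embed` would be replaced by the corpus embedding model, and the ranked passages would be drawn from whichever granularity level best matches the query.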
We believe that curation is key. Following Bender and Gebru's research, we recognize that the opacity of training data in large language models amplifies problematic biases, misinformation, and English-centric perspectives. Carefully selecting and transparently citing our corpus sources is therefore fundamental to our approach.
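One way to make that transparency concrete is to attach citation metadata to every corpus passage. The field names below are hypothetical; the point is that each retrieved passage remains traceable to a citable source:

```python
from dataclasses import dataclass

# Sketch of transparent corpus provenance. Field names are assumptions;
# the design goal is that every passage carries an auditable citation.

@dataclass(frozen=True)
class CorpusEntry:
    text: str          # the passage that gets vectorized and retrieved
    source_title: str  # title of the original publication
    author: str
    year: int
    url: str

    def citation(self) -> str:
        """Human-readable citation string for audit and attribution."""
        return f"{self.author} ({self.year}). {self.source_title}. {self.url}"
```

Because entries are immutable and self-describing, any passage surfaced by retrieval can be cited alongside the translation it informed.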
We're building toward our vision iteratively, starting with this RAG-based approach. Our R&D roadmap includes incorporating terminological knowledge graphs that ground LLMs in how humans organize ideas conceptually, adding a translation quality evaluation system, and fine-tuning language models.
We're starting small to build thoughtfully and move away from harmful industry trends while developing our pragmatic approach.
We'll be presenting our findings on translation quality gains at the HCI International 2026 conference in July. Keep an eye out for the research publications that result from our participation.