Translation

English is the normalized index language, not the only language that matters.

Disaster Clippy is designed to preserve original-language material while normalizing searchable text layers into English for the main embedding and indexing path.

Core rule

The system treats English as the core language for the primary embedding and indexing layer. That keeps retrieval behavior more uniform across mixed-source collections and avoids fragmenting the main search space into disconnected language silos.

What gets preserved

Using English as the normalized index language does not mean other languages are discarded. The intended model is to keep multiple text layers whenever possible:

  • original-language text
  • translated English text
  • lineage back to the original source or timed segments where relevant
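
As a rough illustration of the layers listed above, the record below keeps the original text, the English translation, and a lineage pointer together. This is a minimal sketch, not Disaster Clippy's actual schema: the class names `LayeredText` and `Lineage`, every field name, and the `index_text` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lineage:
    """Pointer back to where the text came from (hypothetical fields)."""
    source_id: str                          # e.g. a document or video identifier
    start_seconds: Optional[float] = None   # set only for timed segments
    end_seconds: Optional[float] = None


@dataclass
class LayeredText:
    """One unit of content with its original and English layers kept together."""
    original_text: str
    original_language: str                  # language tag such as "es" or "en"
    lineage: Lineage
    english_text: Optional[str] = None      # None until a translation exists

    def index_text(self) -> Optional[str]:
        """Return the text that would feed the English index, if available."""
        if self.original_language == "en":
            return self.original_text
        return self.english_text


record = LayeredText(
    original_text="Hierva el agua durante un minuto.",
    original_language="es",
    lineage=Lineage(source_id="doc-001"),
    english_text="Boil the water for one minute.",
)
```

In this sketch the original text is never overwritten: the English layer is an additional field, and `index_text` only chooses which layer would be indexed.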

Why normalize to English

English is the current bridge language for the main search stack. A shared normalized embedding layer makes the retrieval side simpler and more consistent while still allowing multilingual inputs and outputs.

How language packs fit in

Language packs are how the system moves between the original language and the normalized English layer. The intended pipeline is:

  1. Acquire text in the original language
  2. Preserve that original text as a durable layer
  3. Translate into English when needed
  4. Chunk and index from the English-normalized text
  5. Continue exposing the original language alongside it where useful
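
The five steps above can be sketched in a few lines. This is an assumption-heavy illustration rather than the real implementation: `ingest`, `chunk_words`, `IndexChunk`, and `toy_translate` are hypothetical names, the word-count chunker is a placeholder for whatever chunking the system actually uses, and the translator is a lookup table standing in for a language pack.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IndexChunk:
    english_text: str        # text that would be embedded and indexed
    original_text: str       # preserved source-language text of the document
    original_language: str   # language tag of the original, e.g. "es"


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def ingest(original_text: str,
           language: str,
           translate: Callable[[str, str], str],
           max_words: int = 50) -> List[IndexChunk]:
    # Steps 1-2: the original text arrives and is kept unchanged.
    # Step 3: translate into English only when the source is not English.
    english = original_text if language == "en" else translate(original_text, language)
    # Step 4: chunk from the English-normalized text.
    # Step 5: each chunk keeps the original text and language alongside it.
    return [IndexChunk(piece, original_text, language)
            for piece in chunk_words(english, max_words)]


def toy_translate(text: str, language: str) -> str:
    """Hypothetical stand-in for a language pack: a fixed lookup table."""
    table = {"Hierva el agua.": "Boil the water."}
    return table.get(text, text)


chunks = ingest("Hierva el agua.", "es", toy_translate, max_words=2)
```

Passing the translator in as a function keeps the sketch independent of any particular translation backend; a real language pack would replace `toy_translate`.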

What this means for users

A user should eventually be able to work with English and non-English material in the same system. The main retrieval layer may be English-centered, but the content itself should remain multilingual and inspectable.

What this means for video

The same principle applies to transcripts. A video can have an original-language transcript, an English translation of that transcript, and English index chunks derived from the translated layer, all while the original source text is preserved.
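
For transcripts, one way to picture this is a list of timed segments that each carry both text layers, with English index chunks built from groups of segments. The sketch below is hypothetical throughout: `TranscriptSegment`, `TranscriptChunk`, `chunk_transcript`, and the fixed segments-per-chunk grouping are assumptions made for illustration, not the system's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptSegment:
    start_seconds: float
    end_seconds: float
    original_text: str   # transcript text in the video's own language
    english_text: str    # translated layer for the same time span


@dataclass
class TranscriptChunk:
    english_text: str            # text that would be indexed
    start_seconds: float
    end_seconds: float
    segment_indexes: List[int]   # lineage back to the timed segments


def chunk_transcript(segments: List[TranscriptSegment],
                     segments_per_chunk: int = 2) -> List[TranscriptChunk]:
    """Group consecutive segments into English chunks that keep their timing."""
    chunks = []
    for first in range(0, len(segments), segments_per_chunk):
        group = segments[first:first + segments_per_chunk]
        chunks.append(TranscriptChunk(
            english_text=" ".join(s.english_text for s in group),
            start_seconds=group[0].start_seconds,
            end_seconds=group[-1].end_seconds,
            segment_indexes=list(range(first, first + len(group))),
        ))
    return chunks


segments = [
    TranscriptSegment(0.0, 2.5, "Cierre la válvula de gas.", "Close the gas valve."),
    TranscriptSegment(2.5, 5.0, "Salga del edificio.", "Leave the building."),
    TranscriptSegment(5.0, 8.0, "Llame a emergencias.", "Call emergency services."),
]
chunks = chunk_transcript(segments)
```

Because each chunk records which segments it came from, a search hit on the English chunk can be traced back to the original-language text and its position in the video.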