Unbabel supports the translation of a large number of file formats. Different formats have different structures, rules and properties, which create different requirements when it comes to selecting which content needs to be translated and which does not. This selection is done through filter configurations. These configurations also help the translation pipeline deliver a file that respects and resembles the format of the original source.

These configurations behave differently according to the type and complexity of the file being translated. For simpler files such as plain text (txt), almost all of the text content is extracted (i.e., selected for translation), translated and delivered, whereas formats such as docx or xlsx have more complex rules. Understanding these rules helps ensure that your translated file is properly formatted and that you are in full control of the content that changes during translation. The configuration properties also let you manipulate the source file in ways you may find convenient, such as preventing specific content from being translated (protected content).

Below we list all the supported file formats and the properties of their standard filter configurations. When more information is available, click on the file format name to see additional information.

Note: when a format supports placeholders, the placeholders themselves cannot be formatted in any way (by style, colour, or other changes).

Supported file formats

CSV (Comma Separated Values)

Note: More information about the CSV filter is available in this article.

The filter extracts all table data from all columns. Generic placeholders are protected, as well as embedded HTML.

Technical information:

field delimiter is the comma ‘,’
text qualifier is the double quote (")
csv escaping mode - duplicates qualifier
excludes qualifiers from extracted text
excludes leading/trailing white spaces from the extracted text
adds qualifiers to output when appropriate
extraction mode - extracts table data
table properties - values start at line 1 (no column with names)
extracts data from all columns
the number of columns is defined by values (may vary in different rows)
allows trimming of leading/trailing spaces and tabs
converts \t, \n, \\ and \uXXXX into characters
separates lines with line-feeds (\n)
includes okf_html@FP-subfilter-default
protects generic placeholders
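
The reading rules above can be approximated with Python's standard csv module. This is only a rough sketch of the listed settings (comma delimiter, double-quote qualifier, doubled-qualifier escaping, trimmed cells, no header row), not the filter's actual implementation; the function name extract_cells and the sample data are made up for illustration.

```python
import csv
import io


def extract_cells(csv_text):
    """Return every cell of every row, trimmed, in reading order."""
    reader = csv.reader(
        io.StringIO(csv_text, newline=""),
        delimiter=",",      # field delimiter
        quotechar='"',      # text qualifier, excluded from the extracted text
        doublequote=True,   # a doubled qualifier ("") stands for a literal quote
    )
    cells = []
    for row in reader:      # line 1 is data; rows may differ in column count
        for cell in row:
            cells.append(cell.strip(" \t"))  # trim leading/trailing spaces and tabs
    return cells


# Hypothetical sample: three rows with two, three and one column.
sample = 'greeting,"She said ""hi"""\n  padded  ,second,third\nsingle\n'
print(extract_cells(sample))
```

The sketch does not cover the HTML subfilter or placeholder protection, which the real configuration applies to each extracted cell.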

DITA (Darwin Information Typing Architecture)

The filter accepts only well-formed XML documents (which adhere to specific DITA syntax rules). Generic placeholders are protected.

Technical information:

assumes the document is well-formed
preserves white space
uses codeFinder to protect generic placeholders
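
To give an idea of what placeholder protection means in practice, here is a minimal sketch that assumes the code finder works by pattern matching. The two placeholder shapes (#company and [checkout_date]) are taken from the examples named in the Markdown section of this article; the regular expression, the token format and the function names are assumptions, not the actual configuration.

```python
import re

# Assumed placeholder shapes, based on the examples #company and
# [checkout_date]; a real configuration may use different rules.
PLACEHOLDER = re.compile(r"#\w+|\[\w+\]")


def protect(text):
    """Swap each placeholder for an opaque token and remember the original."""
    found = []

    def stash(match):
        found.append(match.group(0))
        return f"<ph{len(found) - 1}/>"

    return PLACEHOLDER.sub(stash, text), found


def restore(text, found):
    """Put the original placeholders back after translation."""
    return re.sub(r"<ph(\d+)/>", lambda m: found[int(m.group(1))], text)


masked, found = protect("Welcome to #company, check out on [checkout_date].")
print(masked)
print(restore("Bem-vindo a <ph0/>, saída em <ph1/>.", found))
```

Because the translator only ever sees the opaque tokens, the placeholder text itself cannot be altered during translation.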

DITAMAP (Darwin Information Typing Architecture Map)

The filter accepts only well-formed documents (which adhere to specific syntax rules).

Technical information:

assumes the document is well-formed
lists elements and attributes for translation

DOCM (Microsoft Word)

The filter extracts everything except document properties, comments and graphical metadata. It automatically accepts revisions if they are present in the document.

Technical information:

does not extract document properties and comments
translates headers and footers
excludes graphical metadata
automatically accepts revisions and extracts their content
includes styles and highlights

DOCX (Microsoft Word)

The filter extracts everything except document properties, comments and graphical metadata.

Technical information:

extracts headers and footers
excludes graphical metadata
includes HTML subfilter
generic placeholders are not supported

Note: our filter cannot propagate tracked formatting changes. All tracked changes must be accepted before submitting the source file, or the output will be corrupted.

DTD (Document Type Definition XML)

The filter is intended to process XML DTD files that contain translatable text entity declarations.

HTML/HTM (HyperText Markup Language)

The filter extracts all content from the file, but tags are not translated. Generic placeholders are protected. Content inside any <pre> element is excluded from translation.

Technical information:

protects generic placeholders
excludes content inside <pre> elements
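
The <pre> rule can be illustrated with Python's built-in html.parser. This is a simplified sketch, not the filter itself: the class and function names are invented, and unlike a real filter it splits text at every tag instead of keeping inline tags such as <b> inside one segment.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything nested inside <pre>."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0   # how many <pre> elements are currently open
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.pre_depth == 0:
            self.segments.append(text)   # tags themselves are never collected


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


sample = "<h1>Title</h1><p>Hello <b>world</b></p><pre>x = 1</pre><p>Bye</p>"
print(extract_text(sample))
```

In this sample the content of the <pre> element never reaches the list of segments, which mirrors how it is left untouched in the delivered file.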

ICML (InCopy Markup Language)

The filter extracts all content from the file.

Technical information:

extracts master spreads
simplifies inline codes where possible
uses codeFinder for tag protection

IDML (InDesign Markup Language)

The filter extracts all content from the file, except for XML structures.

Technical information:

does not untag XML structures (the filter cannot put the tags back; this needs to be done in a manual DTP process, which might be an issue depending on the size of the file)
extracts master spreads

JSON (JavaScript Object Notation)

The filter extracts all values. Embedded HTML and generic placeholders are protected.

Technical information:

extracts all key/string pairs
extracts strings without associated key
uses key as resname
has an HTML subfilter that handles embedded HTML and protects generic placeholders
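
As an illustration of the extraction rules above, the following sketch walks a parsed JSON document and collects every string value together with the key used as its resname. The function name is hypothetical, and the way array items inherit the key of their enclosing array is an assumption made for the example; strings with no key at all get None.

```python
import json


def extract_strings(node, key=None):
    """Yield (resname, text) for every string value in a parsed JSON tree."""
    if isinstance(node, str):
        yield key, node
    elif isinstance(node, dict):
        for child_key, value in node.items():
            yield from extract_strings(value, child_key)
    elif isinstance(node, list):
        for item in node:
            # Assumption: array items reuse the key of the enclosing array.
            yield from extract_strings(item, key)
    # Numbers, booleans and null are not text, so they are skipped.


sample = '{"title": "Hello", "count": 3, "tags": ["new", "sale"], "body": {"text": "<b>Hi</b>"}}'
for resname, text in extract_strings(json.loads(sample)):
    print(resname, "->", text)
```

The value "<b>Hi</b>" is returned as-is here; in the real configuration the HTML subfilter would then protect the embedded tags.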

Markdown/MD (Markdown)

The filter extracts all content from the file. Embedded HTML and generic placeholders are protected, except placeholders of type #company and [checkout_date], as # and [...] are part of Markdown syntax.

Technical information:

translates fenced code-blocks
translates inline code blocks
translates YAML metadata header
translates image alt text
protects placeholders as inline codes (in this configuration, placeholders of type #company and [checkout_date] are not protected, as # and [...] are part of Markdown syntax)
uses the default embedded HTML filter configuration tailored for the Markdown filter (no HTML subfilter is needed)

MIF (Adobe FrameMaker Interchange Format)

The filter extracts variables, index markers, body pages and master pages.

Technical information:

extracts variables
extracts index markers
extracts body pages
extracts master pages
inline code protection for fonts

MQXLIFF (XML Localization Interchange File Format)

The filter extracts all content from the file. Generic placeholders are protected.

Technical information:

adds the target language attribute if not present
segments only if the input text is segmented
includes ITS markup
balances codes
uses a custom xml stream parser
protects generic placeholders

MXLIFF (XML Localization Interchange File Format)

The filter extracts all content from the file. Generic placeholders are protected.

Technical information:

adds the target language attribute if not present
segments only if the input text is segmented
includes ITS markup
balances codes
uses a custom xml stream parser
sets Finished segments as translate="no"
protects generic placeholders

ODP (OpenDocument (Ver 2) Presentation)

The filter extracts everything from the file. All the different embedded files are treated as sub-documents by the filter. This means that, for example, when represented in XLIFF, a single ODT extracted to a single XLIFF document is made up of three XLIFF <file> elements: one for content.xml, one for styles.xml, and one for meta.xml. Note that very often, only content.xml has extracted text.
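
OpenDocument files are ZIP packages, so the sub-documents mentioned above can be inspected with Python's zipfile module. The sketch below only lists which of the three parts are present; the function names and the in-memory demo package are made up for illustration and do not reflect how the filter is implemented.

```python
import io
import zipfile

# The three sub-documents described above. Inside OpenDocument packages the
# style part is stored under the name "styles.xml".
SUB_DOCUMENTS = ("content.xml", "styles.xml", "meta.xml")


def list_sub_documents(package):
    """Return the sub-documents present in an OpenDocument package."""
    with zipfile.ZipFile(package) as archive:
        names = set(archive.namelist())
    return [name for name in SUB_DOCUMENTS if name in names]


def build_demo_package():
    """Build a tiny in-memory stand-in for an .odt file (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in SUB_DOCUMENTS:
            archive.writestr(name, "<placeholder/>")
    buffer.seek(0)
    return buffer


print(list_sub_documents(build_demo_package()))
```

The same function accepts a path to a real .odt, .ods or .odp file in place of the demo buffer.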

Technical information:

extracts everything

ODS (OpenDocument (Ver 2) Spreadsheet)

Technical information:

extracts everything

ODT (OpenDocument (Ver 2) Text Document)

Technical information:

extracts everything

OTS (OpenDocument (Ver 2) Spreadsheet Template)

Technical information:

extracts everything

PO (Portable Object)

The filter treats the file as bilingual - it extracts the content of "msgid" and places the translation in "msgstr". Generic placeholders are protected.
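
The bilingual behaviour can be sketched as follows. This is a deliberately minimal illustration that only handles single-line msgid/msgstr pairs without escaped quotes; the function name is hypothetical and the translate argument is a stand-in for the actual translation step.

```python
import re

# Matches a single-line msgid immediately followed by a single-line msgstr.
ENTRY = re.compile(
    r'^msgid "(?P<source>.*)"\nmsgstr "(?P<target>.*)"$', re.MULTILINE
)


def fill_translations(po_text, translate):
    """Write translate(msgid) into each msgstr, leaving msgid untouched."""

    def fill(match):
        source = match.group("source")
        if not source:  # the empty msgid is the PO header entry; keep it as-is
            return match.group(0)
        return f'msgid "{source}"\nmsgstr "{translate(source)}"'

    return ENTRY.sub(fill, po_text)


po = 'msgid "Hello"\nmsgstr ""\n\nmsgid "Bye"\nmsgstr ""\n'
# str.upper stands in for a real translation, purely for demonstration.
print(fill_translations(po, str.upper))
```

The source text in msgid is never modified; only msgstr receives new content, which is what treating the file as bilingual means here.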

Technical information:

does not extract document properties and comments
extracts Masters
ignores placeholder text in Masters

PDF (Portable Document Format)

There are four standard filter configurations to handle PDF files, which vary according to the format of the target file they produce. It is possible to translate into PDF, TXT, DOCX or PPTX. See their information below. For more detailed information on translating PDFs directly, please check this article.

Technical information:

PDF to PDF

Extracts and translates all main content, including tables, headers, footers, and text formatting (bold, italic, underline).
Excludes document properties, comments, and graphical metadata.
No preservation of layout, images, styles (e.g., bold, italic), or interactive elements.
If the original file contains revisions, they are automatically accepted and included in the output.
Layout may shift due to text expansion.
Fonts are replaced with defaults.
Hyperlinks and other interactive elements may not be preserved.

PDF to TXT

All text is extracted including headers and footers.
Removes all formatting tags.
No preservation of layout, images, styles (e.g., bold, italic), or interactive elements.
May not preserve text order on more complex documents due to multiple layers.
Only plain text is extracted — visual structure and design are lost.

PDF to DOCX

Styles like bold, italic, and underline are preserved.
Revisions are accepted automatically.
Headers and footers are translated.
The resulting DOCX is editable.
Layout may shift due to text expansion.
Fonts are replaced with defaults.
Hyperlinks and other interactive elements may not be preserved.

PDF to PPTX

All content is extracted and converted.
Layout and visual structure are generally well preserved.
The resulting PPTX is editable.
Fonts are often replaced with defaults.
Text styles like bold, italic, and color may not be preserved in complex documents.
Text expansion may cause layout shifts or overlapping content.

POTM (Microsoft PowerPoint)