Unbabel supports the translation of a large number of file formats. Different formats have different structures, rules and properties, which create different requirements for selecting which content needs to be translated and which does not. This selection is done through Filter configurations. These configurations also help the translation pipeline deliver a file that respects and resembles the format of the original source.
These configurations behave differently according to the type and complexity of the file being translated. While simpler files such as plain text (txt) will mostly have their full text content extracted (i.e., selected for translation), translated and delivered, formats such as docx or xlsx have more complex rules. Understanding these rules helps ensure that the translated file is properly formatted and that you are in full control of the content that changes during translation. The configuration properties also let you manipulate the source file in ways you may find convenient, such as preventing specific content from being translated (protected content).
Below we list all the supported file formats and the properties of their standard filter configurations. When more information is available, click on the file format name to see additional information.
Supported file formats
Technical information:
- field delimiter is the comma ‘,’
- text qualifier is the double quote ‘"’
- csv escaping mode - duplicates qualifier
- excludes qualifiers from extracted text
- excludes leading/trailing white spaces from the extracted text
- adds qualifiers to output when appropriate
- extraction mode - extracts table data
- table properties - values start at line 1 (no column with names)
- extracts data from all columns
- the number of columns is defined by values (may vary in different rows)
- allows trimming of leading/trailing spaces and tabs
- converts \t, \n, \\ and \uXXXX into characters
- separates lines with line-feeds (\n)
- includes okf_html@FP-subfilter-default
- protects generic placeholders
- assumes the document is well formed
- preserves white space
- uses codeFinder to protect generic placeholders
- assumes the document is well formed
- lists elements and attributes for translation
- does not extract document properties and comments
- translates headers and footers
- excludes graphical metadata
- automatically accepts revisions and extracts their content
- includes styles and highlights
- extracts headers and footers
- excludes graphical metadata
- includes HTML subfilter
- protects generic placeholders
- extracts master spreads
- simplifies inline codes where possible
- uses codeFinder for tag protection
- does not untag XML structures (the filter cannot put the tags back, so this must be done in a manual DTP process, which may be an issue depending on the size of the file)
- extracts master spreads
- extracts all key/string pairs
- extracts strings without associated key
- uses key as resname
- has html subfilter which deals with embedded html and protects generic placeholders
- translates fenced code-blocks
- translates inline code blocks
- translates YAML metadata header
- translates image alt text
- placeholders are protected as inline codes. In this configuration, placeholders of the type #company and [checkout_date] are not protected, because # and [...] are part of Markdown syntax.
- uses the default embedded HTML filter configuration tailored for the Markdown filter (no html subfilter is needed)
- extracts variables
- extracts index markers
- extracts body pages
- extracts master pages
- inline code protection for fonts
- adds the target language attribute if not present
- segments only if the input text is segmented
- includes ITS markup
- balances codes
- uses a custom xml stream parser
- protects generic placeholders
- adds the target language attribute if not present
- segments only if the input text is segmented
- includes ITS markup
- balances codes
- uses a custom xml stream parser
- sets Finished segments as translate="no"
- protects generic placeholders
- extracts everything
- extracts everything
- extracts everything
- extracts everything
- does not extract document properties and comments
- extracts Masters
- ignores placeholder text in Masters
- does not extract document properties and comments
- extracts Masters
- ignores placeholder text in Masters
- does not extract document properties and comments
- extracts Masters
- ignores placeholder text in Masters
- does not extract document properties and comments
- extracts Masters
- ignores placeholder text in Masters
- does not extract document properties and comments
- extracts Masters
- ignores placeholder text in Masters
- does not extract document properties and comments
- extracts Masters
- ignores placeholder text in Masters
- does not extract document properties and comments
- extracts Masters
- ignores placeholder text in Masters
- uses localization directives when they are present
- extracts items outside of the scope of localization directives
- extracts comments to note properties
- converts \n and \t to line break and tab
- CodeFinder takes care of placeholders (an html subfilter deals with the embedded html)
- does not escape extended characters (\uHHHH notation)
- extracts by default //data[not(@type) and not(starts-with(@name, '>'))]/value and //data[@name='$this.Text']/value
- extracts as notes //data[not(@type) and not(starts-with(@name, '>') or starts-with(@name, '$'))]/value
- an html subfilter deals with placeholders and embedded HTML
Technical information:
- uses SDLXLIFF writer
- adds the target-language attribute if not present
- preserves whitespace by default
- skips seg-sources with no marked segments
- segments only if the input text unit is segmented
- includes ITS markup
- balances codes
- uses a custom xml stream parser
- sets Finished segments as translate="no"
- protects generic placeholders
- a regex filter processes the .srt whilst the html subfilter deals with embedded html and protects generic placeholders
- the time-codes are not added as notes due to a limitation we found when using regex filter + html subfilter
- does not include notes (limitation we faced when using regex filter + html subfilter)
- extracts the content of the source group
- preserves whitespace
- regular expressions options: dot also matches line-feed + multiline
- uses localization directives when they are present
- extracts items outside of the scope of localization directives
- escaped characters use backslash
- mime type for the document: text/plain
- protects generic placeholders and embedded HTML
- extracts for translation /plist/dict/dict/string and /plist/dict/dict/dict/string
- does not extract strings with keys NSStringFormatSpecTypeKey and NSStringFormatValueTypeKey
- protects generic placeholders
- groups all document parts skeleton into one
- skips invalid TUs
- creates the segment if segtype is ‘sentence’ or is undefined
- string used to delimit property values when there are duplicate properties: ','
- extracts text by lines
- converts \t, \n, \\ and \uXXXX into characters
- separates lines with line-feeds (\n)
- protects generic placeholders
- uses the default okf_openxml filter
- includes HTML subfilter
- it offers no specific options for Visio
- adds the target language attribute if not present
- segments only if the input text is segmented
- includes ITS markup
- balances codes
- uses a custom xml stream parser
- sets Finished segments as translate="no"
- protects generic placeholders
- does not extract document properties and comments
- does not extract hidden rows or columns
- does not extract sheet names
- does not extract diagram data
- does not extract drawings
- The html sub-filter deals with embedded html and protects generic placeholders.
- does not extract hidden rows or columns
- does not extract sheet names
- does not extract diagram data
- does not extract drawings
- embedded HTML and generic placeholders are protected
- does not extract hidden rows or columns
- does not extract sheet names
- does not extract diagram data
- does not extract drawings
- includes HTML subfilter
- accepts only valid, well-formed XML
- protects html only in CDATA
- does not protect placeholders
- preserves whitespace
- extracts isolated strings
- extracts all pairs
- uses key as name
- uses the full key path
- does not use codeFinder
- The html subfilter deals with placeholders and embedded html.
- extracts the content of the source group using regex
- preserves whitespace
- regex options: dot also matches line-feed + multi-line
- uses localization directives when they are present
- extracts items out of the scope of localization directives
- beginning/end of string: “”
- escaped characters use backslash prefix
- mime type: text/plain
- The html subfilter deals with placeholders and embedded html
- field delimiter - tab ‘\t’
- extraction mode - extracts table data
- table properties - values start at line 1 (no column with names)
- extracts data from all columns
- the number of columns is defined by values (may vary in different rows)
- allows trimming of leading/trailing spaces and tabs
- converts \t, \n, \\ and \uXXXX into characters
- separates lines with line-feeds (\n)
- protects generic placeholders
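The comma-delimited and tab-delimited table settings listed above (field delimiter, double-quote text qualifier, a duplicated qualifier as the escape, no header row, a variable number of columns, trimming of leading/trailing spaces and tabs) can be illustrated with a short Python sketch. This is not Unbabel's filter; it only mimics those rules with the standard `csv` module, and the function name `extract_cells` is made up for this example. Applying the double-quote qualifier to tab-delimited input is also an assumption, since the list above only states it for the comma-delimited configuration.

```python
import csv
import io


def extract_cells(text: str, delimiter: str = ",") -> list[list[str]]:
    """Return the cells of every row, trimmed and without qualifiers.

    Hypothetical helper that imitates the table-extraction rules listed
    above; it is not the actual filter implementation.
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,  # ',' for comma-delimited, '\t' for tab-delimited
        quotechar='"',        # text qualifier, removed from the extracted text
        doublequote=True,     # a duplicated qualifier ("") is one literal quote
    )
    # Values start at line 1 (no header row), every column is extracted,
    # and rows are allowed to have different numbers of columns.
    return [[cell.strip(" \t") for cell in row] for row in reader]
```

For example, `extract_cells('Hello,"She said ""hi"""\n  One cell  \n')` yields two rows of different lengths, with the qualifiers removed and the surrounding spaces trimmed; passing `delimiter="\t"` applies the same logic to tab-delimited input.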
What are generic placeholders?
Generic placeholders are character combinations that ensure the marked text in your document is not picked for translation and is instead delivered as in the original source, placed in a syntactically correct position in the translation. When a filter is able to handle these placeholders, we mention it in the descriptions above. If there are exceptions, they are stated in the technical information.
The list of placeholders to be used with the Standard Filter configuration is:
Placeholder pattern | Example |
{placeholder} | Dear {first_name}, thank you for contacting. |
${placeholder} | This will be sent to ${package_destination}. |
$((placeholder)) | Your voucher code is $((bonus_code)). |
{{placeholder}} | Esteemed {{contact.name}}, |
%#@placeholder@ | Please submit it by %#@end_date@. |
@PLACEHOLDER | Your shipment @PACKAGE_ID arrived. (requires capitalization) |
#placeholder | Please reach out to #department_name. (does not work with diacritics) |
%%placeholder%% | Have your %%product.name%% today! |
%placeholder | Click on %site.element to %action.1 |
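To check a source file before upload, the patterns in the table above can be approximated with regular expressions. The Python sketch below is an illustration only and not the codeFinder rules used by the actual filters. The set of characters allowed inside a placeholder name (ASCII letters, digits, underscores, and dots between name parts) is an assumption inferred from the examples, and `find_placeholders` is a hypothetical helper name.

```python
import re

# Assumption: names consist of ASCII letters, digits and underscores,
# optionally joined by dots (as in contact.name or action.1).
NAME = r"[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)*"

# More specific patterns come first so that, for example, {{name}} is not
# picked up as {name} and %%name%% is not picked up as %name.
PLACEHOLDER_RE = re.compile(
    "|".join(
        [
            r"\$\(\(" + NAME + r"\)\)",  # $((placeholder))
            r"\$\{" + NAME + r"\}",      # ${placeholder}
            r"\{\{" + NAME + r"\}\}",    # {{placeholder}}
            r"\{" + NAME + r"\}",        # {placeholder}
            r"%#@" + NAME + r"@",        # %#@placeholder@
            r"%%" + NAME + r"%%",        # %%placeholder%%
            r"%" + NAME,                 # %placeholder
            r"@[A-Z][A-Z0-9_]*\b",       # @PLACEHOLDER (capital letters only)
            r"#[A-Za-z0-9_]+",           # #placeholder (ASCII only, no diacritics)
        ]
    )
)


def find_placeholders(text: str) -> list[str]:
    """Return every substring that looks like a generic placeholder."""
    return [match.group(0) for match in PLACEHOLDER_RE.finditer(text)]
```

Calling `find_placeholders("Click on %site.element to %action.1")` returns `['%site.element', '%action.1']`. Because the patterns are deliberately loose, ordinary text such as a hashtag or a percent sign followed by a word would also be reported, so the result is best treated as a list of candidates to review.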
Standard vs Custom filters and File Engineering Services
Standard filters are available to any Unbabel customer who has just started translating Projects. However, some files might have requirements that are not covered by the standard configuration. When this is the case, we can either manipulate the file outside the standard configuration through a purchased File Engineering service (usually when the translation is a one-time event) or create a custom filter that specifically accommodates the file requirements for current and future translations.
Filter configurations other than the standard one look for a specific file structure, as they are tailor-made to fit certain file layouts or content distributions. It is therefore important to make sure the file you are translating is the correct fit for the selected configuration.