Description
Comma-separated values (CSV) files are delimited text files that, most typically, use a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields (or columns), separated by commas.
CSV files are used to save and transfer structured information in a simple, easy-to-read manner.
Unbabel filter specifications
When handling a CSV file, the Unbabel filter decides which content to translate and which to leave out, according to a set of rules. The basic rules are represented in this image:
Below you can find the most significant rules of the filter:
- Commas - and only commas - are assumed to separate columns unless escaped inside "" quotation marks.
- Quotation marks "" are used to qualify text as a single column/field.
- We recognize, extract and translate content from all columns and rows.
- Spaces are preserved in the file but are trimmed when the file is split or read by most software.
- Line breaks are not kept during translation; all text is moved onto the same line.
- The \t and \n escape sequences can be used.
- \\ is treated as literal characters and \uXXXX as UTF-8 encoded characters.
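As a rough illustration of the comma and quotation-mark rules above, the sketch below parses a small sample with Python's standard csv module. This is not the Unbabel filter itself; the sample data is hypothetical, and the standard parser is only assumed to behave similarly for these two rules.

```python
import csv
import io

# Hypothetical sample data: the quoted field contains a comma that must not
# be treated as a column separator.
sample = 'id,text\n1,"Hello, world"\n2,Plain text\n'

# By default, csv.reader splits on commas and treats "" as field qualifiers.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(row)
```

The qualified field comes back as a single value, comma included, while the unqualified fields are split at every comma.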
Functional placeholders
The commands listed below act as placeholders, blocking the content within them from being translated. If the column/field contains other content, placeholders and their text are displayed to the editors but cannot be changed. They can, however, be moved within the sentence to allow for syntax correction.
If a field contains only a placeholder, it is neither displayed nor translated.
The following character combinations will work as placeholders (capitalization is required when present):
- {placeholder}
- ${placeholder}
- $((placeholder))
- {{placeholder}}
- %#@placeholder@
- @PLACEHOLDER
- #placeholder
- %%placeholder%%
- %placeholder
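To check a file for these combinations before sending it, a regular-expression sketch such as the one below may help. The patterns are assumptions derived only from the shapes listed above (for example, that @PLACEHOLDER uses upper-case letters, digits and underscores); the names `PLACEHOLDER_PATTERNS`, `find_placeholders` and `is_placeholder_only` are hypothetical, and the filter's real matching rules may differ.

```python
import re

# Assumed patterns for each placeholder shape in the list above.
# More specific shapes come first so they win over shorter ones.
PLACEHOLDER_PATTERNS = [
    r"\$\(\([^()]+\)\)",   # $((placeholder))
    r"\$\{[^{}]+\}",       # ${placeholder}
    r"\{\{[^{}]+\}\}",     # {{placeholder}}
    r"\{[^{}]+\}",         # {placeholder}
    r"%#@\w+@",            # %#@placeholder@
    r"%%\w+%%",            # %%placeholder%%
    r"%\w+",               # %placeholder
    r"@[A-Z][A-Z0-9_]*",   # @PLACEHOLDER (upper case assumed)
    r"#\w+",               # #placeholder
]
PLACEHOLDER_RE = re.compile("|".join(PLACEHOLDER_PATTERNS))


def find_placeholders(text):
    """Return every placeholder-like token found in text, in order."""
    return PLACEHOLDER_RE.findall(text)


def is_placeholder_only(field):
    """Return True when the field is a single placeholder and nothing else."""
    return PLACEHOLDER_RE.fullmatch(field.strip()) is not None


print(find_placeholders("Hello {name}, your code is ${code}"))
print(is_placeholder_only("{{user}}"))
```

A field for which `is_placeholder_only` returns True corresponds to the case described above, where the field is neither displayed nor translated.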
HTML handling
All content surrounded by single angle brackets is considered HTML by the filter and is removed from all steps of the translation. Example: I am <b>sending</b> this for translation -> I am sending this for translation.
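The effect of this rule can be approximated with a simple regular expression. The sketch below is only an illustration under the assumption that anything between one < and the next > is removed; `strip_angle_bracket_content` is a hypothetical helper, not part of the filter.

```python
import re

# Assumption: any run of text between "<" and the next ">" counts as HTML.
TAG_RE = re.compile(r"<[^<>]+>")


def strip_angle_bracket_content(text):
    """Remove everything enclosed in single angle brackets."""
    return TAG_RE.sub("", text)


print(strip_angle_bracket_content("I am <b>sending</b> this for translation"))
# Angle brackets used for anything other than HTML are caught as well:
print(strip_angle_bracket_content("if x < 5 and y > 3"))
```

The second call shows why non-HTML angle brackets are risky: under this assumption, the text between them disappears from what is translated.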
Best practices
- Avoid using <> for anything other than HTML. Doing so deprives both our MT model and human editors of the content inside the brackets, which will compromise the translation.
- Make sure to use proper qualifiers to protect content you don't want to be broken into different columns.
- If you're escaping a sentence with a comma and your qualifiers are not at the beginning and end of the field, the comma will break it into two fields. Example: This is a "strange, yet true" statement is read as two fields, split at the comma (-> This is a strange and yet true). If you want it to be treated as a single field, send "This is a "strange, yet true" statement" instead.
- Some software will escape content again when creating a CSV. For example, if any quotation marks are used, Microsoft Excel will escape content between commas and then again at the beginning and end of the cell, effectively triple-escaping it.
- The output CSV preserves the number of white spaces in the source. However, when the file is read by certain software, spaces will be trimmed: any spaces at the beginning or end of a column are removed, and multiple consecutive spaces inside qualified text are reduced to one.
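The qualifier and white-space points above can be tried out locally before sending a file. The sketch below uses Python's standard csv module, which is not the Unbabel filter and may escape quotation marks differently from what the filter expects; `trim_spaces` is a hypothetical helper that only mimics the trimming described in the last point.

```python
import csv
import io
import re

sentence = 'This is a "strange, yet true" statement'

# Qualifiers in the middle of a field do not protect the comma:
# a standard reader splits the sentence into two fields.
fields = next(csv.reader(io.StringIO(sentence + "\n")))
print(fields)

# A standard writer qualifies the whole field and doubles the inner
# quotation marks, similar to the extra escaping some software applies.
buffer = io.StringIO()
csv.writer(buffer).writerow([sentence])
written = buffer.getvalue().strip()
print(written)


def trim_spaces(field):
    """Mimic software that trims edge spaces and collapses repeated spaces."""
    return re.sub(r" {2,}", " ", field.strip(" "))


print(repr(trim_spaces("  two   spaces  ")))
```

Comparing a field before and after `trim_spaces` gives a quick idea of which white space may be lost when the translated file is opened in other software.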
Download the attachment for a template of a valid CSV file.