Metadata for the Multilingual Web - Usage Scenarios and Implementations

Abstract

The W3C Internationalization Tag Set 2.0 (ITS 2.0), developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 has much in common with its predecessor ITS 1.0 but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF) as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, members of the Working Group and the LT-Web project created various implementations that exemplify how ITS 2.0 supports the integration of automated processing of human language into core Web technologies. These implementations and the corresponding usage scenarios are sketched in this document. Each section of the document comprises the following:

Description - An explanation of the scenario
Data category usage - An explanation of which ITS 2.0 data categories are involved in the automated processing (for details on the data categories, see W3C Internationalization Tag Set 2.0)
Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated processing
Information on Implementation Status/Issues - Links to tools and implementers (detailed information, running software, source code etc.)

2. Usage scenarios

2.1 Simple Machine Translation

2.1.1 Description

Translate XML and HTML5 content via a Machine Translation (MT) system such as Microsoft Translator.
The parts of the content that should be translated are first extracted based on ITS 2.0 markup. The extracted parts are sent to the MT system. After translation, the translated content is merged back with the parts that are not translation-relevant (recreating the original XML or HTML5 format).

Benefits:

The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
Processing details such as the need to preserve white space can be passed on.
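
The extract, translate, merge cycle described above can be sketched as follows. This is a minimal illustration, assuming only local its:translate attributes on XML elements; global rules, HTML5 input, in-line codes and text that follows a child element are not handled. The function names and the stand-in fake_mt callable are hypothetical and are not part of Okapi or of any MT service API.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"

# Keep the conventional "its" prefix when the document is serialized again.
ET.register_namespace("its", ITS_NS)


def translatable_elements(root):
    """Yield elements whose own text is translatable.

    The its:translate value is inherited by child elements unless
    a child overrides it locally.
    """
    def walk(elem, inherited):
        flag = elem.get(TRANSLATE, inherited)
        if flag == "yes" and elem.text and elem.text.strip():
            yield elem
        for child in elem:
            yield from walk(child, flag)

    yield from walk(root, "yes")


def translate_document(xml_text, mt):
    """Extract translatable text, pass it to mt(), and merge the result back."""
    root = ET.fromstring(xml_text)
    for elem in translatable_elements(root):
        elem.text = mt(elem.text)
    return ET.tostring(root, encoding="unicode")


def fake_mt(text):
    """Hypothetical stand-in for a real MT call: upper-cases the text."""
    return text.upper()
```

Calling translate_document(doc, fake_mt) on a document that contains an element marked its:translate="no" leaves that element, including its children, untouched.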

2.1.2 Data category usage

Translate - Parts that are not translation-relevant are marked (and protected).
Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
Preserve Space - Extracted parts/text units can be annotated with the information that whitespace is relevant and thus needs to be preserved.
Domain - Domain values are placed into a property that can be used to select an MT system and/or to provide domain-related metadata to an MT system.

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Details (slides)
Running software: http://code.google.com/p/okapi/downloads/list
Source code: http://code.google.com/p/okapi/source/browse/
General documentation: http://www.opentag.com/okapi/wiki/

Implementation status/issues:

Only the first occurrence of the Domain value triggers the selection of the engine.
Preserve Space is currently not respected by the engine.

2.2 Translation Package Creation

2.2.1 Description

Create a Translation Package in OASIS XML Localization Interchange File Format (XLIFF) from XML or HTML5 content.
Based on its ITS 2.0 metadata, the content goes through a processing pipeline (e.g. extraction of translation-relevant parts). At the end of the pipeline, an XLIFF package is stored.

Benefits:

The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
Processing details such as the need to preserve white space can be passed on.
Efficient version comparison and leveraging of existing translations is possible.
Information such as the domain of the content, external references, or localization notes is made available in the XLIFF package. Thus, any XLIFF-enabled tool can make use of this information to provide translation assistance.
Terms in the source content are marked, and thus can be matched against a terminology database.
Constraints about storage size and allowed characters help to meet physical requirements.

2.2.2 Data category usage

Translate - Parts that are not translation-relevant are marked (and protected).
Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
Preserve Space - The information is mapped to xml:space
Id Value - The value is connected to the name of the extracted text unit.
Domain - Values are placed into a corresponding okp:itsDomain attribute.
Storage Size - The information is placed in native ITS 2.0 markup.
External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef attribute.
Terminology - The information about terminology is placed in a special XLIFF note element.
Localization Note - The text is placed in an XLIFF note.
Allowed Characters - The pattern is placed in its:allowedCharacters.
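
A few of the mappings listed above can be illustrated with a small sketch that builds one XLIFF 1.2 trans-unit. It is an assumption-laden example, not Okapi's actual output: mapping the Id Value to the resname attribute is a guess at what "name of the extracted text unit" means, the Okapi-specific okp: attributes are left out because their namespace is not given here, and make_trans_unit is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_trans_unit(unit_id, source_text, id_value=None,
                    preserve_space=False, loc_note=None):
    """Build one XLIFF 1.2 trans-unit carrying selected ITS 2.0 information."""
    unit = ET.Element(f"{{{XLIFF_NS}}}trans-unit", {"id": unit_id})
    if id_value:
        # Assumption: the ITS Id Value becomes the unit's resname.
        unit.set("resname", id_value)
    if preserve_space:
        # Preserve Space is mapped to xml:space="preserve".
        unit.set(f"{{{XML_NS}}}space", "preserve")
    source = ET.SubElement(unit, f"{{{XLIFF_NS}}}source")
    source.text = source_text
    if loc_note:
        # Localization Note is placed in an XLIFF note element.
        note = ET.SubElement(unit, f"{{{XLIFF_NS}}}note")
        note.text = loc_note
    return unit
```

The resulting element can be serialized with ET.tostring(unit, encoding="unicode") and placed inside the body of an XLIFF file element.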

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Details (slides)
ITS 2.0 support in Okapi's XLIFF components: http://www.opentag.com/okapi/wiki/index.php?title=ITS_Components
Running software: http://code.google.com/p/okapi/downloads/list
Source code: http://code.google.com/p/okapi/source/browse/
General documentation: http://www.opentag.com/okapi/wiki/

Implementation status/issues:

ITS to XLIFF and XLIFF to ITS mapping needs to be finalized

2.3 Quality Check

2.3.1 Description

Load XML, HTML5 and XLIFF content for which ITS 2.0 metadata exists into a tool that performs different kinds of quality checks (CheckMate, a tool for checking quality).
The XML and HTML5 content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.
The XLIFF content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.

Benefits:

The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
The ITS 2.0 markup provides key information to drive quality-related checks.
The ITS 2.0 markup allows all different file formats to be handled in the same way by the quality checking tool.

2.3.2 Data category usage

Translate - Parts that are not translation-relevant are marked (and protected).
Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
Id Value - The ids are used to identify all entries with an issue.
Storage Size - The content is verified against the storage size constraints.
Allowed Characters - The content is verified against the pattern matching allowed characters.
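
The two constraint checks named above can be sketched as follows. This is an illustration under stated assumptions, not CheckMate's implementation: the storage size is counted in bytes of the encoded text (UTF-8 unless another encoding is given), and the allowed-characters pattern is treated as a regular expression that every single character must match. Both function names are hypothetical.

```python
import re


def check_storage_size(text, max_size, encoding="utf-8"):
    """Return True if the encoded text fits into max_size bytes."""
    return len(text.encode(encoding)) <= max_size


def check_allowed_characters(text, pattern):
    """Return True if every character of text matches the pattern.

    The pattern is expected to describe one character, for example
    "[a-z]" or "[^*+]", so it is repeated to cover the whole string.
    """
    return re.fullmatch(f"(?:{pattern})*", text) is not None
```

For instance, "Grüße" occupies seven bytes in UTF-8 although it has only five characters, so it passes a limit of 7 but fails a limit of 6.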

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Details (slides)
Running software: http://code.google.com/p/okapi/downloads/list
Source code: http://code.google.com/p/okapi/source/browse/
General documentation: http://www.opentag.com/okapi/wiki/

Implementation status/issues:

Okapi's quality checker step does not map its warning levels properly to the ITS severity values.

2.4 Processing HTML documents with an XML tool chain

2.4.1 Description

Turn HTML5 with "its-" attributes into XHTML with "its:" prefixes.

Benefits:

Allows processing of HTML5 documents with XML tools.
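
The attribute renaming at the heart of this conversion can be sketched as follows. The sketch covers only the name mapping from the hyphenated "its-" form to the camel-case "its:" form; it does not parse HTML, does not convert attribute values, and does not treat HTML attributes such as translate, lang or id, which carry ITS information without an "its-" prefix. The function name is hypothetical and the code is not taken from html5-its-tools.

```python
def html_its_attr_to_xml(name):
    """Map an HTML5 "its-" attribute name to its XML "its:" counterpart.

    Example: "its-loc-note" becomes "its:locNote". Names without the
    "its-" prefix are returned unchanged.
    """
    prefix = "its-"
    if not name.startswith(prefix):
        return name
    parts = name[len(prefix):].split("-")
    # First part stays lower case, following parts are capitalized.
    camel = parts[0] + "".join(part.capitalize() for part in parts[1:])
    return "its:" + camel
```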

2.4.2 Data category usage

All data categories are covered.

2.4.3 More Information and Implementation Status/Issues

Details: Command-line tool, which uses a general HTML5 library to create XML (see https://github.com/kosek/html5-its-tools)
Source code: https://github.com/kosek/html5-its-tools

2.5 Validating HTML with ITS metadata

2.5.1 Description

W3C uses validator.nu as an experimental validator for HTML5. For HTML5 with ITS 2.0 metadata, validator.nu generates errors, since "its-" attributes are not valid HTML5.
The software allows validation of HTML5+ITS 2.0 with validator.nu (soon to be deployed as the HTML5+ITS 2.0 validator at the W3C validation service).

Benefits:

Allows the validation of HTML5 documents which include ITS 2.0 markup.
Detects errors in ITS 2.0 markup for HTML5.

2.5.2 Data category usage

All data categories are covered

2.5.3 More Information and Implementation Status/Issues

On-line validation service
Command-line tool, which uses a general HTML5 library (see https://github.com/kosek/html5-its-tools)

Source code: https://bitbucket.org/kosek and https://github.com/kosek/html5-its-tools

2.6 Interchange between Content Management System and Translation Management System

2.6.1 Description

Content is roundtripped between a Content Management System (CMS) and Translation Management System (TMS).
The content originates in a CMS, and gets exposed/serialized as XHTML + ITS 2.0. This is sent to a TMS, and processed in a workflow. Upon completion, the TMS exposes/serializes localized/translated XHTML + ITS 2.0 to the CMS.
See ITS 2.0 for localization of content in a Web Content Management System for the description of the CMS side

Benefits:

Facilitated coupling/interoperability between CMS and TMS.
Cost and quality benefits for Language Service Buyer (CMS side) and Language Service Provider (TMS side).
Language Service Buyer has more control of the localization workflow via ITS 2.0 metadata
1. Automatic (e.g. via data category "Translate")
2. Semiautomatic (e.g. via data category "Domain")
3. Manual (e.g. via data category "Localization Note")

2.6.2 Data category usage

Translate (global and local usage) - Parts that are not translation-relevant are marked (and protected).
Localization Note (global and local usage) - Provide additional information for process managers, translators and reviewers to facilitate processing.
Domain (global usage) -
1. Provide additional information for process managers, translators and reviewers to facilitate processing.
2. Control workflow dimensions such as selection of dictionaries and translation memories on the TMS side.

Language Information (local usage) - Control workflow dimensions such as selecting suitable translators and reviewers. Also adds context information that helps to decide if a piece of content shall or shall not be translated.
Allowed Characters (local usage) - The content is verified against the pattern matching allowed characters to ensure that on the TMS side, no inappropriate characters become part of the content (e.g. due to work of a translator).
Storage Size (local usage) - The content is verified against the storage size constraints to ensure that on the TMS side, no capacity limitations related to the content are violated (e.g. due to a lengthy translation).
Provenance (local usage) - Allows tracking of human agents or software agents that processed the content on the TMS side. In the case of updates, provenance/tracking information will enable the TMS side to assign or propose the same human agents (translators, or reviewers) that participated in the initial processing.

Additional data category (not part of ITS 2.0):

Readiness (global usage) - Provides information to translation process managers (examples: When was the content ready to be processed? What is the deadline? What is the priority? Which service/process variant is relevant?)

2.6.3 More Information and Implementation Status/Issues

Tools (developed by Linguaserve):

Details: modified version of internal localization workflow; pre-production/post-production CMS XHTML + ITS 2.0 processing engine
Running software: https://www-pre.linguaserve.net/las_demos/control/MLWLTWP3DemoEngine (credentials: user= demos; password=demosLingu@serve)
General documentation

Implementation status:

Successfully tested roundtripping Drupal XHTML files utilizing supported ITS 2.0 data categories in workflow
Used in productive translation

Implementation issues:

Compliant implementation of ITS 2.0 global rules not finished yet

2.7 Content Internationalization and Advanced Machine Translation

2.7.1 Description

Enable an HTML5 content reviser (language editor, translation post-editor) to add ITS 2.0 metadata to the contents of web documents.
Use the ITS 2.0 metadata to control the behavior of different Machine Translation (MT) systems and of a Multilingual Publication System.
Covers post-editing of translations generated by MT.

Benefits:

The ITS 2.0 markup:

provides key information to drive the reliable extraction of translation-relevant content from HTML5;
helps to control workflow dimensions such as selection of domain-specific vocabulary to improve the Machine Translation results;
provides information for post-editing.

2.7.2 Data category usage

Translate - Parts that are not translation-relevant are marked (and protected).
- Implementers: Linguaserve, DCU, LucySoftware.
Localization Note - Provides additional information for language or translation editors to facilitate translation.
- Implementers: Linguaserve.
Language Information - Controls workflow dimensions such as setting the source language and the target language (via the lang attribute of the output). It also protects content whose lang attribute differs from the source language from being translated.
- Implementers: Linguaserve, DCU, LucySoftware.
Domain - Domain values are mapped to the domains used by the individual MT systems, and used to select the appropriate vocabulary.
- Implementers: Linguaserve, DCU, LucySoftware.
Provenance - Allows tracking of human agents (language or translation editors) or software agents (MT systems) that processed the content.
- Implementers: Linguaserve.
Localization Quality Issue - Can be provided for the translated content by the reviser. Can be utilized for example by MT developers to improve the MT System.
- Implementers: Linguaserve.
Locale Filter - Reveals that content is only relevant for certain locales (useful in localization).
- Implementers: DCU.
MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
- Implementers: DCU.
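
The mapping of Domain values to engine-specific domains can be sketched as follows. The pair syntax (comma-separated pairs, quotation marks around values that contain spaces, case-insensitive comparison of the left-hand value) follows the ITS 2.0 domainMapping attribute; the engine-side names in the example are made up, the function names are hypothetical, and commas inside quoted values are not supported by this simple parser. None of this reflects how the implementers listed above perform the mapping.

```python
import shlex


def parse_domain_mapping(mapping):
    """Parse a mapping such as "automotive auto, 'criminal law' law".

    Returns a dict from lower-cased content domain to engine domain.
    """
    table = {}
    for pair in mapping.split(","):
        if not pair.strip():
            continue  # tolerate a trailing comma
        parts = shlex.split(pair)  # honors single and double quotes
        if len(parts) != 2:
            raise ValueError(f"not a mapping pair: {pair!r}")
        content_domain, engine_domain = parts
        table[content_domain.lower()] = engine_domain
    return table


def map_domain(value, table):
    """Return the engine domain for a content domain, or the value itself."""
    return table.get(value.strip().lower(), value.strip())
```

A consumer could then select the vocabulary or engine with map_domain("Criminal Law", table), which falls back to the original value when no pair matches.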

2.7.3 More Information and Implementation Status/Issues

Tools:

Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
Statistical MT System MaTrEx (DCU).
Rule-based MT System (LucySoftware).

Details: https://www.w3.org/International/multilingualweb/lt/wiki/Online_MT_Systems_Use_Case_Demonstration
Running software:
- ITS 2.0 ATLAS PW1 Testing Page (credentials: user=mlw-lt_public password=MLWLT4atlaspw1$).
- ITS 2.0 ATLAS PW1 Prototype (credentials: user=mlw-lt password=its2-wp4).
- Spanish Tax Agency showcase (credentials: user=mlw-lt password=its2-wp4).
- MaTrEx.
- ITS 2.0 LucySoftware prototype (credentials: user=mlwlt password=ltweb11).
Source code: Not public
General documentation:

Implementation issues:

Implementation of ITS 2.0 translate data category for attributes currently restricted to global rules

2.8 Using ITS with GNU gettext utilities/PO files

2.8.1 Description

The GNU gettext utilities assist in internationalizing and translating in the context of UNIX-like Operating Systems. The file format of the utilities is the GNU gettext portable object (PO) file format.
The implementation - ITS Tool - enables roundtripping between PO files and XML formats like Mallard.
ITS Tool includes default rules for various formats, and uses them for PO file generation.
ITS Tool is aware of various ITS 2.0 data categories in the PO file generation step.
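
The PO generation step can be illustrated with a small sketch that writes one PO entry for an extracted string. It assumes that a Localization Note is written as an extracted comment (a line starting with "#.") and that the source location goes into a reference comment (a line starting with "#:"); whether ITS Tool emits exactly these lines is not stated here, and both function names are hypothetical.

```python
def po_escape(text):
    """Escape a string for use inside a double-quoted PO string."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\t", "\\t"))


def po_entry(msgid, loc_note=None, reference=None):
    """Return one PO entry with an empty translation."""
    lines = []
    if loc_note:
        # One extracted-comment line per line of the note.
        for note_line in loc_note.splitlines():
            lines.append(f"#. {note_line}")
    if reference:
        lines.append(f"#: {reference}")
    lines.append(f'msgid "{po_escape(msgid)}"')
    lines.append('msgstr ""')
    return "\n".join(lines) + "\n"
```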

2.8.2 Data category usage

Preserve Space
Locale Filter
External Resource
Translate
Elements Within Text
Localization Note
Language Information

2.8.3 More Information and Implementation Status/Issues

Details: http://itstool.org/

Implementation status/issues:

Need to convert built-in rules to new categories, and to deprecate extensions (not a conformance blocker).
No support for its:param (blocked by lacking support for setting XPath variables in libxml2 Python bindings; patch pending review).
No support for HTML. libxml2's HTML parser does not correctly handle HTML5. Need to evaluate other libraries.

2.9 Harnessing ITS Metadata to Improve the Human Review Process

2.9.1 Description

The implementation - the "Reviewer's Workbench" (a desktop application) - reads HTML, XML and XLIFF files annotated with ITS 2.0 metadata.
At each segment of the original content, the ITS metadata is made accessible to reviewers. Reviewers can adapt the access via user-definable filter/formatting "rules". The metadata allows human reviewers to make efficient decisions.
During the review of translations, reviewers can add Localization Quality Issue annotations (which are serialized as ITS 2.0 metadata when the file is saved). Provenance annotations are added in the background.
The combination of captured Localization Quality Issue and Provenance data then becomes valuable data which can be used for traditional business intelligence, or semantic web applications.
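
How a Localization Quality Issue annotation might be serialized into HTML5 can be sketched as follows. The attribute names are the ITS 2.0 HTML5 attributes for this data category and the severity range 0 to 100 follows the specification; the helper name and the choice of a span wrapper are assumptions, and the code is not taken from the Reviewer's Workbench, which is closed source.

```python
from html import escape


def annotate_quality_issue(text, issue_type, comment=None, severity=None):
    """Wrap text in a span carrying ITS 2.0 quality issue attributes."""
    attrs = [("its-loc-quality-issue-type", issue_type)]
    if comment is not None:
        attrs.append(("its-loc-quality-issue-comment", comment))
    if severity is not None:
        if not 0 <= severity <= 100:
            raise ValueError("severity must be between 0 and 100")
        attrs.append(("its-loc-quality-issue-severity", str(severity)))
    attr_text = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs
    )
    return f"<span {attr_text}>{escape(text)}</span>"
```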

2.9.2 Data category usage

Provenance
Localization Quality Issue

Benefits:

Increases review effectiveness as reviewers can be informed by metadata.
Harvests data during review.
Facilitates audit and quality correction.

2.9.3 More Information and Implementation Status/Issues

Application development currently at alpha stage.
Awaiting finalization of XLIFF mappings and underlying Okapi filter support.
Application is closed source.

2.10 XLIFF-based Machine Translation

2.10.1 Description

Invoke Machine Translation (MT) from a localization workflow using ITS 2.0 integrated with the XML Localization Interchange File Format (XLIFF)

2.10.2 Data category usage

Domain - The domain value can be used by the MT system to improve processing accuracy
Translate - Parts that are not translation-relevant are marked (and protected).
MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
Terminology - Forces the MT system to translate specific words or phrases according to terminological information
Provenance - Allows tracking of human agents (content editors) or software agents (MT systems) that processed the content.

Benefits:

The use of XLIFF allows an MT system to be integrated seamlessly into automated localization workflows involving commercial Translation Management Systems and Computer Assisted Translation (CAT) tools.
The use of XLIFF and ITS 2.0 facilitates the integration of/switch between multiple MT systems to provide alternative translation within a single project workflow.
The use of the ITS 2.0 "translate" attribute ensures that content is not altered by the MT system - especially if that content is included in a translation project as context for human agents such as translation post-editors.
The ITS 2.0 "domain" metadata in XLIFF ensures that the most relevant MT engine can be selected by the MT system.
Combining XLIFF and ITS 2.0 "terminology" metadata forces the MT system to translate specific words or phrases according to terminological information.
Integrating ITS 2.0 MT confidence scores into XLIFF target language translation enables them to be presented to translation post-editors.
Recording provenance information enables localization managers to compare the performance of different MT engines or systems, or different translation post-editors.

2.10.3 More Information and Implementation Status/Issues

Details: http://www.w3.org/International/multilingualweb/lt/wiki/Simple_Segment_Machine_Translation_Use_Case_Demonstration
Tools: TCD CMS-LION / DCU MaTrEx

2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML

2.11.1 Description

SOLAS is a service-based architecture for orchestrating localization workflows among XLIFF-aware components.

One of the SOLAS components is an Okapi-based Extractor/Merger service that maps ITS 2.0 categories onto XLIFF 1.2.
SOLAS is also integrated with CMS-L10N, and can receive and return XLIFF jobs created by CMS-L10N.

CMS-L10N (aka LION) is basically a middleware component based on an RDF triple store over an arbitrary CMS (tested with Alfresco, Drupal and Wikimedia).

CMS-L10N can parse the source, including most of the ITS 2.0 metadata, and produce XLIFF 1.2 according to a currently agreed mapping. After the roundtrip, which is handled via SOLAS, it updates the RDF triple store accordingly.

Benefits:

The use of ITS 2.0 and XLIFF helps to modularize and connect specialized (single-purpose) components.
SOLAS can handle input from components that are aware of different ITS 2.0 categories, or not aware of ITS at all, and combine them. SOLAS orchestration ensures basic ITS compliance even with ITS-unaware components. For example, if a service provider is unaware of the translate flag, SOLAS can filter the translation request for that provider, so that the flag is actually interpreted.
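
One way to enforce the translate flag in front of an ITS-unaware service is to mask protected spans with placeholders before the request and to restore them afterwards. The sketch below illustrates this general technique only; it is not how SOLAS is implemented. It assumes that the service returns the placeholder tokens unchanged, and the function name, the token format and the toy_mt stand-in are hypothetical.

```python
def protect_and_translate(parts, mt):
    """Translate a segment while keeping non-translatable parts intact.

    parts is a list of (text, translatable) tuples that make up one
    segment; mt is any callable that translates a plain string.
    """
    protected = {}
    masked = []
    for index, (text, translatable) in enumerate(parts):
        if translatable:
            masked.append(text)
        else:
            token = f"__ITS{index}__"  # hypothetical placeholder format
            protected[token] = text
            masked.append(token)
    translated = mt("".join(masked))
    # Put the protected content back in place of the placeholders.
    for token, text in protected.items():
        translated = translated.replace(token, text)
    return translated


def toy_mt(text):
    """Hypothetical stand-in for an ITS-unaware MT service."""
    return text.replace("Press", "Drücken Sie").replace("to save", "zum Speichern")
```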

2.11.3 More Information and Implementation Status/Issues

Implementer: TCD/UL, making use of MT components by Moravia and DCU, and of JSI Enrycher as the text analysis service.

This tool is based on an ITS-XLIFF mapping:

The mapping is currently under discussion.
The goal is to freeze the mapping and to produce a best practice note within lifespan of the LT-Web project.
The focus is currently on XLIFF 1.2, favoring solutions that can be structurally preserved in XLIFF 2.0, which is the main target in the long run.

Although all ITS categories listed above, as encoded by OKAPI or TCD's CMS-LION, are covered, the demos in mid March show consumption of mainly the following: translate, term, text analysis, domain, localization note, provenance, and MT confidence. The demos involve:

An XLIFF-based source quality assurance tool (LKR by UL)
A Project Manager/Localization Engineer friendly XLIFF Viewer/Editor (LocConnect by UL)
Integrated Machine Translation Solutions

Moravia's implementation of M4Loc and Moses with ITS 2.0 support
DCU MaTrEx with ITS 2.0 support
Fallback handling of the ITS 2.0 information within SOLAS MT Service Mapper with services that are not ITS 2.0 aware, such as Microsoft Bing

Details (M4Loc processing of ITS2.0 enhanced XLIFF files):

Running software: http://mlwlt.moravia.com (testing site)
Running software (web-service): http://mlwlt.moravia.com/mlwlt-service-xliff-mt/mlwlt-service.asmx
Source code: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt
General documentation: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt/wiki

Please note that the links to the running software are currently accessible only to the SOLAS system. They should become public next week.

2.12 ITS for localization of content in a Web Content Management System

2.12.1 Description

Drupal is a Web Content Management System (WCMS).
The Drupal modules developed by Cocomore
1. add the ability to apply ITS 2.0 local metadata through Drupal's WYSIWYG editor;
2. add the ability to apply global ITS 2.0 metadata at content mode level;
3. provide a jQuery plugin to optimize the GUI of the Translation Management tool (the jQuery plugin is also published as a standalone download).

Benefits:

Support for ITS 2.0 in Drupal facilitates the localization/translation of Drupal-based content.
The Drupal modules facilitate the roundtripping process from WCMS with systems of Localization Service Provider (including automatic content re-integration).
The Drupal modules enable tracking of provenance information (e.g. to identify translation post-editors).

2.12.2 Data category usage

Translate - Mark content which should not be translated and highlight this marked content.
Localization Note - Add a note for the translator to improve their understanding of the content so that they can produce a better translation.
Domain - Set the domain of a text to improve the machine and human translation process.
Provenance - Check which translator/reviser worked on content.
Allowed Characters/Storage Size - Make the translator aware of restrictions for specific content, such as disallowed characters or a maximum length of a translation. These constraints are automatically set by Drupal.
Text Analysis - Annotate text with terminology metadata to improve the machine and human translation process.

2.12.3 More Information and Implementation Status/Issues

Tool: Drupal Module for editing and viewing of ITS 2.0 markup (Cocomore AG)

Source Code/Documentation: http://drupal.org/sandbox/kfritsche/1908292

Tool: Drupal Module to connect to TMGMT Translator Linguaserve (Cocomore AG)

Description of the TMS side: Interchange between Content Management System and Translation Management System
Source Code/Documentation: http://drupal.org/sandbox/kfritsche/1908422

Tool: Drupal Module to interact with TMGMT Workflow (Cocomore AG)

Details: Adds possibility to have additional steps before/after translation and integrates the Text Analysis results from "Enrycher".
Source Code/Documentation: http://drupal.org/sandbox/kfritsche/1908598

Tool: ITS 2.0 jQuery Plugin (Cocomore AG)

Details: Selector plugin to read ITS 2.0 data from a node or select nodes by specified ITS markup.
Running software: http://plugins.jquery.com/its-parser/
Source Code: https://github.com/attrib/jquery-its2-src
Documentation: https://github.com/attrib/jquery-its2

2.13 Integrating ITS, Content Management Interoperability Services, and W3C Provenance

2.13.1 Description

Localization interoperability can be enhanced by using not just ITS 2.0 as standard. In particular, the following standards provide additional opportunities:

OASIS Content Management Interoperability Services (CMIS) to externally associate multiple ITS 2.0 rules files with large sets of documents, and to retrieve those documents regardless of the Content Management System in use
W3C Provenance (PROV) to track which human agents or software agents processed the content; tracking can span multiple agents/components, while allowing individual tracking records to be easily consolidated via linked data approaches

Benefits:

Enables ITS 2.0 annotations to be associated with multiple documents via the CMS without editing individual files. This reduces source content internationalization and document management costs. Furthermore, it reduces annotation errors.
Allows fine-grained tracking and analysis of Language Technology (LT) components, human agents (language workers) and service providers - even across multiple organizations, projects, and heterogeneous process landscapes. This reduces the overhead costs in tracking, monitoring, analyzing and optimizing the localization workflows - especially of the critical elements within them (e.g. MT engines, human terminologists and translators)
Enables tracking of human linguistic judgments and their influence on the output of LT components. Tracking data can be curated for retraining/retuning those LT components (e.g. Statistical Machine Translation or text analysis components)
Tracking information can be mapped to the W3C PROV Ontology (PROV-O) which expresses the PROV Data Model using the OWL2 Web Ontology Language (OWL2), and stored in Resource Description Framework (RDF) triple stores.

2.13.2 Data category usage

Provenance - Tracks MT-based translation and translation revision through a post-editing interface. Tracking is implemented as standoff provenance records in XLIFF files. The post-editing records detail which of the MT outputs was used if multiple MT outputs are offered to the post-editor. The agent's ITS annotations (from translation and translation revision) are mapped to PROV-O triples in the accompanying RDF provenance logs.
Text analysis - Calls text analysis service (e.g. Enrycher) on source HTML file for Named Entity Recognition annotations. These annotations are also mapped into XLIFF files. This annotation results in logging of activities performed on an 'analysed text' entity in the PROV-O triple store.
Terminology - Allows text annotated by Named Entity Recognition, as well as other phrases, to be identified as terms and used to populate a multilingual glossary. If the text analysis annotation returns a DBpedia reference, a query for the label used in the equivalent target language page can be attempted to populate the term target in the glossary. The terminology annotation and the glossary are mapped to XLIFF as well as resulting in a 'term' entity being tracked in the PROV-O provenance logs.
MT Confidence - This is used to annotate - in XLIFF - the assumed quality of output of MT engines. MT Confidence is also tracked for the translation entities generated by MT in the PROV-O logs.
Domain - Mapped from HTML source document to XLIFF, and used to annotate PROV-O entities representing source units, i.e. the source content of translation units.
Translate - Mapped from HTML source document to XLIFF, and used to annotate PROV-O entities representing source units, i.e. the source content of translation units.

Where available, and not already specified by explicit ITS provenance annotation, annotatorsRef was used to derive PROV-O agent details for specific activities, e.g. text analysis and terminology.
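
The mapping of tracking information to PROV-O can be sketched by emitting a handful of triples in N-Triples syntax. The classes and properties used are standard PROV-O terms; which of them the implementation described above actually records is not stated here, the function name is hypothetical, and the example.org IRIs in the usage note are placeholders.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def prov_triples(entity, activity, agent):
    """Return N-Triples lines linking an entity, an activity and an agent.

    All three arguments are IRIs, for example the IRI of a translated
    unit, of the MT run that produced it, and of the MT engine.
    """
    triples = [
        (entity, RDF_TYPE, PROV + "Entity"),
        (activity, RDF_TYPE, PROV + "Activity"),
        (agent, RDF_TYPE, PROV + "Agent"),
        (entity, PROV + "wasGeneratedBy", activity),
        (activity, PROV + "wasAssociatedWith", agent),
        (entity, PROV + "wasAttributedTo", agent),
    ]
    return [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]
```

For example, prov_triples("http://example.org/tu/1/target", "http://example.org/activity/mt-1", "http://example.org/agent/mt-engine") yields six lines that can be loaded into an RDF triple store.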

2.13.3 More Information and Implementation Status/Issues

Details:

2.14 Text Analysis - Named Entity Recognition and Enrichment

2.14.1 Description

Named entities (e.g. names of persons, places, or products) in HTML content are recognized based on the Natural Language Processing (NLP) tool - Enrycher.
The entities are enriched in the following ways:
1. the identity is computed/disambiguated (so that for example London - England, and London - Ontario can be distinguished)
2. a category (e.g. geographic name/place) is assigned
Both the entity recognition and the enrichment generate markup which, among other things, allows tracking of the software agent/NLP tool that was used.
Enriched, disambiguated content facilitates processing for source and target languages (among other reasons, because it provides context to translators).

Benefits:

The ITS 2.0 markup provides the key information about entities, so they can be correctly processed. Example: one may employ specific translations, transliterations, officially mandated translations, or even keep the original.
Content management systems may use disambiguated, enriched content for providing entity-centric browsing and retrieval functionality.

2.14.2 Data category usage

Text Analysis - Mark fragments of content which mention named entities; enrich the content by additional information such as a URI denoting the entity's identity.
Text Analysis - Mark fragments of content with individual word meanings; enrich the content by additional information such as a URI denoting the word's meaning.
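
The markup produced for a recognized entity can be sketched as follows. The attribute names are the ITS 2.0 HTML5 attributes for the Text Analysis data category, and the annotators reference uses the "text-analysis|IRI" form defined by ITS 2.0; the helper name, the span wrapper and the IRIs in the usage note are assumptions, and the code is not taken from Enrycher.

```python
from html import escape


def annotate_entity(text, ident_ref, class_ref=None, annotator=None):
    """Wrap an entity mention in a span with ITS 2.0 Text Analysis markup.

    ident_ref is an IRI for the entity's identity, class_ref an IRI for
    its category, and annotator an IRI identifying the tool that
    produced the annotation.
    """
    attrs = []
    if annotator:
        attrs.append(("its-annotators-ref", f"text-analysis|{annotator}"))
    if class_ref:
        attrs.append(("its-ta-class-ref", class_ref))
    attrs.append(("its-ta-ident-ref", ident_ref))
    attr_text = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs
    )
    return f"<span {attr_text}>{escape(text)}</span>"
```

Distinguishing London in England from London in Ontario then comes down to the IRI passed as ident_ref, for example "http://dbpedia.org/resource/London" versus "http://dbpedia.org/resource/London,_Ontario".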

2.14.3 More Information and Implementation Status/Issues

Running code: Enrycher demo
Source code: Enrycher-ITS2.0 data category implementation

Implementation issues and need for discussion:

Implementation of NLP tools for providing the Domain data category annotations.

2.15 Automated Terminology Annotation

2.15.1 Description

Term candidates in HTML5, XLIFF and plaintext are annotated by humans or software agents (automatic term candidate annotation).
Automatic term candidate annotation can comprise:
1. Term candidate recognition based on existing terminology resources (e.g., term banks, such as EuroTermBank or IATE)
2. Term candidate identification based on unguided terminology extraction systems (e.g., ACCURAT Toolkit or TTC TermSuite)
Content analysis and terminology mark-up are performed by a Web Service API with the following functionality:
1. Support for ITS 2.0 metadata (Terminology, Language Information, Domain, Elements Within Text and Locale Filter data categories);
2. Annotation of the content by the two above-mentioned methods. The API breaks the content down by language and domain and uses terminology annotation services provided by the TaaS platform to identify terms and link them with the TaaS platform.
Visualization capabilities are provided for the annotated terminology, giving human users access to the annotation results.
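
The first method (recognition based on an existing terminology resource) can be sketched roughly as follows. This is a minimal illustration and not the TaaS API: the function name, the in-memory term bank, and the example.org URIs are assumptions. Only the output attributes (its-term and its-term-info-ref) come from the ITS 2.0 Terminology data category in its HTML5 form.

```python
import html
import re


def annotate_terms(text, term_bank):
    """Wrap known terms found in plain text in ITS 2.0 Terminology markup.

    `term_bank` maps a term to a URI with more information about it
    (a hypothetical stand-in for a term bank lookup).
    """
    if not term_bank:
        return html.escape(text)

    # Longest terms first, so multi-word terms win over their parts.
    ordered = sorted(term_bank, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): ref for term, ref in term_bank.items()}

    parts, last = [], 0
    for match in pattern.finditer(text):
        # Copy the untouched text before the match, escaped for HTML.
        parts.append(html.escape(text[last:match.start()]))
        ref = lookup[match.group(0).lower()]
        parts.append(
            '<span its-term="yes" its-term-info-ref="%s">%s</span>'
            % (html.escape(ref, quote=True), html.escape(match.group(0)))
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


# Hypothetical term bank with one multi-word and one single-word term.
TERM_BANK = {
    "machine translation": "http://example.org/terms/mt",
    "translation": "http://example.org/terms/translation",
}

print(annotate_terms("Machine translation helps translation & review.", TERM_BANK))
```

Because matches returned by the regular expression never overlap, the sketch also respects the constraint that terminology mark-up must not overlap. Input that already contains terminology mark-up would need an extra step to preserve it.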

Benefits: The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization, terminology management and many other tasks that may benefit from terminology annotation.

2.15.2 Data category usage

Domain - The domain information is used to split and analyze the content per domain separately. This allows filtering terms in the term bank-based terminology annotation as well as identifying domain-specific content using unguided term extraction systems. The user is asked to provide a default domain for the term bank-based terminology annotation. This user-supplied domain will be overridden with ITS 2.0 domain metadata if present in the content.
Elements Within Text - The information is used to decide which elements are extracted as in-line codes and which as sub-flows.
Language Information - The language information is used to split and analyze the content per language. The user is asked to provide a source (default) language; this default is overridden by ITS 2.0 Language Information metadata if present in the content.
Locale Filter - When used, only the text that applies to the locale given by the user-defined source language is analyzed. The remaining content is ignored.
Terminology - For existing terminology metadata, the mark-up is preserved (terminology mark-up overlaps are not allowed). For new terminology metadata, terms are marked according to the Terminology data category’s rules.
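
The Locale Filter decision described above can be approximated as follows. This is a simplified sketch under stated assumptions: ITS 2.0 specifies extended filtering as defined in RFC 4647, whereas this example only handles the "*" wildcard and prefix matches, and the function name is hypothetical. The default values ("*" and "include") follow the ITS 2.0 defaults for the data category.

```python
def locale_applies(locale, filter_list="*", filter_type="include"):
    """Simplified check of whether content applies to `locale`.

    `filter_list` is a comma-separated list of language ranges, as in the
    its-locale-filter-list attribute; `filter_type` is "include" or "exclude".
    """
    ranges = [r.strip().lower() for r in filter_list.split(",") if r.strip()]
    locale = locale.lower()
    matched = any(
        r == "*" or locale == r or locale.startswith(r + "-")
        for r in ranges
    )
    return matched if filter_type == "include" else not matched


# Content restricted to Canadian locales is skipped for a German source language.
print(locale_applies("de-DE", "en-CA, fr-CA"))
print(locale_applies("en-CA", "en-CA, fr-CA"))
```

A service following this scenario would analyze a fragment only when the check succeeds for the user-defined source language.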

2.15.3 More Information and Implementation Status/Issues

The implementation has reached Milestone 2 (Initial HTML5 term tagging with simple visualization). Implementation of Milestone 3 (Enhanced HTML5 term tagging with full visualization) is ongoing.

Detailed slides: will be made available at the end of May, 2013
Running code: http://taws.tilde.com
Source code: will be made available at the end of May, 2013
General documentation: will be made available at the end of May, 2013

2.16 Universal Preview of ITS Metadata in XML, XLIFF, and HTML Files

2.16.1 Description

XML-based source content, such as XLIFF files, is usually provided to translators or reviewers as reduced and partially transformed text, without information about local or global context and without support for rendering or visualizing the content itself or the metadata embedded in it. This reduces both the quality of the final output and the productivity of human workers.

The usage scenario renders content and metadata so that they can be read easily and interactively as reference material in a browser. The rendering includes visual cues and interaction possibilities, such as colour-coding and pop-ups that display metadata. It is based on auxiliary HTML5+ITS 2.0 files (including JavaScript) that are generated from ITS-annotated source content in any of the supported formats (XML, XLIFF, HTML).
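
The general idea of turning local ITS metadata into visual cues can be sketched as below. This is not the Logrus implementation: the CSS class names are assumptions, a plain title attribute stands in for the JavaScript pop-ups, and only three locally specified attributes (its:translate, its:term, its:locNote) are handled. Global rules and inheritance are ignored.

```python
import html
import xml.etree.ElementTree as ET

# Clark-notation prefix for attributes in the ITS namespace.
ITS_NS = "{http://www.w3.org/2005/11/its}"


def render_preview(element):
    """Return an HTML fragment that shows local ITS metadata as visual cues."""
    classes, notes = [], []
    if element.get(ITS_NS + "translate") == "no":
        classes.append("its-notranslate")
        notes.append("Translate: no")
    if element.get(ITS_NS + "term") == "yes":
        classes.append("its-term")
        notes.append("Terminology: term")
    loc_note = element.get(ITS_NS + "locNote")
    if loc_note:
        classes.append("its-locnote")
        notes.append("Localization Note: " + loc_note)

    # Rebuild mixed content: own text, then each child followed by its tail.
    inner = html.escape(element.text or "")
    for child in element:
        inner += render_preview(child)
        inner += html.escape(child.tail or "")

    if not classes:
        return inner
    return '<span class="%s" title="%s">%s</span>' % (
        " ".join(classes),
        html.escape("; ".join(notes), quote=True),
        inner,
    )


# Hypothetical ITS-annotated XML source.
SAMPLE = (
    '<doc xmlns:its="http://www.w3.org/2005/11/its">'
    'Press <cmd its:translate="no">Ctrl+S</cmd> to save the '
    '<t its:term="yes" its:locNote="Use the approved glossary entry">workspace</t>.'
    '</doc>'
)

print(render_preview(ET.fromstring(SAMPLE)))
```

A stylesheet could then colour-code the generated classes, for example by highlighting non-translatable spans differently from terms.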

2.16.2 Data category usage

All ITS 2.0 data categories

2.16.3 More Information and Implementation Status/Issues

Implementer: Logrus

Implementation status: A prototype will display the Translate, Localization Note, and Terminology data categories at the MultilingualWeb Workshop in March 2013.

2.17 ITS in word processing software

2.17.1 Description

The tool, ITS for Libre Office Writer Extension (ILO), allows a subset of ITS 2.0 to be used in open source word processing software (Libre Office).
Capabilities include:
1. Tagging phrases and terms as “not to translate” (translate)
2. Tagging words as “term” (terminology)
3. Tagging words for a specific locale only (locale filter)
4. Providing additional information for the translator (localization note)
The Libre Office extension and its software packages allow users to
1. Load ITS 2.0 annotated XML files (ODT, XLIFF)
2. Visualize ITS 2.0 metadata in the WYSIWYG editor of Libre Office
3. Edit text related to ITS 2.0 metadata
4. Save and export the text, including ITS 2.0 markup, in the original file format (ODT, XLIFF)

2.17.2 Data category usage

Terminology - One or several words can be marked up as “term”
Translate - Mark content as “to translate” or “not to translate”
Localization Note - Pass a message (information, alert) to human agents (such as translators)
Locale Filter - Limit content to specific locales

2.17.3 More Information and Implementation Status/Issues

ILO uses OKAPI capabilities for XLIFF handling and will be available in April 2013. The use of ILO will be presented at the MultilingualWeb Workshop in March 2013. The results of ILO development will be released to the public under the open source license LGPL V3 (the same license as Libre Office).

2.18 Training for Statistical Machine Translation

2.18.1 Description

Bilingual data annotated with ITS 2.0 is collected in a Content Management System and passed to a Statistical Machine Translation (SMT) system to train the system's language models.
If domain information is supplied for the content, domain-aware modules in the SMT system are trained on the corresponding content.

Benefits:

The ITS 2.0 markup provides key information to drive the reliable extraction of domain-specific content.
MT systems trained on domain-specific data allow for potentially more accurate translation.

2.18.2 Data category usage

Translate - Parts that retain their original form are passed through the MT system as-is.
Language Information - Used to select the appropriate MT language models.
Domain - Domain values direct the selection or training of the appropriate MT language models.
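
How domain and language metadata might drive the preparation of training data can be sketched as follows. This is an illustrative assumption and does not describe the Cocomore CMS or MaTrEx interfaces: the segment dictionaries, their field names, and the "general" fallback domain are hypothetical.

```python
from collections import defaultdict


def build_training_sets(segments, default_domain="general"):
    """Group (source, target) pairs by language pair and domain.

    Each segment is a dict with hypothetical keys: "source", "target",
    "source_lang", "target_lang" and, optionally, "domain".
    """
    corpora = defaultdict(list)
    for seg in segments:
        # Segments without domain metadata fall back to the default domain.
        domain = seg.get("domain") or default_domain
        key = (seg["source_lang"], seg["target_lang"], domain)
        corpora[key].append((seg["source"], seg["target"]))
    return dict(corpora)


# Hypothetical bilingual segments exported from a CMS.
SEGMENTS = [
    {"source": "Check the brake fluid.", "target": "Bremsflüssigkeit prüfen.",
     "source_lang": "en", "target_lang": "de", "domain": "automotive"},
    {"source": "Replace the filter.", "target": "Filter ersetzen.",
     "source_lang": "en", "target_lang": "de", "domain": "automotive"},
    {"source": "Welcome!", "target": "Willkommen!",
     "source_lang": "en", "target_lang": "de"},
]

for key, pairs in build_training_sets(SEGMENTS).items():
    print(key, len(pairs))
```

Each resulting group could then be handed to the corresponding domain-aware module for training.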

2.18.3 More Information and Implementation Status/Issues

Details: http://www.w3.org/International/multilingualweb/lt/wiki/WP5
Tools involved: Cocomore CMS and MaTrEx MT system.
Tool: MaTrEx Domain-Tuning MT Tool. The tool is currently in development.

W3C Working Group Note 02 March 2017

3. Authors and Implementation Contributors