The W3C Internationalization Tag Set 2.0, developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation to integrate automated processing of human language into
core Web technologies. ITS 2.0 bears many
commonalities with its predecessor ITS 1.0 but provides additional concepts that are designed to foster the automated creation
and processing of multilingual Web content.
ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing
based on the XML Localization Interchange
File Format (XLIFF), as well as the Natural Language Processing Interchange Format
(NIF).
The W3C MultilingualWeb-LT Working Group received funding from the European Commission
(project LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies
(Grant Agreement No. 287815). As part
of their activities, members of the Working Group and the LT-Web project created various
implementations that exemplify how
ITS 2.0 supports the integration of automated processing of human language into core Web technologies.
These implementations and the corresponding
usage scenarios are sketched in this document. Each section of the document comprises
the following:
Description - An explanation of the scenario
Data category usage - An explanation of which ITS 2.0 data categories are involved
in the automated processing (for
details on the data categories, see W3C Internationalization Tag Set 2.0)
Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated
processing
Information on Implementation Status/Issues - Links to tools and implementers (detailed
information, running software, source
code etc.)
Status of This Document
This section describes the status of this
document at the time of its publication. Other documents may supersede
this document. A list of current W3C publications and the latest revision
of this technical report can be found in the
W3C technical reports index at
https://www.w3.org/TR/.
This document describes usage scenarios and related implementations for Internationalization Tag Set (ITS)
2.0. ITS 2.0 enhances the foundation to integrate both automated and
manual processing of human language into core Web technologies.
The work described in this document receives funding from the European
Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework
Programme (FP7) in the area of Language Technologies (Grant Agreement No.
287815).
Publication as a Working Group Note does not imply endorsement by the W3C
Membership. This is a draft document and may be updated, replaced or
obsoleted by other documents at any time. It is inappropriate to cite this
document as other than work in progress.
This document was produced by a group
operating under the
W3C Patent Policy.
The group does not expect this document to become a W3C Recommendation.
1. Introduction
The W3C Internationalization Tag Set 2.0, developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation to integrate automated processing of human language into
core Web technologies. ITS 2.0 bears many
commonalities with its predecessor ITS 1.0 but provides additional concepts that are designed to foster the automated creation
and processing of multilingual Web content.
ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing
based on the XML Localization Interchange
File Format (XLIFF), as well as the Natural Language Processing Interchange Format
(NIF).
The W3C MultilingualWeb-LT Working Group received funding from the European Commission
(project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies
(Grant Agreement No. 287815). As part
of their activities, project members and members of the Working Group compiled a list
of usage scenarios that exemplify how
ITS 2.0 integrates automated processing of human language into core Web technologies.
These usage scenarios, and implementations
realized by the Working Group, are sketched in this document. The usage scenarios
comprise information such as the following:
Description - An explanation of the scenario
Data category usage - An explanation of how the individual ITS 2.0 data categories are
involved in the automated processing
(for details on the data categories, see W3C Internationalization Tag Set 2.0)
Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated
processing
Information on Implementation Status/Issues - Links to tools and implementers (detailed
information, running software, source
code etc.)
2. Usage scenarios
2.1 Simple Machine Translation
2.1.1 Description
Translate XML and HTML5 content via a Machine Translation (MT) system such as Microsoft
Translator.
The parts of the content that should be translated are first extracted based on ITS
2.0 markup. The extracted parts are sent
to the MT system. After translation, the translated content is merged back with the
parts that are not translation-relevant
(recreating the original XML or HTML5 format).
Benefits:
The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant
content from both XML
and HTML5.
Processing details such as the need to preserve white space can be passed on.
2.1.2 Data category usage
Translate - Parts that are not translation-relevant are marked (and protected).
Locale Filter - Only the parts that pass the locale filter are extracted. The other
parts are treated as 'do not translate'
content.
Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
Preserve Space - Extracted parts/text units can be annotated with the information
that whitespace is relevant and thus needs
to be preserved.
Domain - Domain values are placed into a property that can be used to select an MT
system and/or to provide domain-related
metadata to an MT system.
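The extract, translate and merge cycle described above can be sketched as follows. This is a minimal illustration only, not the implementation referenced in this scenario: it handles local its:translate markup in XML, ignores global rules and the other data categories, and uses a hypothetical fake_mt function (it simply upper-cases text) in place of a real MT system.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ITS_TRANSLATE = f"{{{ITS_NS}}}translate"
ET.register_namespace("its", ITS_NS)


def fake_mt(text: str) -> str:
    """Hypothetical stand-in for a call to an MT system."""
    return text.upper()


def translate_tree(elem: ET.Element, inherited: bool = True) -> None:
    """Translate text in place, honouring local its:translate markup."""
    flag = elem.get(ITS_TRANSLATE)
    # Without a local flag, the value is inherited from the parent element.
    translatable = inherited if flag is None else (flag == "yes")
    if translatable and elem.text and elem.text.strip():
        elem.text = fake_mt(elem.text)
    for child in elem:
        translate_tree(child, translatable)
        # Tail text follows the child but belongs to the current element.
        if translatable and child.tail and child.tail.strip():
            child.tail = fake_mt(child.tail)


SOURCE = (
    f'<doc xmlns:its="{ITS_NS}">'
    '<p>Press <code its:translate="no">Ctrl+S</code> to save.</p>'
    "</doc>"
)

root = ET.fromstring(SOURCE)
translate_tree(root)
result = ET.tostring(root, encoding="unicode")
```

Because the text is replaced inside the parsed tree, the protected parts and the original structure are preserved when the document is serialized again.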
2.1.3 More Information and Implementation Status/Issues
Only the first occurrence of the Domain value triggers the selection of the engine.
Preserve Space is currently not respected by the engine.
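The engine selection behaviour noted above (only the first Domain value is considered) could look roughly like the following sketch. The engine names and the lookup table are hypothetical, and the function is an assumption about one possible reading of that behaviour, not the code of the implementation.

```python
from typing import Dict, List


def select_engine(domain_values: List[str], engines: Dict[str, str],
                  default: str) -> str:
    """Pick an MT engine from the first non-empty Domain value only."""
    for value in domain_values:
        domain = value.strip().lower()
        if domain:
            # Later Domain values in the document are deliberately ignored.
            return engines.get(domain, default)
    return default


# Hypothetical mapping from domain values to engine identifiers.
ENGINES = {"automotive": "engine-auto", "legal": "engine-legal"}

chosen = select_engine(["Legal", "automotive"], ENGINES, "engine-general")
```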
2.2 Translation Package Creation
2.2.1 Description
Create a Translation Package in OASIS XML Localization Interchange File Format (XLIFF)
from XML or HTML5 content.
Based on its ITS 2.0 metadata, the content goes through a processing pipeline (e.g.
extraction of translation-relevant parts).
At the end of the pipeline, an XLIFF package is stored.
Benefits:
The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant
content from both XML
and HTML5.
Processing details such as the need to preserve white space can be passed on.
Efficient version comparison and leveraging of existing translations is possible.
Information like domain of the content, external references or localization notes,
is made available in the XLIFF package.
Thus, any XLIFF-enabled tool can make use of this information to provide translation
assistance.
Terms in the source content are marked, and thus can be matched against a terminology
database.
Constraints about storage size and allowed characters help to meet physical requirements.
2.2.2 Data category usage
Translate - Parts that are not translation-relevant are marked (and protected).
Locale Filter - Only the parts that pass the locale filter are extracted. The other
parts are treated as 'do not translate'
content.
Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
Preserve Space - The information is mapped to xml:space
Id Value - The value is connected to the name of the extracted text unit.
Domain - Values are placed into a corresponding okp:itsDomain attribute.
Storage Size - The information is placed in native ITS 2.0 markup.
External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef attribute.
Terminology - The information about terminology is placed in a special XLIFF note
element.
Localization Note - The text is placed in an XLIFF note.
Allowed Characters - The pattern is placed in its:allowedCharacters.
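As an illustration of the mappings listed above, the following sketch builds a single XLIFF 1.2 trans-unit. It is an assumption-laden example rather than the actual filter output: the function name make_trans_unit is hypothetical, the Id Value is written to resname, and the okp: extension attributes are left out because their namespace is not given in this document.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ITS_NS = "http://www.w3.org/2005/11/its"
XML_NS = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("xliff", XLIFF_NS)
ET.register_namespace("its", ITS_NS)


def make_trans_unit(unit_id, source, id_value=None, loc_note=None,
                    preserve_space=False, allowed_chars=None):
    """Build one trans-unit carrying a few ITS 2.0 data categories."""
    unit = ET.Element(f"{{{XLIFF_NS}}}trans-unit", {"id": unit_id})
    if id_value:
        unit.set("resname", id_value)              # Id Value
    if preserve_space:
        unit.set(f"{{{XML_NS}}}space", "preserve")  # Preserve Space
    if allowed_chars:
        unit.set(f"{{{ITS_NS}}}allowedCharacters", allowed_chars)
    src = ET.SubElement(unit, f"{{{XLIFF_NS}}}source")
    src.text = source
    if loc_note:
        note = ET.SubElement(unit, f"{{{XLIFF_NS}}}note")  # Localization Note
        note.text = loc_note
    return unit


unit = make_trans_unit("1", "Save  file", id_value="btn.save",
                       loc_note="Button label", preserve_space=True,
                       allowed_chars="[a-zA-Z ]")
xml_text = ET.tostring(unit, encoding="unicode")
```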
2.2.3 More Information and Implementation Status/Issues
ITS to XLIFF and XLIFF to ITS mapping needs to be finalized
2.3 Quality Check
2.3.1 Description
Load XML, HTML5 and XLIFF content for which ITS 2.0 metadata exists into a tool
that performs different kinds of quality
checks (CheckMate, a tool for checking quality).
The XML and HTML5 content is processed based on its ITS 2.0 properties. The constraints
defined with ITS 2.0 are verified
by CheckMate.
The XLIFF content is processed based on its ITS 2.0 properties. The constraints defined
with ITS 2.0 are verified by CheckMate.
Benefits:
The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant
content from both XML
and HTML5.
The ITS 2.0 markup provides key information to drive quality-related checks.
The ITS 2.0 markup allows all different file formats to be handled in the same way
by the quality checking tool.
2.3.2 Data category usage
Translate - Parts that are not translation-relevant are marked (and protected).
Locale Filter - Only the parts that pass the locale filter are extracted. The other
parts are treated as 'do not translate'
content.
Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
Preserve Space - The information is mapped to the preserveSpace field in the extracted
text unit.
Id Value - The ids are used to identify all entries with an issue.
Storage Size - The content is verified against the storage size constraints.
Allowed Characters - The content is verified against the pattern matching allowed
characters.
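The two constraint checks can be sketched as shown below. This is not CheckMate's code; it is a minimal example that assumes the storage size is a byte count in a given encoding and that the Allowed Characters pattern is simple enough to be evaluated by Python's re module (the ITS 2.0 pattern syntax is a restricted regular expression that matches a single character).

```python
import re


def fits_storage_size(text: str, max_bytes: int, encoding: str = "utf-8") -> bool:
    """True if the encoded text does not exceed the storage size."""
    return len(text.encode(encoding)) <= max_bytes


def has_only_allowed_characters(text: str, pattern: str) -> bool:
    """True if every character of the text matches the single-character pattern."""
    return re.fullmatch(f"(?:{pattern})*", text) is not None


size_ok = fits_storage_size("Größe", 7)          # 7 bytes in UTF-8
chars_ok = has_only_allowed_characters("abc", "[a-z]")
```

A translation that passes a character-count check can still violate a byte-based storage limit, which is why the text is encoded before it is measured.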
2.3.3 More Information and Implementation Status/Issues
W3C uses validator.nu as an experimental validator for HTML5. For HTML5 with ITS 2.0 metadata, validator.nu
generates errors,
since "its-" attributes are not valid HTML5.
The software allows validation of HTML5+ITS 2.0 with validator.nu (soon to be deployed
as HTML5+ITS 2.0 validator at the W3C
validation service).
Benefits:
Allows the validation of HTML5 documents which include ITS 2.0 markup.
Detects errors in ITS 2.0 markup for HTML5.
2.5.2 Data category usage
All data categories are covered
2.5.3 More Information and Implementation Status/Issues
2.6 Interchange between Content Management System and Translation Management System
2.6.1 Description
Content is roundtripped between a Content Management System (CMS) and Translation
Management System (TMS).
The content originates in a CMS, and gets exposed/serialized as XHTML + ITS 2.0.
This is sent to a TMS, and processed in
a workflow. Upon completion, the TMS exposes/serializes localized/translated XHTML
+ ITS 2.0 to the CMS.
Benefits:
Facilitated coupling/interoperability between CMS and TMS.
Cost and quality benefits for Language Service Buyer (CMS side) and Language Service
Provider (TMS side).
Language Service Buyer has more control of the localization workflow via ITS 2.0
metadata:
Automatic (e.g. via data category "Translate")
Semiautomatic (e.g. via data category "Domain")
Manual (e.g. via data category "Localization Note")
2.6.2 Data category usage
Translate (global and local usage) - Parts that are not translation-relevant are
marked (and protected).
Localization Note (global and local usage) - Provide additional information for process
managers, translators and reviewers
to facilitate processing.
Domain (global usage) -
Provide additional information for process managers, translators and reviewers to
facilitate processing.
Control workflow dimensions such as selection of dictionaries and translation memories
on the TMS side.
Language Information (local usage)- Control workflow dimensions such as selecting
suitable translators and reviewers. Also
adds context information that helps to decide if a piece of content shall or shall
not be translated.
Allowed Characters (local usage) - The content is verified against the pattern matching
allowed characters to ensure that
on the TMS side, no inappropriate characters become part of the content (e.g. due
to work of a translator).
Storage Size (local usage) - The content is verified against the storage size constraints
to ensure that on the TMS side,
no capacity limitations related to the content are violated (e.g. due to a lengthy
translation).
Provenance (local usage) - Allows tracking of human agents or software agents that
processed the content on the TMS side.
In the case of updates, provenance/tracking information will enable the TMS side to
assign or propose the same human agents
(translators, or reviewers) that participated in the initial processing.
Additional data category (not part of ITS 2.0):
Readiness (global usage) - Provides information to translation process managers (examples:
When was the content ready
to be processed? What is the deadline? What is the priority? Which service/process
variant is relevant?)
2.6.3 More Information and Implementation Status/Issues
Tools (developed by Linguaserve):
Details: modified version of internal localization workflow; pre-production/post-production
CMS XHTML + ITS 2.0 processing
engine
Language Information - Controls workflow dimensions such as setting the source language,
and the target language (via the
lang attribute of the output), it also protects the translation of contents where
the lang attribute is different from the
source language.
Localization Quality Issue - Can be provided for the translated content by the reviser.
Can be utilized for example by MT
developers to improve the MT System.
Implementation of ITS 2.0 translate data category for attributes currently restricted
to global rules
2.8 Using ITS with GNU gettext utilities/PO files
2.8.1 Description
The GNU gettext utilities assist in internationalizing and translating in the context
of UNIX-like Operating Systems. The
file format of the utilities is the GNU gettext portable object (PO) file format.
The implementation - ITS Tool - enables roundtripping between PO files and XML formats
like Mallard.
ITS Tool includes default rules for various formats, and uses them for PO file generation.
ITS Tool is aware of various ITS 2.0 data categories in the PO file generation step.
2.8.2 Data category usage
Preserve Space
Locale Filter
External Resource
Translate
Elements Within Text
Localization Note
Language Information
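To illustrate how a Localization Note can travel into a PO file, the sketch below writes one PO entry with the note as an extracted comment (a line starting with "#."). It is a simplified assumption of what such a generation step might do and is not taken from ITS Tool; the function names are hypothetical.

```python
from typing import Optional


def po_escape(text: str) -> str:
    """Escape backslashes, double quotes and newlines for a PO string."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def po_entry(msgid: str, loc_note: Optional[str] = None) -> str:
    """Return one PO entry; the localization note becomes '#.' comments."""
    lines = []
    if loc_note:
        for note_line in loc_note.splitlines():
            lines.append(f"#. {note_line}")
    lines.append(f'msgid "{po_escape(msgid)}"')
    lines.append('msgstr ""')
    return "\n".join(lines)


entry = po_entry('Say "hi"', "Greeting shown on the start page")
```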
2.8.3 More Information and Implementation Status/Issues
Need to convert built-in rules to new categories, and to deprecate extensions (not
a conformance blocker).
No support for its:param (blocked by lacking support for setting XPath variables
in libxml2 Python bindings; patch pending review).
No support for HTML. libxml2's HTML parser does not correctly handle HTML5.
Need to evaluate other libraries.
2.9 Harnessing ITS Metadata to Improve the Human Review Process
2.9.1 Description
The implementation - the "Reviewer's Workbench" (a desktop application) - reads HTML,
XML and XLIFF files annotated with
ITS 2.0 metadata.
At each segment of the original content, the ITS metadata is made accessible to reviewers.
Reviewers can adapt the access
via user-definable filter/formatting "rules". The metadata allows human reviewers
to make efficient decisions.
During the review of translations, reviewers can add Localization Quality Issue annotations
(which are serialized as ITS
2.0 metadata when the file is saved). Provenance annotations are added in the background.
The combination of captured Localization Quality Issue and Provenance data then becomes
valuable data which can be used for
traditional business intelligence, or semantic web applications.
2.9.2 Data category usage
Provenance
Localization Quality Issue
Benefits:
Increases review effectiveness as reviewers can be informed by metadata.
Harvests data during review.
Facilitates audit and quality correction.
2.9.3 More Information and Implementation Status/Issues
Application development currently at alpha stage.
Awaiting finalization of XLIFF mappings and underlying Okapi filter support.
Application is closed source.
2.10 XLIFF-based Machine Translation
2.10.1 Description
Invoke Machine Translation (MT) from a localization workflow using ITS 2.0 integrated
with the XML Localization Interchange
File Format (XLIFF)
2.10.2 Data category usage
Domain - The domain value can be used by the MT system to improve processing accuracy
Translate - Parts that are not translation-relevant are marked (and protected).
MT Confidence - Assesses the confidence in the quality of the translation generated
by the MT system.
Terminology - Force the MT system to translate specific words or phrases according
to terminological information
Provenance - Allows tracking of human agents (content editors) or software agents
(MT systems) that processed the content.
Benefits:
The use of XLIFF allows an MT system to be integrated seamlessly into automated localization
workflows involving commercial
Translation Management Systems and Computer Assisted Translation (CAT) tools.
The use of XLIFF and ITS 2.0 facilitates the integration of/switch between multiple
MT systems to provide alternative translation
within a single project workflow.
The use of the ITS 2.0 "translate" attribute ensures that content is not altered
by the MT system - especially if that content
is included in a translation project as context for human agents such as translation
post-editors.
The ITS 2.0 "domain" metadata in XLIFF ensures that the most relevant MT engine can
be selected by the MT system.
Combining XLIFF and ITS 2.0 "terminology" metadata forces the MT system to translate
specific words or phrases according
to terminological information.
Integrating ITS 2.0 MT confidence scores into XLIFF target language translation enables
them to be presented to translation
post-editors.
Recording provenance information enables localization managers to compare the performance
of different MT engines or systems,
or different translation post-editors.
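How MT Confidence scores might be surfaced to post-editors can be sketched as follows. The wrapper elements (segments, target) are hypothetical and do not reflect the ITS-to-XLIFF mapping, which this document describes as not yet finalized; only the its:mtConfidence attribute and its value range (0 to 1) come from ITS 2.0.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
MT_CONFIDENCE = f"{{{ITS_NS}}}mtConfidence"

SAMPLE = f"""
<segments xmlns:its="{ITS_NS}">
  <target id="s1" its:mtConfidence="0.91">Bonjour le monde</target>
  <target id="s2" its:mtConfidence="0.42">Fermer la fenetre</target>
  <target id="s3">Texte traduit par un humain</target>
</segments>
"""


def needs_post_editing(root, threshold):
    """Return (id, score) pairs below the threshold, lowest score first."""
    flagged = []
    for target in root.iter("target"):
        raw = target.get(MT_CONFIDENCE)
        if raw is None:
            continue  # no score recorded for this segment
        score = float(raw)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"mtConfidence out of range: {raw}")
        if score < threshold:
            flagged.append((target.get("id"), score))
    return sorted(flagged, key=lambda item: item[1])


queue = needs_post_editing(ET.fromstring(SAMPLE), threshold=0.5)
```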
2.10.3 More Information and Implementation Status/Issues
2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML
2.11.1 Description
SOLAS is a service-based architecture for orchestrating localization workflows
among XLIFF-aware components.
One of the SOLAS components is an extra OKAPI-based Extractor/Merger service that maps
ITS 2.0 categories onto XLIFF 1.2.
SOLAS is also integrated with CMS-L10N and can receive/return XLIFF jobs created by
CMS-L10N.
CMS-L10N (aka LION) is a middleware component based on an RDF triple store
over an arbitrary CMS (tested with Alfresco,
Drupal and Wikimedia).
It can parse the source, including most of the ITS 2.0 metadata, and produce XLIFF 1.2
according to a currently agreed mapping.
After the roundtrip, which is handled via SOLAS, it updates the RDF triple store accordingly.
Benefits:
The use of ITS 2.0 and XLIFF helps to modularize and connect specialized (single-purpose)
components.
SOLAS can handle and combine input from components that are aware of different ITS 2.0 categories or
not aware of ITS at all. SOLAS
orchestration ensures basic ITS compliance even with ITS-unaware components. For example,
if a service provider is unaware of the
translate flag, SOLAS can filter the translation request for that provider, so that
the flag is actually interpreted.
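One way to shield protected content from a provider that is unaware of the translate flag is placeholder masking, sketched below. This is an assumption about how such filtering could work, not a description of SOLAS internals; the placeholder syntax is hypothetical, and a real MT service may alter placeholders, which this sketch does not guard against.

```python
from typing import Callable, List, Tuple

Run = Tuple[str, bool]  # (text, translatable)


def translate_with_protection(runs: List[Run], mt: Callable[[str], str]) -> str:
    """Mask non-translatable runs, call the MT function, then restore them."""
    protected: List[str] = []
    parts: List[str] = []
    for text, translatable in runs:
        if translatable:
            parts.append(text)
        else:
            # Replace protected text with a numbered placeholder.
            parts.append(f"[[{len(protected)}]]")
            protected.append(text)
    translated = mt("".join(parts))
    for index, original in enumerate(protected):
        translated = translated.replace(f"[[{index}]]", original)
    return translated


# str.upper stands in for an MT service that ignores the translate flag.
output = translate_with_protection(
    [("Open ", True), ("LocConnect", False), (" now", True)], str.upper)
```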
2.11.2 Data category usage
Translate
Localization Note
Terminology
Directionality
Language Information
Elements Within Text
Domain
Text Analysis
Locale Filter
Provenance
External Resource
Target Pointer
Id Value
Preserve Space
Localization Quality Issue
Localization Quality Rating
MT Confidence
Allowed Characters
Storage Size
2.11.3 More Information and Implementation Status/Issues
Implementer: TCD/UL, making use of MT components by Moravia and DCU, and JSI Enrycher
as Text Analysis service.
The goal is to freeze the mapping and to produce a best practice note within the lifespan
of the LT-Web project.
The focus is currently on XLIFF 1.2, favoring solutions that can be structurally preserved
in XLIFF 2.0, which is the main
target in the long run.
Although all ITS categories listed above, as encoded by OKAPI or TCD's CMS-LION, are
covered, the demos in mid March show
consumption of mainly the following: translate, term, text analysis, domain, localization
note, provenance, and MT confidence.
The demos involve:
An XLIFF-based source quality assurance tool (LKR by UL)
A Project Manager/Localization Engineer friendly XLIFF Viewer/Editor (LocConnect
by UL)
Integrated Machine Translation Solutions
Moravia's implementation of M4Loc and Moses with ITS 2.0 support
DCU MaTrEx with ITS 2.0 support
Fallback handling of the ITS 2.0 information within SOLAS MT Service Mapper with
services that are not ITS 2.0 aware, such
as Microsoft Bing
Details (M4Loc processing of ITS2.0 enhanced XLIFF files):
Add the ability to apply ITS 2.0 local metadata through Drupal's WYSIWYG editor.
Add the ability to apply global ITS 2.0 metadata at content mode level.
Implemented jQuery plugin to optimize the GUI of the Translation Management tool
(there is a published jQuery download as standalone solution, too).
Benefits:
Support for ITS 2.0 in Drupal facilitates the localization/translation of Drupal-based
content.
The Drupal modules facilitate the roundtripping process from WCMS with systems of
Localization Service Provider (including
automatic content re-integration).
The Drupal modules enable tracking of provenance information (e.g. to identify translation
post-editors).
2.12.2 Data category usage
Translate - Mark content which should not be translated and highlight this marked
content.
Localization Note - Add a note for the translator to improve their understanding of
this content and enable a better translation.
Domain - Set the domain of a text to improve the machine and human translation process.
Provenance - Check which translator/reviser worked on content.
Allowed Characters/Storage Size - Make the translator aware of restrictions for specific
content, such as disallowed characters
or a maximum length of a translation. These constraints are automatically set by Drupal.
Text Analysis - Annotate text with terminology metadata to improve the machine and
human translation process.
2.12.3 More Information and Implementation Status/Issues
Tool: Drupal Module for editing and viewing of ITS 2.0 markup (Cocomore AG)
2.13 Integrating ITS, Content Management Interoperability Services, and W3C Provenance
2.13.1 Description
Localization interoperability can be enhanced by using not just ITS 2.0 as standard.
In particular, the following standards
provide additional opportunities:
OASIS Content Management Interoperability Services (CMIS) to externally associate multiple
ITS 2.0 rules files with large sets
of documents, and to retrieve those documents regardless of the Content Management
System in use
W3C Provenance (PROV) to track which human agents or software agents processed the
content; tracking can span multiple agents/components,
while allowing individual tracking records to be easily consolidated via linked data
approaches
Benefits:
Enables ITS 2.0 annotations to be associated with multiple documents via the CMS
without editing individual files. This reduces
source content internationalization and document management costs. Furthermore, it
reduces annotation errors.
Allows fine-grained tracking and analysis of Language Technology (LT) components,
human agents (language workers) and service
providers - even across multiple organizations, projects, and heterogeneous process
landscapes. This reduces the overhead costs
in tracking, monitoring, analyzing and optimizing the localization workflows - especially
of the critical elements within
them (e.g. MT engines, human terminologists and translators)
Enables tracking of human linguistic judgments and their influence on the output
of LT components. Tracking data can be
curated for retraining/retuning those LT components (e.g. Statistical Machine Translation
or text analysis components)
Tracking information can be mapped to the W3C PROV Ontology (PROV-O) which expresses
the PROV Data Model using the OWL2 Web
Ontology Language (OWL2), and stored in Resource Description Framework (RDF) triple
stores.
2.13.2 Data category usage
Provenance - Tracks MT-based translation and translation revision through a post-editing
interface. Tracking is implemented
as standoff provenance records in XLIFF files. The post-editing records detail which
of the MT outputs was used if multiple
MT outputs are offered to the post-editor. The agent's ITS annotations (from translation
and translation revision) are mapped
to PROV-O triples in the accompanying RDF provenance logs.
Text analysis - Calls text analysis service (e.g. Enrycher) on source HTML file for
Named Entity Recognition annotations.
These annotations are also mapped into XLIFF files. This annotation results in logging
of activities performed on an 'analysed
text' entity in the PROV-O triple store.
Terminology - Allows text annotated by Named Entity Recognition, as well as other
phrases, to be identified as terms and
used to populate a multilingual glossary. If the text analysis annotation returns
a DBpedia reference, a query for the label
used in the equivalent target language page can be attempted to populate the term
target in the glossary. The terminology
annotation and the glossary are mapped to XLIFF as well as resulting in a 'term' entity
being tracked in the PROV-O provenance
logs.
MT Confidence - This is used to annotate - in XLIFF - the assumed quality of output
of MT engines. MT Confidence is also
tracked for the translation entities generated by MT in the PROV-O logs.
Domain - Mapped from HTML source document to XLIFF, and used to annotate PROV-O entities
representing source units, i.e.
the source content of translation units.
Translate - Mapped from HTML source document to XLIFF, and used to annotate PROV-O
entities representing source units, i.e.
the source content of translation units.
Where available, and not already specified by explicit ITS provenance annotation,
annotatorsRef was used to derive PROV-O
agent details for specific activities, e.g. text analysis and terminology.
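The mapping of a provenance record to PROV-O can be sketched as a handful of N-Triples statements. The PROV-O terms used (prov:Entity, prov:Activity, prov:Agent, prov:wasGeneratedBy, prov:wasAssociatedWith, prov:used, prov:wasDerivedFrom) are part of the W3C vocabulary; the example.org URIs and the function name are hypothetical, and the selection of properties is an assumption rather than the mapping used by the implementation.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def prov_triples(entity, activity, agent, source=None):
    """Return N-Triples lines describing who produced an entity and how."""
    def triple(s, p, o):
        return f"<{s}> <{p}> <{o}> ."

    triples = [
        triple(entity, RDF_TYPE, PROV + "Entity"),
        triple(activity, RDF_TYPE, PROV + "Activity"),
        triple(agent, RDF_TYPE, PROV + "Agent"),
        triple(entity, PROV + "wasGeneratedBy", activity),
        triple(activity, PROV + "wasAssociatedWith", agent),
    ]
    if source:
        # Link the translated unit back to the source unit it came from.
        triples.append(triple(activity, PROV + "used", source))
        triples.append(triple(entity, PROV + "wasDerivedFrom", source))
    return triples


log = prov_triples(
    entity="http://example.org/unit/1/target",
    activity="http://example.org/activity/mt-run-1",
    agent="http://example.org/agent/mt-engine-a",
    source="http://example.org/unit/1/source",
)
```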
2.13.3 More Information and Implementation Status/Issues
2.14 Text Analysis - Named Entity Recognition and Enrichment
2.14.1 Description
Named entities (e.g. names of persons, places, or products) in HTML content are recognized
based on the Natural Language
Processing (NLP) tool - Enrycher.
The entities are enriched in the following ways:
the identity is computed/disambiguated (so that for example London - England, and
London - Ontario can be distinguished)
a category (e.g. geographic name/place) is assigned
Both the entity recognition and the enrichment generate markup which, among other things,
allows tracking of the software agent/NLP
tool that was used.
Enriched, disambiguated content facilitates processing for source and target languages
(among other things, because it provides
context to translators).
Benefits:
The ITS 2.0 markup provides the key information about entities, so they can be correctly
processed. Example: one may employ
specific translations, transliterations, officially mandated translations, or even
keep the original.
Content management systems may use disambiguated, enriched content for providing
entity-centric browsing and retrieval functionality.
2.14.2 Data category usage
Text Analysis - Mark fragments of content which mention named entities; enrich the
content by additional information such
as a URI denoting the entity's identity.
Text Analysis - Mark fragments of content with individual word meanings; enrich the
content by additional information such
as a URI denoting the word's meaning.
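In HTML5, Text Analysis annotations are expressed with its-ta-* attributes. The sketch below wraps a recognized entity in a span carrying its-ta-ident-ref and, optionally, its-ta-class-ref. The helper name and the example URIs are illustrative assumptions and are not output of Enrycher.

```python
from html import escape


def annotate_entity(text, ident_ref, class_ref=None):
    """Wrap an entity mention in a span with ITS 2.0 Text Analysis attributes."""
    attrs = [f'its-ta-ident-ref="{escape(ident_ref, quote=True)}"']
    if class_ref:
        attrs.append(f'its-ta-class-ref="{escape(class_ref, quote=True)}"')
    return "<span " + " ".join(attrs) + ">" + escape(text) + "</span>"


# Two mentions with the same surface form but different identities.
england = annotate_entity("London", "http://dbpedia.org/resource/London")
ontario = annotate_entity("London", "http://dbpedia.org/resource/London,_Ontario")
```

The identity URI is what lets downstream tools distinguish the two mentions even though the visible text is identical.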
2.14.3 More Information and Implementation Status/Issues
Implementation of NLP tools for providing the Domain data category annotations.
2.15 Automated Terminology Annotation
2.15.1 Description
Term candidates in HTML5, XLIFF and plaintext are annotated by humans or software
agents (automatic term candidate annotation).
Automatic term candidate annotation can comprise:
Term candidate recognition based on existing terminology resources (e.g., term banks,
such as EuroTermBank or IATE)
Term candidate identification based on unguided terminology extraction systems (e.g.,
ACCURAT Toolkit or TTC TermSuite)
Content analysis and terminology mark-up are performed by a Web Service API with
the following functionality:
Support for ITS 2.0 metadata (Terminology, Language Information, Domain, Elements
Within Text and Locale Filter data categories);
Annotation of the content by the two above-mentioned methods. The API breaks down
the content in Language and Domain dimensions
and uses terminology annotation services provided by the TaaS platform in order to
identify terms and link them with the TaaS
platform.
Visualization capabilities are provided for the annotated terminology allowing human
users access to the annotation results.
Benefits:
The Web Service API can be integrated in automated language processing workflows,
for instance, machine translation, localization,
terminology management and many other tasks that may benefit from terminology annotation.
2.15.2 Data category usage
Domain - The domain information is used to split and analyze the content per domain
separately. This allows filtering terms
in the term bank-based terminology annotation as well as identifying domain-specific
content using unguided term extraction
systems. The user is asked to provide a default domain for the term bank-based terminology
annotation. This user-supplied
domain will be overridden with ITS 2.0 domain metadata if present in the content.
Element Within Text - The information is used to decide which elements are extracted
as in-line codes and sub-flows.
Language Information - The language information is used to split and analyze the
content per language. The user will be asked
to provide a source (default) language, however, the default language will be overridden
with ITS 2.0 Language Information
metadata if present in the content.
Locale Filter - When used, only the text in the locale specified by the user-defined
source language is analyzed. The
remaining content is ignored.
Terminology - For existing terminology metadata, the mark-up is preserved (terminology
mark-up overlaps are not allowed).
For new terminology metadata, terms are marked according to the Terminology data category’s
rules.
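Term bank-based annotation without overlapping mark-up can be sketched as a longest-match-first scan over plain text. This is an illustrative assumption, not the TaaS service: the term list is a hypothetical in-memory list, only plain-text input is handled, and existing mark-up is not considered.

```python
import re
from html import escape


def annotate_terms(text, terms):
    """Mark known terms with its-term="yes"; longer terms win, no overlaps."""
    terms = [t for t in terms if t]
    if not terms:
        return escape(text)
    # Longest terms first so that the alternation prefers the longest match.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    out, pos = [], 0
    for match in pattern.finditer(text):  # matches never overlap
        out.append(escape(text[pos:match.start()]))
        out.append('<span its-term="yes">' + escape(match.group(0)) + "</span>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


marked = annotate_terms(
    "A translation memory stores translation units.",
    ["translation", "translation memory"],
)
```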
2.15.3 More Information and Implementation Status/Issues
The implementation has reached Milestone 2 (Initial HTML5 term tagging with simple
visualization). The implementation for
the Milestone 3 (Enhanced HTML5 term tagging with full visualization) is ongoing.
Detailed slides: will be made available at the end of May, 2013
Source code: will be made available at the end of May, 2013
General documentation: will be made available at the end of May, 2013
2.16 Universal Preview of ITS Metadata in XML, XLIFF, and HTML Files
2.16.1 Description
XML-based source content such as XLIFF files is usually provided to translators or
reviewers as reduced and partially transformed
text without any information about local or global context or support for rendering/visualization
of content itself or metadata
embedded in the content. In sum this has negative effects on quality of final output
and productivity of human workers.
The usage scenario allows rendering of content and metadata so that they can be read easily and
interactively as reference material in
a browser. The rendering includes special visual cues and interaction possibilities
(such as colour-coding and pop-ups for
metadata to be displayed). It is based on auxiliary files in HTML5+ITS 2.0 (including
JavaScript) that are generated from
ITS-annotated source content in any of the supported formats (XML, XLIFF, HTML).
2.16.2 Data category usage
All ITS 2.0 data categories
2.16.3 More Information and Implementation Status/Issues
Implementer: Logrus
Implementation status: Prototype will display Translate, Localization Note, and Terminology
data categories at the MultilingualWeb
Workshop March 2013.
2.17 ITS in word processing software
2.17.1 Description
The tool - ITS for Libre Office Writer Extension (ILO) - allows use of a subset of
ITS 2.0 in open source word processing
software (Libre Office).
Capabilities include:
Tagging phrases and terms as “not to translate” (translate)
Tagging words as “term” (terminology)
Tagging words for a specific locale only (locale filter)
Providing additional information for the translator (localization note)
The Libre Office extension and its software packages allow users to:
Load ITS 2.0 annotated XML files (ODT, XLIFF)
Visualize ITS 2.0 metadata in the WYSIWYG editor of Libre Office
Edit text related to ITS 2.0 metadata
Save and export the text, including ITS 2.0 markup, into the original file format
(ODT, XLIFF)
2.17.2 Data category usage
Terminology - One or several words can be marked up as “term”
Translate – Mark content as “to translate” or “not to translate”
Localization Note – Pass a message (information, alert) to human agents (such as
translators)
Locale Filter – Limit content to specific locales
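Whether a piece of content applies to a given locale can be decided by matching the locale against the Locale Filter list. The sketch below follows the extended filtering algorithm of BCP 47 (RFC 4647), which ITS 2.0 references for this data category; the function names are hypothetical and the code is not taken from ILO.

```python
def matches_range(tag: str, lang_range: str) -> bool:
    """Extended filtering (RFC 4647, section 3.3.2) of a tag against a range."""
    tag_parts = tag.lower().split("-")
    range_parts = lang_range.lower().split("-")
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False
    ti = 1
    for sub in range_parts[1:]:
        if sub == "*":
            continue  # a wildcard matches any sequence of subtags
        while True:
            if ti >= len(tag_parts):
                return False
            if tag_parts[ti] == sub:
                ti += 1
                break
            if len(tag_parts[ti]) == 1:
                return False  # singletons must not be skipped
            ti += 1
    return True


def passes_locale_filter(locale, filter_list, filter_type="include"):
    """Apply a comma-separated filter list of type 'include' or 'exclude'."""
    ranges = [r.strip() for r in filter_list.split(",") if r.strip()]
    matched = any(matches_range(locale, r) for r in ranges)
    return matched if filter_type == "include" else not matched


keep_for_canada = passes_locale_filter("fr-CA", "fr, de")
```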
2.17.3 More Information and Implementation Status/Issues
ILO uses OKAPI capabilities for XLIFF handling and will be available in April 2013.
The use of ILO will be presented at the MultilingualWeb Workshop March 2013. The results of ILO development will be released to the public under the
open license
LGPL V3 (same as Libre Office).
2.18 Training for Statistical Machine Translation
2.18.1 Description
ITS 2.0 bilingual data is collected in a Content Management System, and passed to
a Statistical Machine Translation (SMT)
system for training the system's language models.
If domain information is supplied for the content, domain-aware modules in the SMT
system are trained on the corresponding
content.
Benefits:
The ITS 2.0 markup provides key information to drive the reliable extraction of domain-specific
content.
MT systems trained on domain-specific data allow for potentially more accurate translation.
2.18.2 Data category usage
Translate - Parts that retain their original form are passed through the MT as-is.
Language Information - Used to select the appropriate MT language models.
Domain - Domain values direct the selection of/training of the appropriate MT language
models.
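Routing bilingual data to domain-specific training sets can be sketched as a simple grouping step. The tuple layout, the default domain name and the function name are assumptions made for this example; how the MaTrEx tooling actually consumes the Domain value is not described here.

```python
from collections import defaultdict


def group_by_domain(segments, default_domain="general"):
    """Group (source, target, domain) triples into per-domain corpora."""
    corpora = defaultdict(list)
    for source, target, domain in segments:
        # Segments without a Domain value fall back to the default corpus.
        key = (domain or default_domain).strip().lower()
        corpora[key].append((source, target))
    return dict(corpora)


corpora = group_by_domain([
    ("Brake pad", "Bremsbelag", "Automotive"),
    ("Contract", "Vertrag", "legal"),
    ("Hello", "Hallo", None),
    ("Gearbox", "Getriebe", "automotive"),
])
```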
2.18.3 More Information and Implementation Status/Issues
Tools involved: Cocomore CMS and MaTrEx MT system.
Tool: MaTrEx Domain-Tuning MT Tool. The Tool is currently in development.
3. Authors and Implementation Contributors
Renat Bikmatov (Logrus),
David Filip (University of Limerick),
Leroy Finn (Trinity College Dublin),
Karl Fritsche (Cocomore AG),
Serge Gladkoff (Logrus),
Declan Groves (Centre for Next Generation Localisation (CNGL), Dublin City University),
Milan Karasek (Moravia),
Jirka Kosek (University of Economics, Prague),
Kevin Lew (Spartan Software),
Dave Lewis (Trinity College Dublin),
Fredrik Liden (ENLASO Corporation),
Shaun McCane ((public) Invited expert),
Sean Mooney (University of Limerick),
Pablo Nieto Caride (Linguaserve),
Pēteris Ņikiforovs (Tilde),
David O'Carrol (University of Limerick),
Philip O'Duffy (University of Limerick),
Mauricio del Olmo (Linguaserve),
Mārcis Pinnis (Tilde),
Phil Ritchie (VistaTEC),
Nieves Sande (German Research Center for Artificial Intelligence (DFKI) GmbH),
Felix Sasaki (W3C Fellow),
Yves Savourel (ENLASO Corporation),
Sebastian Sklarß (]init[ Europe),
Ankit Srivastava (Centre for Next Generation Localisation (CNGL), Dublin City University),
Tadej Štajner (Jozef Stefan Institute),
Chase Tingley (Spartan Software),
Asanka Wasala (University of Limerick),
Clemens Weins (Cocomore AG).