This document describes usage scenarios and related implementations for
Internationalization Tag Set
(ITS) 2.0. ITS 2.0 enhances the foundation to integrate both
automated and manual processing of human language into core Web
technologies.
Status
of This Document
This section describes the status of this document at the time of
its publication. Other documents may supersede this document. A list
of current W3C
publications and the latest revision of this technical report can be
found in the W3C
technical reports index at https://www.w3.org/TR/.
The work described in this document received funding from the European
Commission (project MultilingualWeb-LT
(LT-Web) ) through the Seventh Framework Programme (FP7) in the
area of Language Technologies (Grant Agreement No. 287815).
Note
Sending comments on this
document
If you wish to make comments regarding this document, please raise
them as GitHub issues. Only send comments by email if you are unable to
raise issues on GitHub (see links below). All comments are welcome.
To make it easier to track comments, please raise separate issues
or emails for each comment, and point to the section you are
commenting on using a URL for the dated version of the
document.
Publication as a Working Group Note does not imply endorsement by the
W3C Membership. This is a
draft document and may be updated, replaced or obsoleted by other
documents at any time. It is inappropriate to cite this document as
other than work in progress.
Note
The working group reached consensus to stop work on this
specification. It is being published as a Working Group Note for
archival reasons. In comparison to the previous working draft, this
document only contains editorial changes.
The W3C
Internationalization Tag Set 2.0 - developed by the W3C
MultilingualWeb-LT Working Group - enhances the foundation to
integrate automated processing of human language into core Web
technologies. ITS 2.0 bears many commonalities with its predecessor ITS
1.0 but provides additional concepts that are designed to foster
the automated creation and processing of multilingual Web content. ITS
2.0 focuses on HTML, XML-based formats in general, and can leverage
processing based on the XML Localization Interchange File Format
(XLIFF), as well as the Natural Language Processing Interchange Format
(NIF).
The W3C
MultilingualWeb-LT Working Group received funding from the European
Commission (project MultilingualWeb-LT
(LT-Web)) through the Seventh Framework Programme (FP7) in the
area of Language Technologies (Grant Agreement No. 287815). As part of
their activities, project members and members of the Working Group
compiled a list of usage scenarios that exemplify how ITS 2.0 integrates
automated processing of human language into core Web technologies. These
usage scenarios - and implementations realized by the Working Group -
are sketched in this document. The usage scenarios comprise information
such as the following:
Description - An explanation of the scenario
Data category usage - An explanation of how the individual ITS 2.0
data categories are involved in the automated processing (for details
on the data categories, see W3C
Internationalization Tag Set 2.0)
Benefits - Reasons why the ITS 2.0 data categories enable or
enhance the automated processing
Information on Implementation Status/Issues - Links to tools and
implementers (detailed information, running software, source code
etc.)
2.
Usage scenarios
2.1
Simple Machine Translation
2.1.1
Description
Translate XML and HTML5 content via a Machine Translation (MT)
system such as Microsoft Translator.
The parts of the content that should be translated are first
extracted based on ITS 2.0 markup. The extracted parts are sent to
the MT system. After translation, the translated content is merged
back with the parts that are not translation-relevant (recreating
the original XML or HTML5 format).
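As an illustration, consider a minimal, hypothetical HTML5 input; the element names, attribute values and meta field are illustrative, not prescribed by any particular implementation:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Engine manual</title>
    <!-- A global domainRule could point at this value to select an MT engine -->
    <meta name="dcterms.subject" content="automotive">
  </head>
  <body>
    <!-- The native HTML translate attribute realizes the ITS Translate data category -->
    <p>Press <code translate="no">START</code> to start the engine.</p>
    <!-- Extracted only when the target locale matches the Locale Filter -->
    <p its-locale-filter-list="en-CA, fr-CA">This notice applies in Canada only.</p>
  </body>
</html>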
Benefits:
The ITS 2.0 markup provides key information to drive the
reliable extraction of translation-relevant content from both XML
and HTML5.
Processing details such as the need to preserve white space can
be passed on.
2.1.2
Data category usage
Translate - Parts that are not translation-relevant are marked
(and protected).
Locale Filter - Only the parts that pass the locale filter are
extracted. The other parts are treated as 'do not translate'
content.
Elements Within Text - Elements are either extracted as in-line
codes or as sub-flows.
Preserve Space - Extracted parts/text units can be annotated
with the information that whitespace is relevant and thus needs to
be preserved.
Domain - Domain values are placed into a property that can be
used to select an MT system and/or to provide domain-related
metadata to an MT system. (A global rules file covering these data
categories is sketched after this list.)
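A minimal sketch of such a global rules file, assuming an HTML/XHTML input bound to the h prefix; the selectors and values are illustrative:
<its:rules version="2.0"
    xmlns:its="http://www.w3.org/2005/11/its"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <!-- Translate: protect code samples -->
  <its:translateRule selector="//h:code" translate="no"/>
  <!-- Locale Filter: extract legal notices only for Canadian locales -->
  <its:localeFilterRule selector="//h:p[@class='legal']"
      localeFilterList="en-CA, fr-CA"/>
  <!-- Elements Within Text: treat emphasis as in-line code -->
  <its:withinTextRule selector="//h:em" withinText="yes"/>
  <!-- Preserve Space: keep whitespace in preformatted blocks -->
  <its:preserveSpaceRule selector="//h:pre" space="preserve"/>
  <!-- Domain: read the domain value from a meta element -->
  <its:domainRule selector="/h:html/h:body"
      domainPointer="/h:html/h:head/h:meta[@name='dcterms.subject']/@content"/>
</its:rules>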
2.1.3
More Information and Implementation Status/Issues
Only the first occurrence of the Domain value triggers the
selection of the engine.
Preserve Space is currently not respected by the engine.
2.2
Translation Package Creation
2.2.1
Description
Create a Translation Package in OASIS XML Localization
Interchange File Format (XLIFF) from XML or HTML5 content.
Based on its ITS 2.0 metadata, the content goes through a
processing pipeline (e.g. extraction of translation-relevant
parts). At the end of the pipeline, an XLIFF package is stored.
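A minimal sketch of what the resulting XLIFF 1.2 package might look like; identifiers and values are illustrative, and the exact ITS-to-XLIFF mapping was still being finalized (see 2.2.3):
<xliff version="1.2"
    xmlns="urn:oasis:names:tc:xliff:document:1.2"
    xmlns:its="http://www.w3.org/2005/11/its">
  <file original="product.html" source-language="en" target-language="fr"
      datatype="html">
    <body>
      <!-- Preserve Space mapped to xml:space; constraints carried in native ITS markup -->
      <trans-unit id="t1" xml:space="preserve"
          its:storageSize="25" its:allowedCharacters="[a-zA-Z0-9 ]">
        <source>Start the engine.</source>
        <target></target>
        <!-- Localization Note carried as an XLIFF note -->
        <note>Keep the translation short; it appears on a hardware button.</note>
      </trans-unit>
    </body>
  </file>
</xliff>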
Benefits:
The ITS 2.0 markup provides key information to drive the
reliable extraction of translation-relevant content from both XML
and HTML5.
Processing details such as the need to preserve white space can
be passed on.
Efficient version comparison and leveraging of existing
translations is possible.
Information such as the domain of the content, external references, or
localization notes is made available in the XLIFF package. Thus,
any XLIFF-enabled tool can make use of this information to provide
translation assistance.
Terms in the source content are marked, and thus can be matched
against a terminology database.
Constraints about storage size and allowed characters help to
meet physical requirements.
2.2.2
Data category usage
Translate - Parts that are not translation-relevant are marked
(and protected).
Locale Filter - Only the parts that pass the locale filter are
extracted. The other parts are treated as 'do not translate'
content.
Elements Within Text - Elements are either extracted as in-line
codes or as sub-flows.
Preserve Space - The information is mapped to xml:space.
Id Value - The value is connected to the name of the extracted
text unit.
Domain - Values are placed into a corresponding okp:itsDomain
attribute.
Storage Size - The information is placed in native ITS 2.0
markup.
External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef
attribute.
Terminology - The information about terminology is placed in a
special XLIFF note element.
Localization Note - The text is placed in an XLIFF note.
Allowed Characters - The pattern is placed in its:allowedCharacters.
2.2.3
More Information and Implementation Status/Issues
The ITS-to-XLIFF and XLIFF-to-ITS mapping needs to be finalized.
2.3
Quality Check
2.3.1
Description
Load XML, HTML5, and XLIFF content for which ITS 2.0 metadata
exists into a tool that performs different kinds of quality checks
(CheckMate, a quality checking tool).
The XML and HTML5 content is processed based on its ITS 2.0
properties. The constraints defined with ITS 2.0 are verified by
CheckMate.
The XLIFF content is processed based on its ITS 2.0 properties.
The constraints defined with ITS 2.0 are verified by CheckMate.
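For example, a source entry might carry ITS 2.0 constraints of the following kind (a hypothetical XML fragment; element names and values are illustrative), which the checker then verifies against the translated text:
<!-- The id can be exposed via an idValueRule so that issues can be traced back to this entry -->
<msg xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0"
    id="btn_start"
    its:storageSize="12"
    its:allowedCharacters="[A-Z ]">START ENGINE</msg>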
Benefits:
The ITS 2.0 markup provides key information to drive the
reliable extraction of translation-relevant content from both XML
and HTML5.
The ITS 2.0 markup provides key information to drive
quality-related checks.
The ITS 2.0 markup allows all different file formats to be
handled in the same way by the quality checking tool.
2.3.2
Data category usage
Translate - Parts that are not translation-relevant are marked
(and protected).
Locale Filter - Only the parts that pass the locale filter are
extracted. The other parts are treated as 'do not translate'
content.
Elements Within Text - Elements are either extracted as in-line
codes or as sub-flows.
Preserve Space - The information is mapped to the preserveSpace
field in the extracted text unit.
Id Value - The ids are used to identify all entries with an
issue.
Storage Size - The content is verified against the storage size
constraints.
Allowed Characters - The content is verified against the
pattern matching allowed characters.
2.3.3
More Information and Implementation Status/Issues
W3C uses validator.nu
as an experimental validator for HTML5. For HTML5 with ITS 2.0
metadata, validator.nu generates errors, since "its-" attributes
are not valid HTML5.
The software allows validation of HTML5+ITS 2.0 with
validator.nu (soon to be deployed as an HTML5+ITS 2.0 validator at the
W3C validation service).
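A small, hypothetical HTML5 document illustrates the difference: the native translate attribute passes plain HTML5 validation, while the its-term attribute is only accepted once the validator knows about ITS 2.0:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>ITS attribute check</title>
  </head>
  <body>
    <!-- Native HTML5 attribute: valid for any HTML5 validator -->
    <p>Press <span translate="no">START</span> to begin.</p>
    <!-- ITS 2.0 attribute: flagged by plain HTML5 validation,
         accepted by the HTML5+ITS 2.0 validator -->
    <p>The <span its-term="yes">crankshaft</span> turns the flywheel.</p>
  </body>
</html>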
Benefits:
Allows the validation of HTML5 documents which include ITS 2.0
markup.
Detects errors in ITS 2.0 markup for HTML5.
2.5.2
Data category usage
All data categories are covered
2.5.3
More Information and Implementation Status/Issues
2.6
Interchange between Content Management System and
Translation Management System
2.6.1
Description
Content is roundtripped between a Content Management System
(CMS) and a Translation Management System (TMS).
The content originates in a CMS, and gets exposed/serialized as
XHTML + ITS 2.0. This is sent to a TMS, and processed in a
workflow. Upon completion, the TMS exposes/serializes
localized/translated XHTML + ITS 2.0 to the CMS.
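A minimal sketch of the XHTML + ITS 2.0 that the CMS might expose; the attribute values are illustrative and the actual serialization is implementation-specific:
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:its="http://www.w3.org/2005/11/its"
    its:version="2.0" xml:lang="en">
  <body>
    <!-- Translate: protect the brand name -->
    <p its:translate="no">ACME Corp.</p>
    <!-- Localization Note: guidance for translators and reviewers -->
    <p its:locNote="'Home' refers to the start page of the site."
        its:locNoteType="description">Back to Home.</p>
  </body>
</html>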
Benefits:
Facilitated coupling/interoperability between CMS and TMS.
Cost and quality benefits for the Language Service Buyer (CMS side)
and the Language Service Provider (TMS side).
The Language Service Buyer has more control of the localization
workflow via ITS 2.0 metadata:
Automatic (e.g. via data category "Translate")
Semiautomatic (e.g. via data category "Domain")
Manual (e.g. via data category "Localization Note")
2.6.2
Data category usage
Translate (global and local usage) - Parts that are not
translation-relevant are marked (and protected).
Localization Note (global and local usage) - Provide additional
information for process managers, translators and reviewers to
facilitate processing.
Domain (global usage) -
Provide additional information for process managers,
translators and reviewers to facilitate processing.
Control workflow dimensions such as selection of
dictionaries and translation memories on the TMS side.
Language Information (local usage) - Control workflow dimensions
such as selecting suitable translators and reviewers. This also adds
context information that helps to decide whether a piece of content
shall or shall not be translated.
Allowed Characters (local usage) - The content is verified
against the pattern matching allowed characters to ensure that on
the TMS side, no inappropriate characters become part of the
content (e.g. due to work of a translator).
Storage Size (local usage) - The content is verified against
the storage size constraints to ensure that on the TMS side, no
capacity limitations related to the content are violated (e.g. due
to a lengthy translation).
Provenance (local usage) - Allows tracking of human agents or
software agents that processed the content on the TMS side. In the
case of updates, provenance/tracking information will enable the
TMS side to assign or propose the same human agents (translators,
or reviewers) that participated in the initial processing.
Additional data category (not part of ITS 2.0):
Readiness (global usage) - Provides information to translation
process managers (examples: When was the content ready to be
processed? What is the deadline? What is the priority? Which
service/process variant is relevant?)
2.6.3
More Information and Implementation Status/Issues
Tools (developed by Linguaserve):
Details: a modified version of the internal localization workflow; a
pre-production/post-production CMS XHTML + ITS 2.0 processing
engine.
Language Information - Controls workflow dimensions such as
setting the source language and the target language (via the lang
attribute of the output). It also protects from translation any
content whose lang attribute differs from the source
language.
Localization Quality Issue - Can be provided for the translated
content by the reviser. Can be utilized, for example, by MT
developers to improve the MT system.
Implementation of the ITS 2.0 Translate data category for
attributes is currently restricted to global rules.
2.8
Using ITS with GNU gettext utilities/PO files
2.8.1
Description
The GNU gettext utilities assist in internationalization and
translation in the context of UNIX-like operating systems. The
file format used by the utilities is the GNU gettext portable object
(PO) format.
The implementation - ITS Tool - enables roundtripping between
PO files and XML formats like Mallard.
ITS Tool includes default rules for various formats, and uses
them for PO file generation (a rules file of this kind is sketched
after this list).
ITS Tool is aware of various ITS 2.0 data categories in the PO
file generation step.
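A sketch of what such a default rules file could look like, assuming a hypothetical documentation vocabulary; the element names and the note attribute are invented for illustration:
<its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its">
  <!-- Do not extract machine-readable identifiers -->
  <its:translateRule selector="//package/@name" translate="no"/>
  <!-- Treat inline markup as part of the surrounding message -->
  <its:withinTextRule selector="//emphasis | //code" withinText="yes"/>
  <!-- Keep whitespace in screen listings -->
  <its:preserveSpaceRule selector="//screen" space="preserve"/>
  <!-- Pass author hints to translators -->
  <its:locNoteRule selector="//step" locNoteType="description"
      locNotePointer="@note"/>
</its:rules>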
2.8.2
Data category usage
Preserve Space
Locale Filter
External Resource
Translate
Elements Within Text
Localization Note
Language Information
2.8.3
More Information and Implementation Status/Issues
Need to convert built-in rules to new categories, and to
deprecate extensions (not a conformance blocker).
No support for its:param (blocked by lacking support for
setting XPath variables in libxml2 Python bindings; patch
pending review).
No support for HTML. libxml2's HTML parser does not
correctly handle HTML5. Need to evaluate other libraries.
2.9
Harnessing ITS Metadata to Improve the Human Review
Process
2.9.1
Description
The implementation - the "Reviewer's Workbench" (a desktop
application) - reads HTML, XML and XLIFF files annotated with ITS
2.0 metadata.
At each segment of the original content, the ITS metadata is
made accessible to reviewers. Reviewers can adapt the access via
user-definable filter/formatting "rules". The metadata allows
human reviewers to make efficient decisions.
During the review of translations, reviewers can add
Localization Quality Issue annotations (which are serialized as
ITS 2.0 metadata when the file is saved). Provenance annotations
are added in the background.
The combination of captured Localization Quality Issue and
Provenance data then becomes valuable data which can be used for
traditional business intelligence, or semantic web applications.
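A minimal sketch of how a captured issue might be serialized as ITS 2.0 local markup in HTML; the issue type, comment, and severity values are illustrative, and the workbench's actual serialization, in particular for XLIFF, may differ:
<p>Check the
  <span its-loc-quality-issue-type="mistranslation"
      its-loc-quality-issue-comment="Source term is 'brake fluid'."
      its-loc-quality-issue-severity="80">break fluid</span>
  level every week.</p>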
2.9.2
Data category usage
Provenance
Localization Quality Issue
Benefits:
Increases review effectiveness as reviewers can be informed by
metadata.
Harvests data during review.
Facilitates audit and quality correction.
2.9.3
More Information and Implementation Status/Issues
Application development currently at alpha stage.
Awaiting finalization of XLIFF mappings and underlying Okapi
filter support.
Application is closed source.
2.10
XLIFF-based Machine Translation
2.10.1
Description
Invoke Machine Translation (MT) from a localization workflow
using ITS 2.0 integrated with the XML Localization Interchange
File Format (XLIFF).
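One possible shape of such an annotated unit is sketched below; since the ITS-XLIFF mapping was still being finalized (see 2.2.3), the placement of the ITS attributes is illustrative:
<trans-unit id="42"
    xmlns="urn:oasis:names:tc:xliff:document:1.2"
    xmlns:its="http://www.w3.org/2005/11/its"
    its:annotatorsRef="mt-confidence|http://example.com/mt-engine">
  <source xml:lang="en">The engine is running.</source>
  <!-- MT Confidence: the engine's self-reported score for this target -->
  <target xml:lang="fr" its:mtConfidence="0.89">Le moteur tourne.</target>
</trans-unit>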
2.10.2
Data category usage
Domain - The domain value can be used by the MT system to
improve processing accuracy
Translate - Parts that are not translation-relevant are marked
(and protected).
MT Confidence - Communicates the MT system's confidence in the
quality of the translation it generated.
Terminology - Requires the MT system to translate specific words
or phrases according to terminological information.
Provenance - Allows tracking of human agents (content editors)
or software agents (MT systems) that processed the content.
Benefits:
The use of XLIFF allows an MT system to be integrated
seamlessly into automated localization workflows involving
commercial Translation Management Systems and Computer Assisted
Translation (CAT) tools.
The use of XLIFF and ITS 2.0 facilitates the integration
of/switch between multiple MT systems to provide alternative
translation within a single project workflow.
The use of the ITS 2.0 "translate" attribute ensures that
content is not altered by the MT system - especially if that
content is included in a translation project as context for human
agents such as translation post-editors.
The ITS 2.0 "domain" metadata in XLIFF ensures that the most
relevant MT engine can be selected by the MT system.
Combining XLIFF and ITS 2.0 "terminology" metadata enforce the
MT system to translate specific words or phrases according to
terminological information.
Integrating ITS 2.0 MT confidence scores into XLIFF target
language translation enables them to be presented to translation
post-editors.
Recording provenance information enables localization managers
to compare the performance of different MT engines or systems, or
different translation post-editors.
2.10.3
More Information and Implementation Status/Issues
2.11
XLIFF-based CMS-to-TMS Roundtripping for HTML & XML
2.11.1
Description
SOLAS is a service-based architecture for orchestrating
localization workflows among XLIFF-aware components.
One of the SOLAS components is an additional, OKAPI-based
Extractor/Merger service that maps ITS 2.0 data categories onto XLIFF
1.2.
SOLAS is also integrated with CMS-L10N and can receive/return
XLIFF jobs created by CMS-L10N.
CMS-L10N (aka LION) is basically a middleware component based
on an RDF triple store over an arbitrary CMS (tested with
Alfresco, Drupal and Wikimedia).
It can parse the source, including most of the ITS 2.0 metadata, and
produce XLIFF 1.2 according to the currently agreed mapping. After
the roundtrip, which is handled via SOLAS, it updates the RDF
triple store accordingly.
Benefits:
The use of ITS 2.0 and XLIFF helps to modularize and connect
specialized (single-purpose) components.
SOLAS can handle input from components that are aware of different
ITS 2.0 data categories, or not aware of ITS at all, and combine them.
SOLAS orchestration ensures basic ITS compliance even with ITS-unaware
components. For example, if a service provider is unaware of the
translate flag, SOLAS can filter the translation request for that
provider, so that the flag is still honored.
2.11.2
Data category usage
Translate
Localization Note
Terminology
Directionality
Language Information
Elements Within Text
Domain
Text Analysis
Locale Filter
Provenance
External Resource
Target Pointer
Id Value
Preserve Space
Localization Quality Issue
Localization Quality Rating
MT Confidence
Allowed Characters
Storage Size
2.11.3
More Information and Implementation Status/Issues
Implementer: TCD/UL, making use of MT components by Moravia and
DCU, and of JSI Enrycher as the Text Analysis service.
The goal is to freeze the mapping and to produce a best
practice note within the lifespan of the LT-Web project.
The focus is currently on XLIFF 1.2, favoring solutions that can
be structurally preserved in XLIFF 2.0, which is the main target in
the long run.
Although all ITS data categories listed above, as encoded by OKAPI or
TCD's CMS-LION, are covered, the demos in mid-March show consumption
of mainly the following: translate, term, text analysis, domain,
localization note, provenance, and MT confidence. The demos involve:
An XLIFF-based source quality assurance tool (LKR by UL)
A Project Manager/Localization Engineer friendly XLIFF
Viewer/Editor (LocConnect by UL)
Integrated Machine Translation Solutions
Moravia's implementation of M4Loc and Moses with ITS 2.0
support
DCU MaTrEx with ITS 2.0 support
Fallback handling of the ITS 2.0 information within SOLAS MT
Service Mapper with services that are not ITS 2.0 aware, such as
Microsoft Bing
Details (M4Loc processing of ITS 2.0-enhanced XLIFF files):
Add the ability to apply ITS 2.0 local metadata through
Drupal's WYSIWYG editor.
Add the ability to apply global ITS 2.0 metadata at content
mode level.
Implemented a jQuery plugin to optimize the GUI of the
Translation Management tool (a standalone jQuery download has
also been published).
Benefits:
Support for ITS 2.0 in Drupal facilitates the
localization/translation of Drupal-based content.
The Drupal modules facilitate the roundtripping process between the
WCMS and the systems of a Localization Service Provider (including
automatic content re-integration).
The Drupal modules enable tracking of provenance information
(e.g. to identify translation post-editors).
2.12.2
Data category usage
Translate - Mark content which should not be translated and
highlight this marked content.
Localization Note - Add a note for the translator to improve
their understanding of the content so that they can produce a better
translation.
Domain - Set the domain of a text to improve the machine and
human translation process.
Provenance - Check which translator/reviser worked on content.
Allowed Characters/Storage Size - Make the translator aware of
restrictions for specific content, such as disallowed characters or
a maximum translation length. These constraints are
automatically set by Drupal.
Text Analysis - Annotate text with terminology metadata to
improve the machine and human translation process.
2.12.3
More Information and Implementation Status/Issues
Tool: Drupal Module for editing and viewing of ITS 2.0 markup
(Cocomore AG)
2.13
Integrating ITS, Content Management Interoperability
Services, and W3C Provenance
2.13.1
Description
Localization interoperability can be enhanced by using additional
standards alongside ITS 2.0. In particular, the following standards
provide additional opportunities:
OASIS Content Management Interoperability Services (CMIS) to
externally associate multiple ITS 2.0 rules files with large sets
of documents, and to retrieve those documents regardless of the
Content Management System in use
W3C Provenance
(PROV) to track which human agents or software agents processed
the content; tracking can span multiple agents/components, while
allowing individual tracking records to be easily consolidated via
linked data approaches
Benefits:
Enables ITS 2.0 annotations to be associated with multiple
documents via the CMS without editing individual files. This
reduces source content internationalization and document
management costs. Furthermore, it reduces annotation errors.
Allows fine-grained tracking and analysis of Language
Technology (LT) components, human agents (language workers) and
service providers - even across multiple organizations, projects,
and heterogeneous process landscapes. This reduces the overhead
costs in tracking, monitoring, analyzing and optimizing the
localization workflows - especially of the critical elements
within them (e.g. MT engines, human terminologists and
translators)
Enables tracking of human linguistic judgments and their
influence on the output of LT components. Tracking data can be
curated for retraining/retuning those LT components (e.g.
Statistical Machine Translation or text analysis components)
Tracking information can be mapped to the W3C
PROV Ontology (PROV-O) which expresses the PROV Data Model using
the OWL2 Web Ontology Language (OWL2), and stored in Resource
Description Framework (RDF) triple stores.
2.13.2
Data category usage
Provenance - Tracks MT-based translation and translation
revision through a post-editing interface. Tracking is implemented
as standoff provenance records in XLIFF files. The post-editing
records detail which of the MT outputs was used if multiple MT
outputs are offered to the post-editor. The agent's ITS
annotations (from translation and translation revision) are mapped
to PROV-O triples in the accompanying RDF provenance logs. (A
standoff provenance record of this kind is sketched after this list.)
Text Analysis - Calls a text analysis service (e.g. Enrycher) on the
source HTML file for Named Entity Recognition annotations. These
annotations are also mapped into XLIFF files. This annotation
results in the logging of activities performed on an 'analysed text'
entity in the PROV-O triple store.
Terminology - Allows text annotated by Named Entity
Recognition, as well as other phrases, to be identified as terms
and used to populate a multilingual glossary. If the text analysis
annotation returns a DBpedia reference, a query for the label used
in the equivalent target language page can be attempted to
populate the term target in the glossary. The terminology
annotation and the glossary are mapped to XLIFF as well as
resulting in a 'term' entity being tracked in the PROV-O
provenance logs.
MT Confidence - This is used to annotate - in XLIFF - the
assumed quality of output of MT engines. MT Confidence is also
tracked for the translation entities generated by MT in the PROV-O
logs.
Domain - Mapped from HTML source document to XLIFF, and used to
annotate PROV-O entities representing source units, i.e. the
source content of translation units.
Translate - Mapped from HTML source document to XLIFF, and used
to annotate PROV-O entities representing source units, i.e. the
source content of translation units.
Where available, and not already specified by explicit ITS
provenance annotation, annotatorsRef was used to derive PROV-O agent
details for specific activities, e.g. text analysis and terminology.
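A minimal sketch of an ITS 2.0 standoff provenance record of the kind described for the Provenance data category above; the agent names and IRIs are illustrative:
<its:provenanceRecords xml:id="pr1"
    xmlns:its="http://www.w3.org/2005/11/its">
  <its:provenanceRecord
      toolRef="http://example.com/mt-engine"
      org="Example LSP"
      revPerson="post-editor-17"
      provRef="http://example.com/prov/activity/123"/>
</its:provenanceRecords>
<!-- Referenced from the annotated content, e.g.
     <target its:provenanceRecordsRef="#pr1">...</target> -->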
2.13.3
More Information and Implementation Status/Issues
2.14
Text Analysis - Named Entity Recognition and Enrichment
2.14.1
Description
Named entities (e.g. names of persons, places, or products) in
HTML content are recognized using the Natural Language
Processing (NLP) tool Enrycher.
The entities are enriched in the following ways:
the identity is computed/disambiguated (so that, for example,
London (England) and London (Ontario) can be distinguished)
a category (e.g. geographic name/place) is assigned
Both the entity recognition and the enrichment generate markup
which, among other things, allows tracking of the software agent/NLP
tool that was used (see the markup sketch after this list)
Enriched, disambiguated content facilitates processing for
source and target languages (among other reasons because it provides
context to translators)
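A minimal sketch of the enriched HTML markup, using the London example above; the annotator reference and the entity and class IRIs are illustrative:
<p its-annotators-ref="text-analysis|http://example.com/enrycher">
  The conference takes place in
  <span its-ta-ident-ref="http://dbpedia.org/resource/London,_Ontario"
      its-ta-class-ref="http://example.org/ontology#Location">London</span>.
</p>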
Benefits:
The ITS 2.0 markup provides the key information about entities,
so they can be correctly processed. Example: one may employ
specific translations, transliterations, officially mandated
translations, or even keep the original.
Content management systems may use disambiguated, enriched
content for providing entity-centric browsing and retrieval
functionality.
2.14.2
Data category usage
Text Analysis - Mark fragments of content which mention named
entities; enrich the content by additional information such as a
URI denoting the entity's identity.
Text Analysis - Mark fragments of content with individual word
meanings; enrich the content by additional information such as a
URI denoting the word's meaning.
2.14.3
More Information and Implementation Status/Issues
Implementation of NLP tools for providing the Domain data
category annotations.
2.15
Automated Terminology Annotation
2.15.1
Description
Term candidates in HTML5, XLIFF and plaintext are annotated by
humans or software agents (automatic term candidate annotation).
Automatic term candidate annotation can comprise:
Term candidate recognition based on existing terminology
resources (e.g., term banks, such as EuroTermBank or IATE)
Term candidate identification based on unguided terminology
extraction systems (e.g., ACCURAT Toolkit or TTC TermSuite)
Content analysis and terminology mark-up are performed by a Web
Service API with the following functionality:
Support for ITS 2.0 metadata (Terminology, Language
Information, Domain, Elements Within Text and Locale Filter
data categories);
Annotation of the content by the two above-mentioned
methods. The API breaks down the content along the Language and
Domain dimensions and uses terminology annotation services
provided by the TaaS platform in order to identify terms and
link them with the TaaS platform.
Visualization capabilities are provided for the annotated
terminology allowing human users access to the annotation results.
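A minimal sketch of the terminology mark-up that such a service might add to HTML5 content; the annotator reference, term entry IRI, and confidence value are illustrative:
<p its-annotators-ref="terminology|http://example.com/taas-annotator">
  Tighten the
  <span its-term="yes"
      its-term-info-ref="http://example.com/termbank/entry/12345"
      its-term-confidence="0.78">cylinder head bolt</span>
  to the specified torque.
</p>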
Benefits: The Web Service API can be integrated in automated
language processing workflows, for instance, machine translation,
localization, terminology management and many other tasks that may
benefit from terminology annotation.
2.15.2
Data category usage
Domain - The domain information is used to split and analyze
the content per domain separately. This allows filtering terms in
the term bank-based terminology annotation as well as identifying
domain-specific content using unguided term extraction systems.
The user is asked to provide a default domain for the term
bank-based terminology annotation. This user-supplied domain will
be overridden with ITS 2.0 domain metadata if present in the
content.
Elements Within Text - The information is used to decide which
elements are extracted as in-line codes and sub-flows.
Language Information - The language information is used to
split and analyze the content per language. The user will be asked
to provide a source (default) language, however, the default
language will be overridden with ITS 2.0 Language Information
metadata if present in the content.
Locale Filter - When used, only the text in the locale
specified by the user-defined source language is analyzed. The
remaining content is ignored.
Terminology - For existing terminology metadata, the mark-up is
preserved (terminology mark-up overlaps are not allowed). For new
terminology metadata, terms are marked according to the
Terminology data category’s rules.
2.15.3
More Information and Implementation Status/Issues
The implementation has reached Milestone 2 (Initial HTML5 term
tagging with simple visualization). The implementation for the
Milestone 3 (Enhanced HTML5 term tagging with full visualization) is
ongoing.
Detailed slides: will be made available at the end of May, 2013
Source code: will be made available at the end of May, 2013
General documentation: will be made available at the end of
May, 2013
2.16
Universal Preview of ITS Metadata in XML, XLIFF, and HTML
Files
2.16.1
Description
XML-based source content such as XLIFF files is usually provided to
translators or reviewers as reduced and partially transformed text,
without any information about local or global context and without
support for rendering/visualization of the content itself or of the
metadata embedded in the content. In sum, this has negative effects on
the quality of the final output and on the productivity of human workers.
The usage scenario allows rendering of content and metadata for
easy and interactive reading as reference material in a
browser. The rendering includes special visual cues and interaction
possibilities (such as colour-coding and pop-ups for the metadata to be
displayed). It is based on auxiliary HTML5+ITS 2.0 files
(including JavaScript) that are generated from ITS-annotated source
content in any of the supported formats (XML, XLIFF, HTML).
2.16.2
Data category usage
All ITS 2.0 data categories
2.16.3
More Information and Implementation Status/Issues
Implementer: Logrus
Implementation status: Prototype will display Translate,
Localization Note, and Terminology data categories at the MultilingualWeb
Workshop March 2013.
2.17
ITS in word processing software
2.17.1
Description
The tool - ITS for Libre Office Writer Extension (ILO) - allows
use of a subset of ITS 2.0 in open source word processing
software (Libre Office).
Capabilities include:
Tagging phrases and terms as “not to translate” (translate)
Tagging words as “term” (terminology)
Tagging words for a specific locale only (locale filter)
Providing additional information for the translator
(localization note)
The Libre Office extension and its software packages allow
users to:
Load ITS 2.0 annotated XML files (ODT, XLIFF)
Visualize ITS 2.0 metadata in the WYSIWYG editor of Libre
Office
Edit text related to ITS 2.0 metadata
Save and export the text, including ITS 2.0 markup, into
the original file format (ODT, XLIFF)
2.17.2
Data category usage
Terminology - One or several words can be marked up as “term”
Translate – Mark content as “to translate” or “not to
translate”
Localization Note – Pass a message (information, alert) to
human agents (such as translators)
Locale Filter – Limit content to specific locales
2.17.3
More Information and Implementation Status/Issues
ILO uses OKAPI capabilities for XLIFF handling and will be
available in April 2013. The use of ILO will be presented at the MultilingualWeb
Workshop in March 2013. The results of ILO development will be
given back to the community under the open license LGPL v3
(the same as Libre Office).
2.18
Training for Statistical Machine Translation
2.18.1
Description
ITS 2.0 bilingual data is collected in a Content Management
System, and passed to a Statistical Machine Translation (SMT)
system for training the system's language models.
If domain information is supplied for the content, domain-aware
modules in the SMT system are trained on the corresponding
content.
Benefits:
The ITS 2.0 markup provides key information to drive the
reliable extraction of domain-specific content.
MT systems trained on domain-specific data allow for
potentially more accurate translation.
2.18.2
Data category usage
Translate - Parts that retain their original form are passed
through the MT as-is.
Language Information - Used to select the appropriate MT
language models.
Domain - Domain values direct the selection of/training of the
appropriate MT language models.
2.18.3
More Information and Implementation Status/Issues
Tools involved: Cocomore CMS and MaTrEx MT system.
Tool: MaTrEx Domain-Tuning MT Tool. The Tool is currently in
development.
3.
Authors and Implementation Contributors
Renat Bikmatov (Logrus), David Filip (University of Limerick), Leroy
Finn (Trinity College Dublin), Karl Fritsche (Cocomore AG), Serge
Gladkoff (Logrus), Declan Groves (Centre for Next Generation
Localisation (CNGL), Dublin City University), Milan Karasek (Moravia),
Jirka Kosek (University of Economics, Prague), Kevin Lew (Spartan
Software), Dave Lewis (Trinity College Dublin), Fredrik Liden (ENLASO
Corporation), Shaun McCane ((public) Invited expert), Sean Mooney
(University of Limerick), Pablo Nieto Caride (Linguaserve), Pēteris
Ņikiforovs (Tilde), David O'Carrol (University of Limerick), Philip
O'Duffy (University of Limerick), Mauricio del Olmo (Linguaserve),
Mārcis Pinnis (Tilde), Phil Ritchie (VistaTEC), Nieves Sande (German
Research Center for Artificial Intelligence (DFKI) GmbH), Felix Sasaki (W3C
Fellow), Yves Savourel (ENLASO Corporation), Sebastian Sklarß (]init[
Europe), Ankit Srivastava (Centre for Next Generation Localisation
(CNGL), Dublin City University), Tadej Štajner (Jozef Stefan Institute),
Chase Tingley (Spartan Software), Asanka Wasala (University of
Limerick), Clemens Weins (Cocomore AG).