The Estonian Language in the Digital Age (White Paper Series)


The consortium also decided on the coordination of the dissemination activities. An international conference on powerful technologies for the multilingual European information society. This idea was reflected in the session topics. The exhibition was a good opportunity for LT companies to network and showcase their work on a European stage where they were seen by decision makers, MEPs and other VIPs.

How to ensure the sustainability of the work done so far, by harmonizing language resources after the EU funding stage, is an issue of high importance that must be clearly understood by all EU-funded LT projects. The conference offered an excellent opportunity for all the METANORD partners to exchange ideas with members of other related EU projects and to find out more about LT companies, language research and industry possibilities in other countries.

We hope that the tremendous work of all those who made the conference happen and especially Prof. Their effort to ensure that META-NET, a Network of Excellence forging the Multilingual Europe Technology Alliance, would attract more and more members and would become richer in innovative ideas on the future of language technology in Europe is remarkable.

Every day the volume of terminology is growing along with the increasing volume of information available on the web.

Efficient terminology acquisition and management has become an essential component of intelligible translation, localization, technical writing and other professional language work. The current models for finding, sharing and using terminology data cannot keep up with growing demand in multilingual Europe.

The role of terminology, however, is more important today than ever in ensuring that people communicate efficiently and precisely.

Consistent, harmonized and easily accessible terminology is an extremely important prerequisite for ensuring unambiguous multilingual communication in the European Union and throughout the world. The planned workshop aimed at bringing together academic and industrial players in the area of terminology and attracting holders of terminology resources.

The workshop also focused on fostering cooperation between EU projects and research and development activities in the area of terminology, along with sharing experience and discussing recent advances in the consolidation, harmonization, and distribution of terminology resources, as well as their practical application in various domains. The other two sister projects, CESAR and METANET4U, have been invited to take part in the workshop to ensure coherence between the three projects and to seek potential synergies, this time within the consolidation of multilingual terminology resources.

The topic of the conference: A Strategy for Multilingual Europe. During the conference the META Exhibition will be organized, including 40 exhibits, booths and stands from research and industry. An Open Resource Exchange Infrastructure. Also Andrejs Vasiljevs will present "Translation Cloud" in the session: Overview and Key Results" in the session: The event ran smoothly, judging from the participants and interested parties.

In this year's conference all the members of the Meta-Net projects: Therefore all the projects took a fair space in the project village and in the main conference, and together arranged a workshop. The projects were also presented in various poster and oral sessions. Traditionally, LREC brings together a large number of specialists interested and working in the field of language resources and their evaluation. We would like to invite everyone interested in language resources to visit the META-NORD booth and learn about the goals and recent advances of the project.

Eight papers were given and six posters presented on various aspects of LT, such as speech synthesizers, search engines, speech recognizers, corpora and language in society. The conference was hugely successful and the approximately 70 attendees were quite impressed with the work being done on LT in Iceland. In conjunction with the conference, Icelandic LT received considerable attention in the media: It is one of the services offered by the Forum of e-Excellence EMF, the European cross-stakeholders' network promoting excellence in the digital economy.

Meta-Nord was invited to the event as the event aimed to link knowledge about semantic technologies, languages, databases and publications. The event offered information and demonstrations of language banks, semantic technology and modern publishing solutions. One presentation showed how HFST can be used for improving input for mobile phone text messages and another demonstrated how the quality of spelling suggestions can be improved.

Recommendations

Partners of the project discussed first year results, progress so far in each work package. The plan of activities and future tasks have been discussed and considered: To sum up, the main goal to spread the information about the project among participants of the biggest Book Fair in the Baltic countries was reached. The presentation focused on the validation of different methods for wordnet building; an approach which gained high interest at the conference. The seminar that was held at the Lithuanian Language Institute was aimed at mobilising the research community, state organisations and business.

They were introduced to the first-year results of Meta-Nord, were given a presentation and took part in a discussion about the Meta-Share idea and its possibilities. The representatives who are working on metadata resources demonstrated the work that had been done to make the Lithuanian language resources available for the Meta-Share system. This was a promotion event for Meta-Share. Within this course different language resources and applications of LRT were presented. From August 24 to September 4, the EUROLAN Summer School, held in Cluj-Napoca, Romania, in the heart of Transylvania, provided one week of intensive study of the natural language processing technologies currently under development to support industrial applications.

Internationally known scholars and researchers, with the particular involvement of scientists from the Multilingual Europe Technology Alliance (META), as well as industry practitioners involved in leading-edge work in innovative areas of natural language processing, gave lectures, tutorials, hands-on labs and demos at the school to share in-depth understanding and experience with students. On the 30th of September a metadata workshop took place in Helsinki. Before the workshop each participant was asked to collect as many questions as possible about problems with creating metadata, since part of the meeting time was dedicated to solving the issues encountered.

The exhibition reflected both the research and industry aspects of the META community. That was a fantastic opportunity for everybody to network with others in the field and to showcase their work to a large and influential audience drawn from stakeholders across research, industry and various national and European institutions.

The white papers report on the state of each European language with respect to language technology and explain the most urgent risks and opportunities. The series covers all official European languages. The workshop participants discussed how existing resources can be made more visible and available, and initiated discussions on how closed resources can be made accessible, especially with the aim of fostering multilingual and cross-lingual language technology not only between the Nordic languages but also between Nordic majority and minority languages as well as between Nordic and non-Nordic languages.

Word processors correct mistakes in our writing, we depend on search engines to find information, we use machine translation to access information in foreign languages, and we give voice commands to cars and voice queries to our smartphones: language and technology are intertwined everywhere. Language technology is a key enabler of the knowledge society Rehm and Uszkoreit, a. Comprehensive support for and usage of a language on popular technological platforms is as important today as the introduction of languages into printing was a few hundred years ago.

The so-called secondary Gutenberg effect puts a language under the threat of extinction if it is not sufficiently equipped for the digital age. This resembles the gradual disappearance of many languages that were not used in printed publications. For this reason the leading European experts have raised the alarm about the differences in technology support between the various European languages.

The 30 volumes of this series describe language technology support for the 23 official European Union languages and several other major European languages Rehm and Uszkoreit, b. According to the expert findings, only the largest European languages have comprehensive technological support. Major gaps and deficiencies in key technological tools and resources put the long-term survival of at least 21 European languages at serious risk. These risks are particularly worrying for the Baltic languages Latvian and Lithuanian, which are among the smallest official EU languages. The Lithuanian economy is in the 83rd position and Latvian in st position on the global scale 1 IMF, Consequently these markets are not sufficiently large to motivate global IT companies to make significant investments in technological development for the Baltic languages.

National research budgets in the Baltic countries are very limited. Government budget appropriations for research and development in these countries are among the very lowest in the EU Eurostat, Public investments are far from sufficient to lay the necessary research foundation for scientifically grounded language technology development. The technological gap has further widened due to the lack of dedicated language technology funding in the EU 6th Framework Programme for Research and Development after significant investments for other EU languages in the ies and ties when the Baltic countries were not yet members of the EU.

In this paper we describe the approach taken by the language technology company Tilde 2 to address these challenges in its mission to provide Baltic users with the same technological opportunities that are enjoyed by the larger language communities. Tilde was established in in Latvia right after this country regained its independence. In these years Tilde has grown to become the leading language technology and localization service company in the Baltic countries with offices in Riga, Vilnius and Tallinn.

Tilde develops widely used software tools for Baltic languages like spelling and grammar checkers, electronic dictionaries, spoken language technologies, terminology databases and machine translation systems. Latvian and Lithuanian are the only two surviving languages of the Baltic branch of the Indo-European language family. These are among the oldest European languages. It should be noted that Estonian — the language of the third Baltic country — is not in the same linguistic group and is very different from Latvian and Lithuanian.


It does not belong to the family of Indo-European languages but to the Finnic branch of Uralic languages. Though apparently small, Lithuanian and Latvian rank th and th respectively among the most spoken languages out of about 6, languages on our planet. Latvian is spoken by about 1. Lithuanian is spoken by twice as many native speakers totaling more than 3 million speakers in Lithuania and other countries. Both Latvian and Lithuanian are highly inflected languages that have rich morphology with various derivational means. In Baltic languages word order is relatively free and syntactic relations are determined by morphological forms.

Baltic languages in the digital age

Both languages use the Latin alphabet supplemented with diacritical marks, which differ between the two languages. The Lithuanian alphabet has 32 letters and the Latvian 33 letters. Lithuanian is considered the most conservative of the living Indo-European languages, preserving many features that have since disappeared in other languages. Although it has changed more over time, Latvian shares many similar features, such as rich inflection, free word order and complex grammatical structure. Latvian and Lithuanian are among the 23 official languages of the European Union and are the sole official languages of the Republic of Latvia and the Republic of Lithuania respectively.

Although Latvian and Lithuanian belong to the same language group and are spoken in neighboring countries, speakers of both languages cannot communicate with each other freely Rehm and Uszkoreit, b. Both languages are well represented on the Web. The number of internet domains with the extension. Assessment included the evaluation of language technology support and the core application areas of language and speech technology e. The comparison presents the situation for four key areas: This study puts the smaller languages of the Baltic region — Latvian, and Lithuanian — in the last cluster, defined as having major gaps for all of the four key areas see Table 1.

However, tools for information extraction, machine translation, and speech recognition, as well as resources such as parallel corpora, speech corpora, and grammar, are rather simple and have limited functionality for some of the languages. For the most advanced tools and resources, such as discourse processing, dialogue management, semantics and discourse corpora, and ontological resources, most of the languages either have nothing of the kind, or their tools and resources have a quite limited scope.

Besides the objective difficulties of dealing with complex languages, there are also other obstacles, such as the lack of continuity in research and development funding. Due to limited funding, Latvian language technology support has not reached the quality and coverage available for English, nor that available for many under-resourced languages of the Baltic and Nordic region with fewer speakers. Icelandic, Irish, Latvian, Lithuanian, Maltese.

With internet infrastructure becoming omnipresent, language diversity remains one of the last barriers to accessing information and to cross-national communication. The exponentially growing volume of multilingual information by far exceeds the capacity of human translators to meet the demand for translation. Machine translation is the only viable solution for instant and cheap access to information in foreign languages. This is why machine translation (MT) is among the most critical language technologies for the Baltic languages as well. Machine translation has been a particularly difficult problem in the area of natural language processing since it was first proposed as a concept in the early ies.

Until recently the dominant approach in MT was the so-called rule-based strategy. It relies on linguistic rules and rich translation lexicons to analyze the source language text and generate a translation in the target language. This approach was widely used in developing MT solutions for larger languages and resulted in numerous commercial MT systems, e.

The major drawback of the rule-based MT strategy is the immense time and human resources needed for every language pair and for enhancing the quality of translation. In recent years statistical machine translation (SMT) has provided a major breakthrough in MT development. SMT provides a cost-effective and fast way to create MT systems. In this approach statistical models are built by analyzing huge volumes of parallel and monolingual text. One model guides the MT system in translating (the translation model) and the other in forming target language sentences (the language model).

In the translation process, from all the possible translation candidates the SMT system selects the one with the highest statistical probability to be the translation of a given source sentence. Another factor which facilitated the development of MT for many languages was the availability of the EU translation corpus and other parallel data on the internet.
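The selection step described above can be illustrated with a minimal sketch. It assumes a toy setting in which each candidate already has a translation model probability and a language model probability; the function name, the candidate sentences and all probability values are hypothetical and are not taken from any system described here. Real SMT decoders search a far larger candidate space and combine more model scores.

```python
import math


def best_translation(candidates, translation_model, language_model):
    """Return the candidate with the highest combined log-probability."""
    def score(candidate):
        # Adding log-probabilities is equivalent to multiplying probabilities.
        return (math.log(translation_model[candidate])
                + math.log(language_model[candidate]))
    return max(candidates, key=score)


# Hypothetical probabilities for the candidate translations of one source sentence.
translation_model = {
    "the house is small": 0.4,
    "small the house is": 0.5,
    "the home is little": 0.1,
}
language_model = {
    "the house is small": 0.3,
    "small the house is": 0.01,
    "the home is little": 0.2,
}

best = best_translation(list(translation_model), translation_model, language_model)
print(best)
```

In this toy case the second candidate has the highest translation model probability, but the language model penalizes its unnatural word order, so the first candidate wins overall.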

The EuroMatrix project 3 demonstrated how open source tools and publicly available data can be used to generate SMT systems for all language pairs of the official EU languages. However, the quality of an SMT system largely depends on the size of training data. Obviously the majority of parallel data is in major languages. As a result SMT systems for larger languages are of much better quality compared to systems for under-resourced languages.

This quality gap is further deepened by the complex linguistic structure of the Baltic languages. To learn the complexity of rich morphological structure and free word order from corpus data by statistical methods, much larger volumes of training data are needed than for languages with simpler linguistic structure. For example, although the popular SMT system Google Translate currently covers more than 70 languages, translation quality for the Baltic languages is significantly worse than for English, French, Spanish and other larger languages.

Tilde is putting a great deal of effort into collecting data to train SMT systems for the Baltic languages. The OPUS translated text collection (Tiedemann) contains publicly available texts from the web in different domains. Among the proprietary corpora collected by Tilde is a localization parallel corpus obtained from translation memories created by the Tilde Localization department during translation of software content, appliance user manuals and software help content. To increase word coverage we supplement parallel texts with word and phrase translations from bilingual dictionaries.



When assessing baseline SMT systems trained on this data, we found that they were weak at picking the correct inflectional forms for translated lexical units. This had a strongly negative effect on the fluency of translation, particularly for adjective-noun and subject-object agreement. It showed that for highly inflected languages a language model over surface forms might not be sufficient to estimate the probability of a target sentence reliably. The tags contain relevant morphological properties (case, number, gender, etc.). We applied these tags as additional factors in so-called factored SMT to improve local word agreement and inter-phrase consistency.
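The idea of scoring morphological tags alongside surface forms can be sketched as follows. This is only an illustration under stated assumptions, not the factored SMT setup used by the authors: the tag names, the bigram probabilities and the helper functions are hypothetical, and a real factored system trains its tag language model on tagged corpora.

```python
import math

# Hypothetical tag-bigram probabilities: agreeing adjective-noun pairs are
# much more likely than pairs that disagree in gender.
TAG_BIGRAMS = {
    ("adj-fem-sg-nom", "noun-fem-sg-nom"): 0.6,
    ("adj-masc-sg-nom", "noun-fem-sg-nom"): 0.05,
}
UNSEEN = 1e-4  # small fallback probability for tag pairs not in the table


def tag_sequence_logprob(tags):
    """Sum the log-probabilities of adjacent tag pairs."""
    return sum(math.log(TAG_BIGRAMS.get(pair, UNSEEN))
               for pair in zip(tags, tags[1:]))


def pick_by_tags(candidates):
    """Choose the candidate whose morphological tag sequence scores highest.

    Each candidate is a list of (surface form, morphological tag) pairs.
    """
    return max(candidates,
               key=lambda cand: tag_sequence_logprob([tag for _, tag in cand]))


# Two Latvian renderings of "big house"; only the first agrees in gender.
agreeing = [("liela", "adj-fem-sg-nom"), ("māja", "noun-fem-sg-nom")]
disagreeing = [("liels", "adj-masc-sg-nom"), ("māja", "noun-fem-sg-nom")]

chosen = pick_by_tags([disagreeing, agreeing])
print(" ".join(surface for surface, _ in chosen))
```

Because the tag vocabulary is far smaller than the vocabulary of inflected surface forms, such a model can learn agreement patterns from much less data than a model over words alone.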

Although in automated evaluation BLEU metric scores showed only a slight improvement To make it easier and faster to use the potential of existing open SMT technologies for Latvian, Lithuanian and other smaller languages, Tilde initiated the development of the online cloud-based SMT platform called LetsMT! An easy-to-use online interface allows the user to select parallel and monolingual data from the cloud repository, specify a few parameters and launch cloud-based training of a custom SMT system.

The user can also train a system on their own proprietary data uploaded to the system.


Depending on the amount of data selected, training of an SMT system may take from an hour to a couple of days. User-trained systems are automatically evaluated using BLEU and other popular metrics. This is particularly handy for running several experiments to find the combination of training data that results in the best quality for a particular translation domain. Localization and translation businesses as well as other professional translators can use the LetsMT!
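To give a sense of what a BLEU-style score measures, here is a simplified sentence-level sketch. It assumes whitespace tokenization and a single reference translation; the function names are hypothetical. Standard BLEU is computed over a whole test corpus, usually with n-grams up to length 4, so this is not a substitute for an established evaluation tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


score = simple_bleu("the cat sat on the mat", "the cat is on the mat", max_n=2)
print(round(score, 4))
```

Comparing such scores across systems trained on different data selections is what makes it practical to run several experiments and keep the best-performing combination.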

It helped Tilde to create numerous domain-specific and task-specific Latvian and Lithuanian SMT systems that provide significantly better translation quality than Google Translate. The parallel corpora currently available for the Baltic languages are not sufficient to further advance the quality of SMT.

In contrast to the notion of a parallel corpus, a comparable corpus can be defined as a collection of similar documents that are collected according to a set of criteria, e. Comparable corpora have several obvious advantages over parallel corpora: they can draw on much richer, more available and more diverse sources which are produced every day e. Although the majority of these texts are not direct translations, they share many common paragraphs, sentences, phrases, terms and named entities in different languages. The expansion of Web content with daily multilingual news feeds and large knowledge bases like Wikipedia makes comparable corpora more widely available than parallel corpora.

It contains tools for collecting comparable corpora, measuring comparability, aligning data at different levels and extracting data useful for training statistical machine translation (SMT) systems. Besides the task-specific tools, the toolkit also contains two general-purpose workflow chaining tools for particular usage scenarios: Tilde experimented with SMT domain adaptation for Baltic languages utilizing bilingual terms and bilingual comparable corpora collected from the Web.

The results of these experiments showed that integration of terminology within SMT systems, even with simple techniques (adding translated term pairs to the parallel data corpus or adding an in-domain language model), can achieve an SMT system quality improvement of up to Transformation of translation model phrase tables into term-aware phrase tables can boost the quality up to Data collected for Baltic languages supplements parallel and monolingual data stored in the repository of LetsMT!