

  • Number of slides: 38

Sharing Data and Language Resources: Technical Aspects and Best Practices
Stelios Piperidis, ELRC, ILSP/Athena RC
ELRC Workshop in Romania, 23.03.2016

Illustration of data packaging workflow
Data → LRs (Language Resources)
Value chain activity:
• Basic documentation
• Identification & Selection of Data
• Cleaning & Conversion (content, container)
• Validation
• Processing of LRs (e.g. Alignment)
• Description & Storage of LRs
• Legal status determination (PSI vs. Licensing)
• Privacy handling and acceptance (i.e. anonymization)
• Upload data to the Repository & Sharing
Diagram labels: Partnership, Market knowledge, Industry network, ELRC, Public Partner, ELRC / EC

Issues to address (1)
Workflow steps: Basic documentation; Identification & Selection of Data; Legal status determination
• Identification of sources
• Identification and selection of data sets (raw data)
Legal issues
• PSI vs. Licensing
• Privacy and ethics management
Diagram labels: Partnership, Market knowledge, Industry network

Legal issues
• Procedural issues
  – Open data by default, e.g. PSI
  – Data requests
• Licensing
  – ELRC can help with the procedures
  – Model licensing agreements
    • Government Open Licenses
    • Standard Re-use Licenses
    • License interoperability

Issues to address (2)
Workflow steps: Identification & Selection of Data; Basic documentation; Legal status determination (PSI vs. Licensing)
• Documentation with basic identification elements (languages, domains, year, …)
Technical issues
• Choice of medium and data formats for the transfer of the "raw" data (preference for the ELRC ad hoc platform)
Diagram labels: Partnership, Market knowledge, Industry network

Any digital textual data!!

Issues to address (3)
Workflow steps: Cleaning & Conversion (content, container); Privacy handling and acceptance (i.e. anonymization)
Technical issues (cont.)
• Cleaning of data format
  • encoding / character sets, e.g. UTF-8
  • discarding formatting (e.g. bold, italic), graphics, ads, tables, HTML tags, etc.
  • …
Diagram labels: ELRC, Market knowledge, Industry network

Formatting example
(The same English and Greek parallel text appears twice on the slide, in two different layouts.)

English: Greece is a place of culture, the arts and sciences. Its tradition of contribution to global cultural and scientific communities, combined with its outstanding natural beauty and excellent infrastructure, has made it an ideal place in which to hold conferences. Over the last few years, Greece has more and more frequently welcomed people of letters, sciences and the arts, who have participated in symposia, conferences and exhibitions. Athens International Airport ‘Eleftherios Venizelos’, one of the most modern airports in the world in operation since 2001, greatly boosted the organization of international conferences.

Greek: Η Ελλάδα αποτελεί έναν χώρο πολιτισμού, τέχνης και επιστημών. Η μακραίωνη συμβολή της στο παγκόσμιο γίγνεσθαι, σε συνδυασμό με το μοναδικό φυσικό κάλλος και τις άρτιες υποδομές, την καθιστούν ιδανικό τόπο διεξαγωγής συνεδρίων. Τα τελευταία χρόνια, η ελληνική επικράτεια υποδέχεται όλο και συχνότερα ανθρώπους των γραμμάτων, των επιστημών και των τεχνών, οι οποίοι συμμετέχουν σε συμπόσια, συνέδρια και εκθέσεις. Ο Διεθνής Αερολιμένας Αθηνών «Ελευθέριος Βενιζέλος», ένα από τα πλέον σύγχρονα αεροδρόμια παγκοσμίως, ο οποίος λειτουργεί από το 2001, έδωσε μεγάλη ώθηση στη διοργάνωση διεθνών συνεδρίων.
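
The cleaning steps listed under "Issues to address (3)" (decoding to a common character set and discarding markup) can be sketched with the Python standard library alone. This is only an illustration: the function name `clean_html` and the choice of NFC normalization are assumptions, not part of the ELRC workflow as presented.

```python
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the content of <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_html(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Decode raw bytes, drop HTML tags, and return normalized plain text.

    The returned str can then be written out as UTF-8.
    """
    text = raw.decode(source_encoding, errors="replace")
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    plain = unicodedata.normalize("NFC", "".join(parser.parts))
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join(plain.split())
```

For example, a page stored as ISO-8859-7 could be cleaned with `clean_html(page_bytes, "iso-8859-7")`; bold and italic tags disappear while the Greek text is preserved.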

Issues to address (4)
Workflow steps: Cleaning & Conversion (content, container); Privacy handling and acceptance (i.e. anonymization)
Technical issues (cont.)
• File cleaning (e.g. conversion to XML, XLIFF, etc.)
• Data anonymization
Diagram labels: ELRC, Market knowledge, Industry network
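
As an illustration of converting cleaned text into an XML container, the sketch below writes aligned sentence pairs as a minimal TMX document. TMX is chosen here as an assumption (the slide names XML and XLIFF); the function name `to_tmx` and the header values such as `creationtool` are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs, src_lang, tgt_lang):
    """Serialize (source, target) sentence pairs as a minimal TMX string."""
    root = ET.Element("tmx", version="1.4")
    ET.SubElement(root, "header", {
        "creationtool": "example-script",   # hypothetical tool name
        "creationtoolversion": "0.1",       # hypothetical version
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(root, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(root, encoding="unicode")
```

ElementTree escapes special characters such as `&` and `<` in segment text, which is one reason to build the file with an XML library rather than by string concatenation.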

Data anonymization
• Identify a large source of data on individuals, organizations, etc.
• Use a Named Entity Recognizer (NER) to find and remove private biodata (names, locations, dates, birth information, etc.) and replace it with generic placeholders
• Confirm that the results meet the acceptance requirements
  – Reject the data if the anonymization is not as accurate as required
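
The replace-with-placeholders step can be sketched as follows. Note the assumptions: a real pipeline would use a trained Named Entity Recognizer, whereas this sketch stands in for one with two regular expressions and a caller-supplied list of names; `anonymize`, `leaks`, and the placeholder strings are hypothetical.

```python
import re

# Simplified patterns standing in for a real NER component.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
DATE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")


def anonymize(text, known_names=()):
    """Replace e-mail addresses, numeric dates and listed names with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = DATE.sub("[DATE]", text)
    # Longest names first so "Maria Popescu" is replaced before "Maria".
    for name in sorted(known_names, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", "[PERSON]", text)
    return text


def leaks(text, known_names=()):
    """Return True if any pattern the sketch knows about is still present."""
    return bool(
        EMAIL.search(text)
        or DATE.search(text)
        or any(name in text for name in known_names)
    )
```

A resource would be accepted only if `leaks(anonymize(text, names), names)` is false; that check mirrors the accept/reject decision on the slide but covers only the patterns the sketch knows about.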

Issues to address (5)
Workflow step: Validation
• Validation and quality control of the output of the anonymization procedure
• Validation and quality control of the output (Language Resource format, content): accept / reject LR
Diagram label: Public partner
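
A format-level acceptance check might look like the sketch below. It assumes the Language Resource is delivered as a UTF-8 XML file with `<seg>` elements (as in TMX); the function name `validate_lr` and the three checks are illustrative assumptions, not ELRC's actual validation procedure.

```python
import xml.etree.ElementTree as ET


def validate_lr(data: bytes) -> list:
    """Return a list of problems found; an empty list means 'accept'."""
    problems = []
    # 1. Container check: the bytes must be valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8: {exc.reason}")
        return problems
    # 2. Format check: the file must be well-formed XML.
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        problems.append(f"not well-formed XML: {exc}")
        return problems
    # 3. Content check: no segment may be empty or whitespace only.
    empty = sum(1 for seg in root.iter("seg") if not (seg.text or "").strip())
    if empty:
        problems.append(f"{empty} empty segment(s)")
    return problems
```

Returning a list of problems rather than a boolean lets the reviewer report why a resource was rejected.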

Issues to address (6)
Workflow steps: Processing of LRs (e.g. Alignment); Description & Storage of LRs; Upload data to the Repository & Sharing
• Data preparation and processing for Automated Translation tools (e.g. alignment)
• Description of the Language Resource (metadata)
• Packaging and delivery (data repository with e-sharing) to the EC and the owner
Diagram labels: ELRC / EU, Market knowledge, Industry network
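
To make the alignment step concrete, here is a deliberately naive sketch that pairs sentences one-to-one and drops pairs whose lengths differ implausibly. It is an assumption-laden simplification: production aligners also handle one-to-many matches, and the names `split_sentences`, `align_one_to_one`, and the `max_ratio` threshold are hypothetical.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def align_one_to_one(src, tgt, max_ratio=2.5):
    """Pair the n-th source sentence with the n-th target sentence.

    Raises ValueError if the sentence counts differ; pairs whose
    character-length ratio exceeds max_ratio are discarded as suspicious.
    """
    s, t = split_sentences(src), split_sentences(tgt)
    if len(s) != len(t):
        raise ValueError(f"sentence counts differ: {len(s)} vs {len(t)}")
    pairs = []
    for a, b in zip(s, t):
        ratio = max(len(a), len(b)) / max(1, min(len(a), len(b)))
        if ratio <= max_ratio:
            pairs.append((a, b))
    return pairs
```

The resulting pairs are the unit that translation memories and Automated Translation training data are built from.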

Cooperation actions
• Identification of sources
• Identification and selection of data sets (raw data)
  – Data can be obtained from visible sources (e.g. harvested from the web)
  – Data can be handed over by public sector players
  – Public sector players can boost the identification of visible sources
• The processing indicated above can be carried out jointly by ELRC and the data provider

How can ELRC help?
• Support for all procedures and technical issues
  – Support services
    • ELRC portal

ELRC portal
www.lr-coordination.eu
(Screenshot goes here)

How can ELRC help?
• Support for all procedures and technical issues
  – Support services
    • ELRC portal
    • technical & legal support helpdesk

ELRC portal: Helpdesk
(Screenshot goes here)

How can ELRC help?
• Support for all procedures and technical issues
  – Support services
    • ELRC portal
    • technical & legal support helpdesk
    • forum

ELRC portal: Web Forum
(Screenshot goes here)

How can ELRC help?
• Support for all procedures and technical issues
  – Support services
    • ELRC portal
    • technical & legal support helpdesk
    • forum
    • repository for sharing LRs

ELRC-SHARE repository (1)

ELRC-SHARE repository (2)

How to Contribute Data (1/8)
• Go to the ELRC-SHARE Repository: elrc-share.ilsp.gr
• Click the Register button

How to Contribute Data (2/8)
Register / Login
• Register
• Activate account
• Login
Contribute data
• Describe
• Upload

How to Contribute Data (3/8)
• Fill in the info
• Read the Terms of Service and click Accept if you agree
• Click the Create Account button

How to Contribute Data (4/8)
• Your request is acknowledged and an activation email is sent to the address you indicated
• Check your email and click the activation link

How to Contribute Data (5/8)
• You are redirected to the data contribution form (or click the Contribute Resources button)

How to Contribute Data (6/8)
• Fill in the details of the dataset

How to Contribute Data (7/8)
• Browse your computer for the respective .zip file containing your data
• Click Submit

How to Contribute Data (8/8)
• Repeat the process if you want to contribute another resource, or log out
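
The .zip file uploaded in step 7/8 can be prepared with a few lines of Python. This is a sketch under stated assumptions: bundling a `metadata.json` file is a convention invented for this example (the repository collects the dataset description through its web form), and the field names used for the metadata are hypothetical, chosen to match the basic identification elements named earlier (languages, domains, year).

```python
import json
import zipfile
from pathlib import Path


def package_resource(files, metadata, out_zip):
    """Bundle data files plus a JSON description into one .zip archive."""
    out_zip = Path(out_zip)
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in map(Path, files):
            # Store each file under its base name, without local directories.
            zf.write(f, arcname=f.name)
        # Keep non-ASCII characters readable in the description.
        zf.writestr(
            "metadata.json",
            json.dumps(metadata, ensure_ascii=False, indent=2),
        )
    return out_zip
```

A call such as `package_resource(["corpus.tmx"], {"languages": ["en", "ro"], "domain": "legal", "year": 2016}, "resource.zip")` produces the archive to select in the upload form.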

Conclusions
• Repurposing existing data (human translations) is the best way to improve Automated Translation quality
• Data-driven paradigms provide an efficient way to leverage value from existing resources
• ELRC can help review data for suitability (at any phase)
• Do not underestimate the value of your language resources; plan for a Data Management Plan

Best practice for the future: Capitalize on your valuable data
Best Practice in Data Management

My data in the future
• Now that I know the value of data, what should my plans be?
• What are the best ways to collect, maintain, archive and re-use my data?
• In particular, how can I use it to improve MT performance?

Main phases of data development
Value chain activity:
• Basic documentation
• Identification & Selection of Data
• Cleaning & Conversion (content, container)
• Validation
• Processing of LRs (e.g. Alignment)
• Description & Storage of LRs
• Legal status determination (PSI vs. Licensing)
• Privacy handling and acceptance (i.e. anonymization)
• Upload data to the Repository & Sharing
• Sustainable storage
Diagram labels: Market knowledge, Industry network
This can be part of the data management plan (DMP)

Concerns in creating a DMP
• Anticipate all potential legal issues
  – Ensure that your data IPRs are cleared
  – Ensure that the producing parties respect your ownership rights (e.g. in relations with LSPs, ensure you keep all rights)
  – Ensure that all intermediary documents produced are yours (e.g. translation memories)
  – Check privacy issues in advance and plan for anonymization if necessary
• Define your management plan with respect to the task
  – This has to account for the main goal (e.g. document writing, document translation, etc.)
• Plan for repurposing (from documentation to LRs)
  – Request data in a usable format (not only PDFs but also TMX/Word/XML/TXT)
  – Make sure that your data is stored on an up-to-date medium (no CDs?)
• Plan for future publication and sharing as Public Sector Information (PSI)

Key elements of a Data Management Plan
– Specifications
  • Ensure that the original documents are described
  • Ensure that your needs are described
  • Anticipate what you can get as valuable resources (a side effect)
– Production
  • Whether production is internal or outsourced, check that the tools used are compatible with your needs and beyond (e.g. CAT, MT, etc.)
  • Ask for the list of tools and production software
  • Check whether you can get the texts in multiple languages aligned to each other
  • Keep clear documentation of the data being produced (metadata)

Key elements of a Data Management Plan
– Validation
  • In addition to your own quality control, you may want to use some of the validation tools (alignment editors, etc.)
– Sharing/distribution
  • Ensure your data falls within the PSI directive as transposed in your country
  • If not, plan for an open and permissive licence
  • If privacy is an issue, plan the necessary procedures to handle it
– Maintenance/preservation
  • See how ELRC can assist you
  • There is also the option of a national or European open data portal

Key elements of a Data Management Plan