SMART QUALITATIVE DATA: METHODS AND COMMUNITY TOOLS FOR DATA MARK-UP

WHAT IS SQUAD?
Main aim: to explore methodological and technical solutions for 'exposing' digital qualitative data to make them fully shareable and exploitable.

Main objectives
• specify, test and propose an eXtensible Markup Language (XML) schema for storing and marking up qualitative data
• investigate requirements for contextualising qualitative data and developing standards for data documentation
• develop semi-automated tools, using natural language processing, for preparing marked-up qualitative data for sharing
• research tools for publishing and interrogating data via the web: Qualitative Data Mark-Up Tools (QDMT)

Collaboration between:
• UK Data Archive, University of Essex (lead partner)
• Language Technology Group, Human Communication Research Centre, School of Informatics, University of Edinburgh

18 months duration
1 March 2005 – 31 October 2006

UK DATA ARCHIVE-NLP COLLABORATION
ESDS Qualidata is using semi-automated mark-up of some components of its data collections, based on natural language processing (NLP) and information extraction:
• new partnerships created, with new methods, tools and jargon to learn
• new area of application for NLP: social science data
• growing interest in the UK in applying NLP and text mining to social science texts, both data and research outputs such as publications' abstracts

METADATA STANDARDS
The XML schema will specify a 'reduced' set of Text Encoding Initiative (TEI) elements:
• core tag set for transcription
• names, numbers, dates
• links and cross references
• notes and annotations
• text structure
• elements unique to spoken texts
• linking, segmentation and alignment
• advanced pointing (XPointer framework)
• text and AV synchronisation
• contextual information (participants, setting, text)

XML: enabling a standardised format for interview transcripts
XML: enabling web-enabled display, search and browse
[Figure: interview text with XML tags embedded]

Information about interviewee
Date of birth: 1930
Gender: female
Marital status: married
Occupation: pharmacy assistant
Geographic region: Scotland

LP: There's just one or two factual things first of all do you mind my asking how old you are?
G24: 49.
LP: And what schools did you go to?
G24: King Street, Woodside and Hilton.
LP: Uh-huh... and how old were you when you left the school?
G24: 14.
LP: And you work at the moment? What sort of work do you do?
G24: Well I've gone back to get shorter hours, I've went back to domestic, which I dinna really care for. But then I used to be in the pharmacy department at ARI... just pharmacy assistant. At least it was better than cleanin'! But then they've nae part-time workers there so...
LP: And did you work in the pharmacy long?

WHAT FEATURES DO WE NEED TO MARK-UP AND WHY?
Spoken interview texts provide the clearest and most common example of the types of encoding features needed. There are three basic groups of structural features:
• utterance, specific turn taker, defining idiosyncrasies in transcription
• links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher annotations)
• identifying information such as real names, company names, place names, occupations, temporal information

USING NLP TOOLS
Information Extraction (IE) is a sub-field of NLP which aims to identify key pieces of information in texts using 'shallow' analysis techniques. A typical IE system will perform Named Entity Recognition, where particular kinds of proper names and terms are identified, classified and marked up. This is a means of annotating documents with semantic metadata, enabling resource discovery and data exploration.

Identify atomic elements of information in text:
• personal names
• company/organisation names
• locations
• dates
• times
• percentages
• occupations
• monetary amounts

Example: Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc to become operations director of Arthur Andersen.

The Java interface tool developed in SQUAD is called CME.

ANNOTATION TOOL - ANONYMISE
This tool imports marked-up data from the CME NLP system. Named entities are highlighted and co-reference chains (e.g. numerous references to a single person) are identified.

CAPTURING AND DEFINING DATA CONTEXT
Rich context enables informed re-use of data. But defining how to provide context for raw data to make it more 'usable' is complex. ESDS Qualidata has spent ten years working in the area of sharing qualitative data, and has done much to establish informal ways of documenting raw data. Both micro and macro level features should be considered, including: how the research question was framed, the research application process, project progress, fieldwork situations and analysis processes. Fieldwork observations are useful, as are timelines and political chronologies.
Equally, when undertaking a replication or restudy, detailed information on sampling procedures, fieldwork approaches and question guides will be essential.

SQUAD has identified a minimal generic set of elements that represent a baseline for contextualising data. QUADS has produced an edited collection on this issue as a special edition of the journal Methodological Innovations Online: sirius.soc.plymouth.ac.uk/~andyp/.

In the annotation tool, names can be anonymised with chosen pseudonyms, and the mapping of names to pseudonyms is saved. Annotations are represented in an XML format using the NITE NXT model. NXT uses 'stand-off' annotation, where annotation is linked to or referenced by words.

AUDIOVISUAL ARCHIVING
Archiving and exposure of qualitative data in a way that faithfully represents its origins and context is important. Linking qualitative data to other distributed data sources, such as audio-visual or geo-coded data (for example maps), can afford creative and exciting ways of visualising data. The formalised and systematic archiving and sharing of digital audio-visual data from qualitative research is fairly new. SQUAD is helping to explore XML representation and display of audio-visual data.

DATA EXCHANGE STANDARDS
A uniform format for richly encoding qualitative research is necessary as it:
• enables preservation and re-use of metadata, data and annotation
• ensures consistency of presentation and description of data
• supports the development of common web-based publishing and search tools
• facilitates data interchange (e.g. between CAQDAS packages) and comparison among datasets

Progress:
• limited formal definition of a common XML vocabulary and Document Type Definition (DTD) based on the Text Encoding Initiative (TEI)
• testing of a new Qualitative Data Interchange Format (QDIF)
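As a rough illustration of the kind of XML encoding of interview transcripts discussed on this poster, the sketch below builds a small marked-up transcript from part of the interview extract. It is not the SQUAD schema or QDIF. The element names `transcript`, `header`, `meta` and `body`, and the functions `build_transcript` and `turns_by`, are hypothetical and chosen only for illustration; the `u` (utterance) element with a `who` attribute is loosely modelled on TEI's conventions for spoken texts.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fields, taken from the interviewee details on the poster.
METADATA = {
    "date_of_birth": "1930",
    "gender": "female",
    "marital_status": "married",
    "occupation": "pharmacy assistant",
    "region": "Scotland",
}

# (speaker, words) pairs from the interview extract.
TURNS = [
    ("LP", "And what schools did you go to?"),
    ("G24", "King Street, Woodside and Hilton."),
    ("LP", "Uh-huh... and how old were you when you left the school?"),
    ("G24", "14."),
]


def build_transcript(metadata, turns):
    """Build an XML tree with a metadata header and one <u> element per turn."""
    root = ET.Element("transcript")
    header = ET.SubElement(root, "header")
    for name, value in metadata.items():
        ET.SubElement(header, "meta", name=name).text = value
    body = ET.SubElement(root, "body")
    for n, (speaker, words) in enumerate(turns, start=1):
        # who = speaker label, n = position of the turn in the interview
        ET.SubElement(body, "u", who=speaker, n=str(n)).text = words
    return root


def turns_by(root, speaker):
    """Return the text of every utterance spoken by the given speaker."""
    return [u.text for u in root.iterfind("body/u") if u.get("who") == speaker]


root = build_transcript(METADATA, TURNS)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
print(turns_by(root, "G24"))
```

Because each turn carries its speaker in an attribute and the interviewee details sit in a separate header, a display or search tool can filter by speaker or by metadata field without re-parsing the raw transcript text.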
TOOLS PROGRESS
• defined header metadata for a standardised transcript
• defined and tested generic XML models for qualitative data
• tested and refined NLP tools for qualitative data
• built front end to NLP named entity tools
• chosen software to enable annotation of data
• explored data export formats for longer-term archiving
• investigated powerful XML-based indexing tools for searching and retrieving data
• investigated web display of multimedia data and pointers to other resources using XML, extending the functionality of ESDS Qualidata

From Autumn 2006:
• formalising data exchange standard
• key word extraction systems to help conceptually index qualitative data (text mining collaboration)
• exploring grid-enabling data (e-social science collaboration)

CONTACT
Louise Corti and Claire Grover
UK Data Archive
University of Essex
Colchester, Essex CO4 3SQ
Email: quads@esds.ac.uk
Tel: +44 (0)1206 872145
URL: quads.esds.ac.uk/squad