
Geographical Information Retrieval
Instituto Superior Técnico - INESC-ID Data Management and Information Retrieval Group (DMIR) - TagusPark
By Bruno Martins (bgmartins@gmail.com)
Motivation for Geographic IR
• Geo-information associates things and events with places.
• Geo-information is abundant on the Web and in digital libraries.
– Collections of geo-referenced photographs.
– General databases of geo-referenced information.
– Blogs, newsfeeds, ...
– Around 80% of Web pages contain references to places.
• Many information needs are related to a given geographical context.
– Find me the nearest restaurants.
– Find me news about Lisboa.
– Find me photographs taken in Sintra.
– Around 20% of Web searches are “local” in nature.
• Geographic information is part of our everyday lives!
Existing Geographical IR Systems
• Web search engines with “local search”
– Yahoo! Local, Google Local, ...
– Integration with navigation mechanisms.
– Mostly explore “yellow-pages” information.
• Web-based GIS platforms (e.g. virtual globes)
– Google Earth, ...
– Explore databases of georeferenced info.
– OGC standards for Web-GIS.
• Photo repositories with “local search”
– Flickr geo-tagging interface, ...
– Explore automatic “GPS” geo-referencing.
• Many more location-based services
– Advertisement, discussion communities, ...
– Location is everywhere in information systems.
Challenges for Geographical IR
• Very few systems explore the information in Web texts “directly”.
– They instead use databases of geo-referenced information.
• Geographic context is embedded in natural language descriptions.
– This presents problems for automated processing.
– Place names are ambiguous and get confused with names of organizations, people, buildings and streets.
• Web queries depend on exact matching of text terms.
– Handling structured queries (e.g. “concept, relation, location”).
– Intelligent interpretation of spatial relationships (“near”, “west”, etc.).
– Ranking results against some measure of geographic relevance.
Geographical Information Retrieval (GIR)
• Geographic information retrieval (GIR) is concerned with the retrieval of geographically referenced information objects.
– Information objects can be maps, images, digital geographic data or even textual (Web) documents.
• New multidisciplinary field
– Combines techniques from database systems, information retrieval, digital libraries, user interfaces, natural language processing, knowledge engineering, geographical information systems, ...
[Diagram: Geographic IR shown at the overlap of Geographic Information Systems, Knowledge Management, and Information Retrieval and NLP]
The difference between GIR and GIS
• GIS is concerned with exact spatial representations and complex analysis at the level of the individual spatial object or field.
– Users are experts; information is structured and unambiguous!
• GIR is concerned with retrieving geo-referenced information resources that may be relevant to a geographic query region.
– Unstructured and ambiguous information, everyday applications!
• Similar to the difference between search engines and relational database systems!
Geo-referencing and GIR
• Information objects can be geo-referenced either by place names or by geographic coordinates (i.e. longitude and latitude).
– Geographic coordinates represent an exact physical location.
– Place names are ambiguous (the main problem of GIR).
• Spatial relations may be either:
– Geometric: distance and direction measured on a continuous scale.
– Topological: spatially related but not directly measurable.
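
The two kinds of spatial relation can be illustrated with a minimal Python sketch. It is not taken from the talk: the function names, the spherical-Earth approximation and the bounding-box layout `(min_lon, min_lat, max_lon, max_lat)` are all assumptions made for illustration. Distance and bearing are geometric (measured on a continuous scale), while box containment is a simple topological test.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Geometric relation: great-circle distance in km between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Geometric relation: initial compass bearing (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


def bbox_contains(outer, inner):
    """Topological relation: True if box `inner` lies entirely within box `outer`.

    Boxes are (min_lon, min_lat, max_lon, max_lat); this layout is an assumption.
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])
```

For example, `haversine_km(38.72, -9.14, 38.80, -9.38)` gives the approximate distance between two points near Lisboa and Sintra; the coordinates are rough illustrative values.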
The typical steps involved in GIR
Anatomy of a Geographical IR System
[Diagram components: mapping user interface; query disambiguation; query footprint; search request + query footprint; broker; information resources; text indexing; geotagging; document footprints; search engine; textual and spatial indexes; ontology (a.k.a. gazetteer); unranked results; relevance ranking; ranked results]
Gazetteers / Geographic Ontology
• A database containing place names, the spatial relationships among them and the associated geographical footprints.
• Supports geo-referencing based on the place names found in text.
Roles of the Gazetteer in GIR
[Diagram labels: user interface; metadata extraction; geo-tagging; query disambiguation; query expansion (query footprint); gazetteer; document collection; document footprints; spatial index; relevance ranking; search component]
Challenges to using Gazetteers in GIR
• For GIR, the gazetteer should support:
– Different locations and boundary changes, integrating data from multiple sources with frequent updates.
– Synonymous and variant names with differing locations for the same place name.
– Different relationships among concepts.
– Names in multiple languages.
– “Fuzzy” regions and intra-urban place names.
• More than gazetteers, we need an ontology!
Existing Gazetteer Systems/Services
• Alexandria Digital Library (ADL) Gazetteer.
– ~6 million entries.
– Has tried to standardize the format, description, and distribution of gazetteer data.
– Has a published, detailed schema.
– Basis for an OGC standard.
• Geonames website.
– Integrates information from multiple sources.
– Publishes an OWL ontology.
– ~6 million entries.
• EuroGeoNames project.
GeoTagging = GeoParsing + GeoCoding
• Geo-parsing: recognizing geographic references, ignoring non-geographic uses of place terminology.
• Geo-coding: attaching a unique quantitative location (footprint) to the extracted geographic references.
GeoParsing Textual Documents
• The presence of place names can be recognized with the help of gazetteers/geo-ontologies (i.e. lists of names).
• Some types of place references found in text:
– the name of the place: Coimbra
– an address: INESC-ID, Rua Alves Redol, 9 Lisboa
– an address fragment: “Manuel lived near Largo do Rato in Lisboa”
– a postcode / zip code: 2840-137
– a phone number: most Lisbon phone numbers start with +351 21
Ambiguity in GeoParsing Documents
Examples of false place references:
• Personal names: Smedes York, Jack London
• Business names: Dorchester Hotel, York Properties, ...
• Street names: Oxford Street, London Road, ...
• Common words: bath, battle, derby, over, well, ...
Approach for handling ambiguity:
– Look for patterns in the surrounding context.
– One reference per discourse.
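
The idea of looking for patterns in the surrounding context can be sketched as follows. This is a toy illustration, not the method of any system named in the talk: the gazetteer, the two cue-word lists and the `geoparse` function are hypothetical, and a real geo-parser would need far richer context rules.

```python
import re

# Hypothetical toy gazetteer and context cues, for illustration only.
GAZETTEER = {"London", "York", "Oxford", "Lisboa", "Coimbra"}
NON_PLACE_FOLLOWERS = {"Street", "Road", "Hotel", "Properties"}  # street/business cues
PERSON_PRECEDERS = {"Jack", "Smedes", "Mr", "Mrs", "Dr"}         # personal-name cues


def geoparse(text):
    """Return gazetteer names in `text` that survive simple context checks."""
    tokens = re.findall(r"\w+", text)
    found = []
    for i, tok in enumerate(tokens):
        if tok not in GAZETTEER:
            continue
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Reject "Oxford Street", "York Properties", "Jack London", ...
        if next_tok in NON_PLACE_FOLLOWERS or prev_tok in PERSON_PRECEDERS:
            continue
        found.append(tok)
    return found
```

With these toy lists, `geoparse("Jack London stayed near Oxford Street in London")` keeps only the final mention of London.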
GeoCoding place references in text
Many different places share the same name (referent ambiguity): Newport, Cambridge, Springfield, Lisboa, ...
• Use context to decide: references to parent or nearby places.
• Choose the most important one: by population or place type.
Optional step taken by some GIR approaches:
• Finding a document’s encompassing geographic scope.
– Combine all place references given in the document.
– Use heuristics to guide the process.
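
The two disambiguation heuristics above (context first, importance as a fallback) might be combined as in this sketch. The `Place` record, the `geocode` function and the candidate table are hypothetical, and the populations and coordinates are rough illustrative values rather than authoritative data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    parent: str       # name of the containing region
    population: int   # illustrative figure only
    lat: float
    lon: float


# Hypothetical candidate table; a real system would query a gazetteer.
CANDIDATES = {
    "Cambridge": [
        Place("Cambridge", "England", 125_000, 52.205, 0.119),
        Place("Cambridge", "Massachusetts", 105_000, 42.373, -71.110),
    ],
}


def geocode(name, context_names, candidates=CANDIDATES):
    """Pick one referent for `name`, or None if the name is unknown.

    Prefer candidates whose parent region is also mentioned in the
    document; otherwise fall back to the most populous candidate.
    """
    options = candidates.get(name, [])
    if not options:
        return None
    context = set(context_names)
    supported = [p for p in options if p.parent in context]
    pool = supported or options
    return max(pool, key=lambda p: p.population)
```

Calling `geocode("Cambridge", ["Massachusetts"])` selects the candidate whose parent appears in the context, while an empty context falls back to population.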
Assigning documents to GeoScopes
• Aggregate place name occurrences.
– Web-a-Where system, GIPSY, ...
• Use graph-ranking algorithms (e.g. PageRank).
– The most important nodes are the scope.
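
One simple way to aggregate place name occurrences is to let every mention vote for itself and, with decaying weight, for its ancestors in a part-of hierarchy. The sketch below is an assumption-laden illustration of that idea only; it is not the algorithm of Web-a-Where or GIPSY, and the `PARENT` table, the `assign_scope` function and the decay factor are hypothetical.

```python
from collections import defaultdict

# Hypothetical part-of hierarchy: child -> parent.
PARENT = {
    "Lisboa": "Portugal",
    "Sintra": "Portugal",
    "Coimbra": "Portugal",
    "Portugal": "Europe",
    "Madrid": "Spain",
    "Spain": "Europe",
}


def assign_scope(mentions, parent=PARENT, decay=0.5):
    """Return the hierarchy node with the highest aggregated score, or None."""
    scores = defaultdict(float)
    for place in mentions:
        weight = 1.0
        node = place
        while node is not None:
            scores[node] += weight    # the mention votes for this node
            weight *= decay           # ancestors receive a weaker vote
            node = parent.get(node)   # None once the root is passed
    if not scores:
        return None
    return max(scores, key=scores.get)
```

Under these assumptions, a document mentioning Lisboa, Sintra and Coimbra once each is assigned to Portugal, while a document mentioning only Lisboa three times stays at Lisboa.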
Document Indexing for Geographic IR
• Different indexing strategies are possible:
– Index documents based on gazetteer IDs.
– Use document scopes to create document footprints (point, bounding rectangle, ...) and use the footprints to index documents spatially.
• Strategy for handling queries:
– Convert the query to a query footprint/gazetteer ID.
– Match the query footprint to document footprints/IDs.
– Rank documents according to “relevance”.
Handling queries in GIR systems
Data structures for indexing in GIR
Inverted index example:
Term 1 → D1, D23, ...
Term 2 → D9, D11, D100, ...
Term 3 → D27, D85, ...
• Typical strategy is to have separate indexes.
– Inverted index for text.
– R-tree for footprints.
• Access the spatial index with the query footprint/gazetteer ID.
• Access the text index with the query terms.
• Merge results and find the intersection.
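
The separate-index strategy can be sketched in a few lines of Python. Note the loud simplification: the spatial side here is a linear scan over bounding boxes standing in for an R-tree, which is not part of the standard library. The class name `GeoTextIndex`, its methods and the box layout `(min_lon, min_lat, max_lon, max_lat)` are assumptions made for illustration.

```python
from collections import defaultdict


def boxes_overlap(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GeoTextIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # inverted index: term -> doc ids
        self.footprints = {}              # spatial side: doc id -> bounding box

    def add(self, doc_id, text, bbox):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.footprints[doc_id] = bbox

    def text_search(self, query):
        """Documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def spatial_search(self, query_bbox):
        """Linear scan; an R-tree would answer this without visiting every box."""
        return {d for d, b in self.footprints.items() if boxes_overlap(b, query_bbox)}

    def search(self, query, query_bbox):
        """Merge step: intersect the textual and spatial result sets."""
        return self.text_search(query) & self.spatial_search(query_bbox)
```

The result of `search` is unranked; ordering the surviving documents is left to a separate relevance-ranking step.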
Ranking search results in GIR
• Spatial similarity can indicate relevance.
– Documents whose spatial content is more similar to the spatial content of the query should appear first.
• But we need to consider all of:
– Thematic relevance: BM25, TF-IDF, ...
– Geographic relevance: proximity, containment, ... (both geometric, e.g. distance, and non-geometric, e.g. topology).
– Other importance metrics: PageRank.
• The state of the art is a linear combination of these scores.
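
A linear combination of thematic and geographic relevance could look like the sketch below. Everything specific in it is an assumption: the text score is taken to be already normalized to [0, 1], the geographic score is an exponential decay of distance with an arbitrary 10 km scale, and `alpha` is a hypothetical mixing weight.

```python
import math


def geo_score(distance_km, scale_km=10.0):
    """Map distance to (0, 1]: 1.0 at the query location, decaying with distance."""
    return math.exp(-distance_km / scale_km)


def combined_score(text_score, distance_km, alpha=0.5):
    """Linear combination: alpha * thematic + (1 - alpha) * geographic."""
    return alpha * text_score + (1.0 - alpha) * geo_score(distance_km)


def rank(docs, alpha=0.5):
    """Sort documents (dicts with 'text' and 'km' keys) by combined score, best first."""
    return sorted(docs, key=lambda d: combined_score(d["text"], d["km"], alpha), reverse=True)
```

Setting `alpha=1.0` reduces this to purely textual ranking, and `alpha=0.0` to purely geographic ranking; other importance metrics could be added as further weighted terms.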
Existing GIR systems: MetaCarta
The MetaCarta system
– Pioneer system addressing all aspects covered in this talk.
– Conducts geo-parsing and geo-coding of text documents, and sends back possible location references with relative strength scores.
– Uses Natural Language Processing (NLP) to find possible location references.
– Contains a gazetteer of ~14 million entries.
Other GIR Systems: Research projects
• Prototype system from the SPIRIT EU project
– Spatially-aware information retrieval on the Internet.
– Geo-tagging of Web documents based on a geo-ontology.
• Alexandria Digital Library
– Digital library of geo-referenced materials.
– Focus on development of a large gazetteer.
• GREASE, GIPSY, Web-a-Where, GeoXWalk, ...
– Many more research projects addressing GIR aspects individually.
– GeoCLEF evaluation contest, similar to TREC.
• Project DIGMAP under development at IST
– Digital library for old maps and historical cartography resources.
– Indexing metadata records for geographic retrieval.
Current Challenges in Geographic IR
• Improve “conventional GIR” components and methods.
– Geo-tagging, spatio-textual indexing and geo-relevance ranking.
– Improved understanding of spatial natural language terminology.
• Principled approaches for integration and evaluation of GIR.
• Better user interfaces for exploration of GIR results.
• Integration of geographical with temporal aspects.
– Everything we do happens in space and time!
• Creation of rich place ontologies with world-wide coverage.
– Fuzzy regions and intra-urban place names present challenges.
• Open GeoInformation Web services and Geospatial Semantic Web.
Where To Find More Information
• Georeferencing: The Geographic Associations of Information
– By Linda L. Hill (Author), MIT Press
• The Geospatial Web
– By Arno Scharl and Klaus Tochtermann (Editors), Springer-Verlag
• Proceedings of the Workshops on Geographical IR
– Edited by Chris Jones and Ross Purves (4th edition in 2007, Lisbon)
• Talk to me using the email address bgmartins@gmail.com