- Number of slides: 30
LIS 618 lecture 1 Thomas Krichel 2002-09-15
Organization • homepage http://wotan.liu.edu/home/krichel/lis618n02a • Contents to be discussed today. • Send mail to [email protected] org – Your name – Your secret word for grades delivery • Interrupt me with as many questions as possible! • Ask for breaks!
Proposed Organization • Normal lecture • Quiz at the beginning of every lecture. • Main quiz next week (25% of grade) • Search exercise 55% • Other quizzes 10% • Formal syllabus to be made early next week!
Search exercise • find victim • conduct interview about an information need experienced by the victim, write down expectations • search in Dialog and on web • discuss results with the victim • write essay, no longer than 7 pages.
Structure of talk • First talk about me, then about you and the course • General round trip on theoretical matters. – Context of database searching – Database searching and information retrieval – The retrieval process – Information retrieval models – Retrieval performance evaluation – Query languages • Logging on to Dialog • Web searching exercise (if time permits)
About me • Born 1965, in Völklingen (Germany) • Studied economics and social sciences at the Universities of Toulouse, Paris, Exeter and Leicester. • Ph.D. in theoretical macroeconomics • Lecturer in Economics at the University of Surrey, 1993 to 2001 • Since 2001 assistant professor at the Palmer School
Why? • During my research assistantship period (1990 to 1993), I was constantly frustrated with difficult access to scientific literature. • At the same time, I discovered easy access to freely downloadable software over the Internet. • I decided to work towards downloadable scientific documents. This led to my library career (eventually).
Steps taken I • 1993 founded the NetEc project at http://netec.mcc.ac.uk, later available at http://netec.ier.hit-u.ac.jp as well as at http://netec.wustl.edu. • These are networking projects targeted to the economics community. The bulk is – Information about working papers – Downloadable working papers – Journal articles were added later
Steps taken II • Set up RePEc, a digital library for economics research. It catalogs – Research documents – Collections of research documents – Researchers themselves – Organizations that are important to the research process • Decentralized collection, model for the Open Archives Initiative
Steps taken III • Co-founder of Open Archives Initiative • Work on the Academic Metadata Format • Co-founded rclis, a RePEc clone for Research in Computing, Library and Information Science
summary • There are three basic types of models in classic information retrieval. • Extensions of these types are a matter of research concern and require good mathematical skills. • All classic models treat documents as individual pieces.
Database searching (DS) • a subset of the subject of information retrieval (IR) • DS is mainly thought of as applying to large structured databases, as opposed to web searching • for those, a general knowledge of what databases are seems useful • Concentrate on textual databases
traditional social model • user goes to a library • describes problem to the librarian • librarian does the search – without the user present – with the user present • hands over the result to the user • user fetches full-text or asks a librarian to fetch the full text.
economic rationale for traditional model • In olden days the cost of telecommunication was high. • database use costs – cost of communication – cost of access time to the database • the traditional model puts an upper bound on costs
disintermediation • with the cost of access time gone, the traditional model is under threat • there is disintermediation, where the librarian loses her role • but that may not be good news for information retrieval results – user knows subject matter best – librarian knows searching best
Web searching • IR has received a lot of impetus through the web, which poses unprecedented search challenges. • with more and more data appearing on the web, DS may be a subject in decline, because it is primarily concerned with non-web databases
Main theory part • Literature: "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto • Don't buy it. It is not a good book.
before the IR process • provider – define data that is available • documents that can be used • document operations • document structure – index • user – user need – IR system familiarity
the IR process • query expresses user need in a query language • processing of query yields retrieved documents • calculation of relevance ranking • examination of retrieved documents • possible relevance cycle
main problem • user is not an expert at the formulation of a query • garbage in, garbage out: the retrieval yields poor results • ways out – design very intuitive interface – give expert guidance
key aid: index • index term is a part of the document that has a meaning on its own (usually a noun) • retrieval based on index term raises questions – semantics in query or document is lost – matching done in imprecise space of index terms • predicting relevance is a central problem • the IR model determines the process of relevance ranking
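The idea of retrieving documents through index terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy collection is hypothetical, and every whitespace-separated word is treated as an index term (the full-text view), whereas the slide suggests that mostly nouns are used.

```python
from collections import defaultdict

def build_index(documents):
    """Map each index term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: every whitespace-separated word is an index term.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy collection, not taken from the lecture.
documents = {
    1: "library catalog search",
    2: "web search engine",
    3: "library database",
}

index = build_index(documents)
print(sorted(index["search"]))   # [1, 2]
print(sorted(index["library"]))  # [1, 3]
```

Matching then happens in the space of index terms only: word order and context in the original documents are no longer visible to the search.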
taxonomy of classic IR models • Boolean, or set-theoretic – fuzzy set models – extended Boolean • vector, or algebraic – generalized vector model – latent semantic indexing – neural network model • probabilistic – inference network – belief network
basic concepts: index term • an index term is a word whose semantics help to remember the document's main themes. • nouns are mainly used • if all words are index terms, the logical view of the document is full text
basic concept: weight of index term • given all nouns, not all appear to have the same relevance to the text • sometimes, we can have a simple measure of the importance of a term, example? • more generally, for each indexing term and each document we can associate a weight with the term and the document. • usually, if the document does not contain the term, its weight is zero
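One simple candidate for such a measure is the number of times a term occurs in a document. The slide does not name a particular measure, so raw term frequency is an assumption here, and the document text and term list are hypothetical.

```python
from collections import Counter

def term_weights(text, index_terms):
    """Weight of each index term in one document, taken as its raw count.

    Assumption: raw term frequency stands in for 'importance'.
    A term that does not occur in the document gets weight 0.
    """
    counts = Counter(text.lower().split())
    return {term: counts[term] for term in index_terms}

# Hypothetical document and index terms.
doc = "search the database then search the web"
weights = term_weights(doc, ["search", "database", "library"])
print(weights)  # {'search': 2, 'database': 1, 'library': 0}
```

Each weight depends only on the term and the document, which is the simplification discussed under mutual term independence.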
basic concept: mutual term independence • Thinking of the weight of a term as a function of the document and the term only implies that it is independent of other terms. • This is an important oversimplification. • But it allows for fast computation. • No study has shown that not assuming independence brings significant performance gain.
Boolean model • in the Boolean model, the weight of any index term for any document is 1 if the term appears in the document, and 0 otherwise. • This allows query terms to be combined with the Boolean operators AND, OR, and NOT • thus powerful queries can be written
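A minimal sketch of binary weights and the three operators, assuming a hypothetical collection of four documents, each represented by its set of index terms. The document names, terms, and queries are illustrative and do not come from the lecture.

```python
def weight(term, document_terms):
    """Binary weight: 1 if the term appears in the document, 0 otherwise."""
    return 1 if term in document_terms else 0

# Hypothetical documents, each given as its set of index terms.
docs = {
    "d1": {"retrieval", "library"},
    "d2": {"retrieval", "web"},
    "d3": {"retrieval"},
    "d4": {"library", "web"},
}

def search(predicate):
    """Return the names of the documents for which the query holds."""
    return [name for name, terms in docs.items() if predicate(terms)]

# retrieval AND library
and_hits = search(lambda t: weight("retrieval", t) == 1 and weight("library", t) == 1)
# library OR web
or_hits = search(lambda t: weight("library", t) == 1 or weight("web", t) == 1)
# retrieval AND NOT web
not_hits = search(lambda t: weight("retrieval", t) == 1 and not weight("web", t) == 1)

print(and_hits)  # ['d1']
print(or_hits)   # ['d1', 'd2', 'd4']
print(not_hits)  # ['d1', 'd3']
```

A document either satisfies the query or it does not; nothing in the binary weights allows the hits to be ordered by how well they match.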
example: a AND (b OR NOT c) • document 1: abc • document 2: ab • document 3: ac • document 4: cb • document 5: c • document 6: b • document 7: a
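The query on this slide can be evaluated with ordinary set operations. The sketch below assumes that the seven strings listed on the slide are the index terms of documents 1 to 7 in that order; the helper name `postings` is illustrative.

```python
# Documents 1-7 and the index terms they contain, read off the slide.
documents = {1: "abc", 2: "ab", 3: "ac", 4: "cb", 5: "c", 6: "b", 7: "a"}

all_docs = set(documents)

def postings(term):
    """Set of ids of the documents that contain the given term."""
    return {doc_id for doc_id, terms in documents.items() if term in terms}

a, b, c = postings("a"), postings("b"), postings("c")

# AND is intersection, OR is union, NOT is complement within the collection.
result = a & (b | (all_docs - c))
print(sorted(result))  # [1, 2, 7]
```

Document 3 contains a but fails the bracket because it has c and lacks b; document 7 passes because it has a and no c.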
advantages of Boolean model • supposedly easy to grasp by the user • precise semantics of queries • implemented in the majority of commercial systems • why is it set-theoretic?
problems of Boolean model • sharp distinction between relevant and irrelevant documents • no ranking possible • users find it difficult to formulate Boolean queries
http://openlib.org/home/krichel Thank you for your attention!