
Open and self-sustaining digital library services: the example of NEP. Thomas Krichel, 2005-06-29
introduction • The title "Open and self-sustaining digital libraries" was chosen before I was really aware of the needs of the audience. • I read in the announcement that I am supposed to talk about information retrieval and automatic text processing ("по информационному поиску и автоматической обработке текстов"). This is an area I don't know that much about, but I hope to ask some interesting questions. • I hope to find someone who is interested enough in some of them to work with me.
my background • I am a trained economist. An economist knows the price of everything and the value of nothing. • I am interested in free digital libraries. • "Free" can mean "бесплатный" (free of charge) or "свободный" (free as in freedom). I am more interested in the former than in the latter. • My work has mainly been on building such digital libraries. I am less concerned with the usage of such libraries. • Building and maintaining the library generates costs. How can it be given away for $0?
automation • Digital libraries could be entirely automated. • This is true if the purpose of the digital library is mainly to retrieve information. • Generally speaking, for information retrieval an automated system is quite sufficient. Examples are Google and CiteSeer.
limit to automation • The limit comes in when the library is used to assess underlying facts. • If we say "Thomas Krichel wrote paper X", the computer will not understand who Thomas Krichel is. Only a human can know for sure. • When the library is used for evaluative purposes, it needs some controlled human intervention. • By evaluative purpose I mean the purpose of saying how well a person or institution has performed.
evaluative purpose • This seems vague, but here are some evaluative issues in academic libraries: – which journal is the most cited in field X? – who has written the most papers in field Y? – which institution has the most researchers in field Z? • Human intervention is critical because of – the identification problems discussed above – the problem of abuse and fraud
why bother with evaluation? • For a self-sustaining, freely available digital library, the problem of contribution is critical. • Providers of data will have good incentives if the data that they contribute are used to evaluate performance. • In academic digital libraries a crucial ingredient that helps performance is visibility. Publish (in the sense of "make public") or perish, quite literally.
role of automated means • Ideally a digital library will use a mixture of automated and human activity. • We push automation as far as we can, and let humans do the rest. • The design and successful implementation of such digital libraries is a complex long-run task. • It can be helped if the digital library is also open.
Example: RePEc • This is what I am most famous for: I founded the RePEc digital library. Its creation in 1997 goes back to efforts that I made as early as 1993. • RePEc is a digital library that aims to document key aspects of the discipline of economics. • It is essentially a metadata collection, but it goes beyond document and collection metadata to collect data about academic authors and institutions. • These data on authors and institutions stand in relation to the document metadata.
RePEc is based on 440+ archives • WoPEc • EconWPA • DEGREE • S-WoPEc • NBER • CEPR • US Fed in Print • IMF • OECD • MIT • University of Surrey • CO PAH
to form a 300+k item dataset • 146,000 working papers • 154,000 journal articles • 1,600 software components • 900 book and chapter listings • 6,400 author contact and publication listings • 8,400 institutional contact listings
RePEc is used in many services • EconPapers • NEP: New Economics Papers • Inomics • RePEc author service • Z39.50 service by the DEGREE partners • IDEAS • RuPEc • EDIRC • LogEc • CitEc
institutional registration • This works through a system called EDIRC. • Christian Zimmermann started it as a list of departments that have a web site. • I persuaded him that his data would be more widely used if integrated into the RePEc database. • Now he is a crucial RePEc leader.
LogEc • It is a service by Sune Karlsson that tracks the usage of items in the RePEc database: – abstract views – downloads • A monthly usage summary is mailed by Christian Zimmermann to – archive maintainers – RAS registrants
authors' incentives • Authors perceive registration as a way to achieve common advertising for their papers. • Author records are used to aggregate usage logs across RePEc user services for all papers of an author. • This stimulates an "I am bigger than you are" mentality. Size matters!
NEP: New Economics Papers • NEP is a current awareness service for new working papers in RePEc. • Working papers are accounts of recent research findings prior to formal publication. • Formal publication takes about four years in economics, so no formally published paper is new.
NEP reports • NEP is a collection of subject-specific reports. • Each report is a serial. It has issues, usually every week. • Each report has – a code, e.g. nep-mic – a subject, e.g. microeconomics – an editor, i.e. the human who controls the contents. • A special NEP report, nep-all, contains all new papers.
history • I opened NEP in 1998. John S. Irons agreed to be the general editor. • The general editor is the person who – prepares nep-all – oversees the lists • In early 2005, the command structure was changed to – a general editor, who prepares nep-all – a managing director, who opens new reports and communicates with the editors – a controller, who watches what the editors are doing
editorial control • In the years 1999 to 2001 I took a rather peripheral interest in NEP. During this time many reports developed long editorial delays or were not issued at all. • Despite that, the number of reports still grew. • But there was no organization of the reports along subject lines within economics. • The report subject space is linear, with most subjects being covered.
coverage ratio analysis • In a paper by Krichel & Bakkalbasi, there is an effort to analyze the coverage ratio of NEP issues. This is the ratio of papers in nep-all that make it into at least one subject report. • Historical data show that the mean coverage ratio is not improving over time. Rather, it stays roughly constant at around 70%. • There are two theories that can help to explain the static nature of the coverage ratio.
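As an illustration of the quantity being measured, here is a minimal sketch of the coverage-ratio computation. The data structures (a list of paper identifiers for one nep-all issue and a mapping from report codes to the identifiers they announced) are assumptions for the sketch, not the actual NEP data format.

```python
# Sketch of the coverage-ratio computation described above.
# `nep_all` lists the paper ids in one nep-all issue; `subject_reports`
# maps each report code to the ids it announced (hypothetical structures).

def coverage_ratio(nep_all, subject_reports):
    """Share of nep-all papers that appear in at least one subject report."""
    announced = set()
    for ids in subject_reports.values():
        announced.update(ids)
    covered = [pid for pid in nep_all if pid in announced]
    return len(covered) / len(nep_all) if nep_all else 0.0

# Example with made-up identifiers:
nep_all = ["p1", "p2", "p3", "p4", "p5"]
subject_reports = {"nep-mic": ["p1", "p3"], "nep-lab": ["p3", "p4"]}
print(coverage_ratio(nep_all, subject_reports))  # 0.6
```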
coverage ratio theory I: target size • When editors compose a subject report, they have an implicit report size in mind. When nep-all is large, the editors will be more selective. That is, they will take a narrower view of the subject area. • The chance that a paper is included in a subject report is therefore likely to be smaller when the nep-all issue is large.
coverage ratio theory II: quality • Papers in RePEc differ in quality. • Some papers have problems with "substantive quality": they – come from authors who are unknown – come from institutions that have an unenviable research reputation – appear in collections that are unknown. • Some papers have problems with "descriptive quality": they are – not in English – missing an abstract.
empirical study • Krichel & Bakkalbasi investigate this using a binary logistic regression analysis. It estimates, for every paper that appeared in nep-all, the probability that it will get announced in at least one subject report. • They find support for both the target-size and the quality theory. There is strong empirical support that the series matters. There is also some empirical support that author prolificacy matters. • These results have been greeted with protests by the editors, who claim that they only consider the subject when making decisions.
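For concreteness, here is a minimal sketch of such a regression using statsmodels. The data frame, file name, and column names (announced, nep_all_size, series, author_papers, has_abstract) are illustrative assumptions, not the exact variables of the Krichel & Bakkalbasi study.

```python
# Hedged sketch of a binary logistic regression along the lines described
# above; the file and the column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

# papers: one row per paper in a nep-all issue, with
#   announced      1 if the paper appeared in at least one subject report
#   nep_all_size   number of papers in its nep-all issue (target-size theory)
#   series         the working-paper series it came from (quality theory)
#   author_papers  number of prior papers of its most prolific author
#   has_abstract   1 if an abstract is present
papers = pd.read_csv("nep_papers.csv")

model = smf.logit(
    "announced ~ nep_all_size + C(series) + author_papers + has_abstract",
    data=papers,
).fit()
print(model.summary())
```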
pre-sorting reports • As RePEc grows, the growing size of nep-all threatens the survival of NEP. • Editors simply don't want to cope with it. • In 2001 I developed an idea to pre-sort the report for the editors: a computer program would look at past issues of the report, extract features, and forecast the most likely papers. • Editors would then only need to look at the top part of the pre-sorted nep-all issue, not at the bottom.
current state of play • I extract the following features: – author names – title – abstract – keywords – Journal of Economic Literature (JEL) classifications – series • I remove punctuation, lowercase, and normalize using L2. • I submit the result to svm_light for classification. • I test using 300 records and use the rest for training.
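A sketch of this pipeline under some assumptions: the metadata fields are already available as plain strings per paper, scikit-learn is used to build the L2-normalized term vectors, and the vectors are written out in svm_light's input format. The field names, file paths, and helper names are illustrative, not the actual implementation.

```python
# Sketch of the feature pipeline described above; the keys on each `paper`
# dict are assumptions about how the RePEc metadata was parsed.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

def paper_to_text(paper):
    """Concatenate the feature fields into one lowercased, punctuation-free string."""
    parts = [paper.get(key, "") for key in
             ("authors", "title", "abstract", "keywords", "jel", "series")]
    text = " ".join(parts).lower()
    return re.sub(r"[^\w\s]", " ", text)   # strip punctuation

def build_svmlight_files(papers, labels, train_path, test_path, n_test=300):
    # Plain term frequencies, L2-normalized (no IDF weighting).
    vectorizer = TfidfVectorizer(use_idf=False, norm="l2")
    texts = [paper_to_text(p) for p in papers]
    X = vectorizer.fit_transform(texts)
    # Hold out the first n_test records for testing, train on the rest.
    dump_svmlight_file(X[:n_test], labels[:n_test], test_path)
    dump_svmlight_file(X[n_test:], labels[n_test:], train_path)

# The resulting files can then be fed to svm_light:
#   svm_learn train.dat model      svm_classify test.dat model predictions
```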
How well am I doing? • This is not a trivial question. Precision and recall are of little use here, because the system does not decide which documents are relevant; it produces an ordering, and only that ordering matters. We know the best and the worst outcome. • Some measures have been proposed that do take ordering into account, but they still need to be adapted to our case. • Ideally I want a measure that evaluates individual outcomes and has some normalization properties: – the value of the measure at the best outcome should be 1.
the hiking measure • One measure that I have developed is what I call the hiking measure. – I define a step as a swap of two adjacent documents in the outcome vector. – I write the number of steps it takes to get from an outcome x to the best outcome as s(x). – The hiking measure is then h(x) = 1 - 2 s(x) / (r (n - r)), – where n is the total number of documents and r is the number of relevant documents.
example r=2, n=5 • Here is the complete table of outcomes x and their hiking values h(x):
  1,1,0,0,0 → 1.0      0,1,0,1,0 → 0.0
  1,0,1,0,0 → 2/3      0,1,0,0,1 → -1/3
  0,1,1,0,0 → 1/3      0,0,1,1,0 → -1/3
  1,0,0,1,0 → 1/3      0,0,1,0,1 → -2/3
  1,0,0,0,1 → 0.0      0,0,0,1,1 → -1.0
• Problems: – no strict ordering: different outcomes have the same hiking value – violation of a "natural order of outcomes"
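The table can be reproduced with a short script. This is only a sketch: the helper names are mine, and the step count uses the fact that the minimum number of adjacent swaps equals the total displacement of the relevant documents from the top positions.

```python
# Sketch of the hiking measure, counting a step as a swap of two adjacent
# documents, as defined above.
from itertools import combinations
from fractions import Fraction

def steps_to_best(x):
    """Minimum number of adjacent swaps needed to move every relevant
    document (1) to the top of the outcome vector."""
    positions = [i for i, rel in enumerate(x, start=1) if rel]
    return sum(p - (i + 1) for i, p in enumerate(positions))

def hiking(x):
    n = len(x)
    r = sum(x)
    return 1 - Fraction(2 * steps_to_best(x), r * (n - r))

# Reproduce the r=2, n=5 table: all ways to place 2 relevant papers among 5.
for rel_pos in combinations(range(5), 2):
    x = [1 if i in rel_pos else 0 for i in range(5)]
    print(x, hiking(x))
```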
natural order • A conscientious editor will be concerned with how low the last relevant paper sinks. Thus, comparing two outcomes, the one whose last relevant paper sits higher up (at a smaller position number) should be preferred. • If two outcomes have the last relevant paper at the same position, the second-to-last relevant paper should be compared, and so on. • This leads to a complete ordering of outcomes.
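In code, the natural order can be expressed as a comparison key. The formulation below is my own reading of the rule above, with position 1 at the top of the list.

```python
# Sketch of the natural order: outcomes are compared by the position of the
# deepest relevant paper, ties broken by the next-deepest, and so on.
def natural_key(x):
    """Positions of the relevant papers, deepest first; comparing these
    tuples lexicographically (smaller is better) gives the natural order."""
    return tuple(sorted((i for i, rel in enumerate(x, start=1) if rel),
                        reverse=True))

a = [1, 0, 0, 1, 0]   # last relevant paper at position 4
b = [0, 1, 1, 0, 0]   # last relevant paper at position 3
assert natural_key(b) < natural_key(a)   # b is preferred under the natural order
```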
conjecture • A rational editor faces two penalties when composing the report: – examining a new paper – risking the loss of a relevant paper • I claim that, under a large class of formulations of the editor's choice problem, ranking outcomes by the natural order is consistent with minimizing the loss experienced by the editor. • But I cannot prove this.
one way for the computational implementation of the natural order • Derive an algorithm that associates consecutive natural numbers with each of the outcomes, ordered by the natural order. • The expected value is then trivial to compute, and a measure can be defined. • Does anyone know such an algorithm?
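Lacking a closed-form algorithm, a brute-force version is at least easy to state (and only feasible for small n and r): enumerate all C(n, r) outcomes, sort them by the natural-order key, and number them consecutively. The helper names are mine.

```python
# Brute-force numbering of outcomes in natural order.
from itertools import combinations

def natural_key(x):
    # Positions of the relevant papers, deepest first (the rule stated above).
    return tuple(sorted((i for i, rel in enumerate(x, start=1) if rel),
                        reverse=True))

def rank_outcomes(n, r):
    """Number all C(n, r) outcomes consecutively, best outcome first."""
    outcomes = [tuple(1 if i in rel else 0 for i in range(n))
                for rel in combinations(range(n), r)]
    outcomes.sort(key=natural_key)
    return {x: k for k, x in enumerate(outcomes)}

ranks = rank_outcomes(5, 2)
print(ranks[(1, 1, 0, 0, 0)])   # 0: the best outcome gets the smallest number
# With ranks 0 .. C(n,r)-1 under a uniform draw, the expected rank is
# (len(ranks) - 1) / 2, so a normalized measure is straightforward to define.
```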
a more flexible way for the computational implementation of the natural order • Pick y > 1. • Then evaluate any outcome as – the sum over positions p of i_p · y^p, where p is the position counted from the right, starting at 1, and i_p = 1 if the document at position p is relevant, 0 if not. • Example: for y = 2, this amounts to interpreting x as a binary number (up to a constant factor). • Example for y = 3: – 0 1 1 0 0 → 3^1·0 + 3^2·0 + 3^3·1 + 3^4·1 + 3^5·0 = 108 • Does anybody know the expected value?
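A sketch of this scoring rule follows; the expected value is obtained here by plain enumeration over all equally likely outcomes rather than by a closed formula, and the function names are mine.

```python
# Sketch of the y-based score: positions are counted from the right,
# starting at 1, and only relevant documents contribute.
from itertools import combinations
from statistics import mean

def score(x, y):
    n = len(x)
    return sum(rel * y ** (n - i) for i, rel in enumerate(x))

print(score([0, 1, 1, 0, 0], 3))   # 3**3 + 3**4 = 108, as in the slide

def expected_score(n, r, y):
    """Brute-force expected value over all C(n, r) equally likely outcomes."""
    scores = [score([1 if i in rel else 0 for i in range(n)], y)
              for rel in combinations(range(n), r)]
    return mean(scores)

print(expected_score(5, 2, 2))   # 24.8 for this small case
```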
outcome: average hike, 30 trials (best to worst) • exp 98.66, cis 98.35, spo 96.08, ets 95.75, tra 95.61, hea 95.50, dcm 94.76, geo 94.56, int 94.43, ecm 94.27, gth 94.09, dge 92.94, mon 92.54, eff 91.48, ene 91.46, ifn 90.64, ino 90.31, cba 90.04, fmk 89.90, ure 89.86, hpe 88.91, agr 88.89, evo 87.90, law 87.84, env 87.22, cul 86.39, cbe 85.76, ent 85.07, com 84.52, net 84.20, edu 83.80, lab 83.58, dev 83.55, cfn 82.84
some remarks • There is great diversity in the results. • Some topics are easier to classify automatically than others. The value of a report lies in what the human editor contributes beyond what the machine can recognize. • Unfortunately, manual inspection of poorly forecast reports suggests that the reason for the poor results may lie more in the inconsistency of editor decision making than in the forecasting technique. • This suggests that forecastability could be used as a way to monitor editor behavior.
how to improve • Clearly, word ordering is important in this area, since the different classes don't differ that much in word choice. • I can use all the keyword data in the RePEc database to find phrases to add to my feature set. • There may also be a way to automatically extract significant word combinations from titles and abstracts. • Finally, a combination with the quality criteria mentioned earlier may help, but it is not obvious how to do it.
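One possible way to extract candidate word combinations from titles and abstracts, sketched here with NLTK's bigram collocation finder; the input variable, frequency threshold, and example output are assumptions, not results from the NEP data.

```python
# Sketch: rank two-word combinations that co-occur far more often than chance.
# `texts` (a list of title+abstract strings) is a hypothetical input.
import re
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def candidate_phrases(texts, top_n=50):
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z]+", text.lower()))
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(5)              # ignore rare pairs
    measures = BigramAssocMeasures()
    return finder.nbest(measures.likelihood_ratio, top_n)

# phrases = candidate_phrases(texts)
# e.g. [('labor', 'market'), ('monetary', 'policy'), ...]  (illustrative only)
```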
conclusions • To provide high-quality digital library services, human intervention still appears to be desirable. • However, we need ways to monitor how well the humans are doing and to detect when they make bad decisions. • Forecastability can be one criterion. • Timeliness and usage can be others. • I will have to work further to develop better monitoring systems for editor behavior.
http://openlib.org/home/krichel Thank you for your attention!