a449f6ac9c468e1bc7eb612e198877fc.ppt
- Number of slides: 8
Harvesting e-publications in DK – a short status, January 2015
By Tue Hejlskov Larsen, netarchive.dk
E-books/E-Sound/SMS-books – E-publications
Today we do not know exactly how big the e-publication area is. E-publications (in PDF, MP3 or EPUB format), with or without ISBN/ISSN numbers, are published today:
o in parallel through different channels/publishers
o many of them through the biggest Danish e-pub publisher, Publizon.dk
o directly on the internet, via the author's own home page or through one of the many very small e-publishers with 10 or 2-400 e-books, like http://gopubli.sh/
o directly to webshop channels, e.g. saxo.com in DK, or through international sales channels like amazon.com or other web domains located abroad
Currently active pilot projects with publishers
o Museum Tusculanum (about 700 titles)
o Publizon.dk (I guess about 75 % of the "normal" commercial e-books/e-sound-books; in numbers, about 20,000 e-books and 6,000 e-sound-books)
o Smspress.dk (about 100)
Next step
o OAI-PMH harvesting with NAS (NetarchiveSuite) of all research libraries and some public institutions, using the NAS Heritrix OAI extractor module. Aalborg University (aau.dk) has been successfully OAI-harvested with some added filters; about 12,000 PDFs were collected, almost the same number as the Danish National Research Database has information about.
o One or two commercial webshops
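The OAI-PMH step can be illustrated with a minimal sketch. It is not the NAS Heritrix OAI extractor module, only an illustration of the protocol loop such a module has to perform: request ListRecords, collect PDF links from the Dublin Core dc:identifier fields, and follow the resumptionToken until none is returned. The endpoint URL, the sample response and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def next_request_url(base_url, token=None, metadata_prefix="oai_dc"):
    """Build the next ListRecords URL.

    Per the OAI-PMH protocol, a resumptionToken is an exclusive argument:
    when it is present, metadataPrefix must not be sent again.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (pdf_urls, resumption_token) for one ListRecords response page."""
    root = ET.fromstring(xml_text)
    pdf_urls = [
        el.text.strip()
        for el in root.iter(DC + "identifier")
        if el.text and el.text.strip().lower().endswith(".pdf")
    ]
    token_el = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
    has_token = token_el is not None and token_el.text and token_el.text.strip()
    return pdf_urls, (token_el.text.strip() if has_token else None)


# Hypothetical response page, used only to exercise the parser offline.
SAMPLE_PAGE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:identifier>http://repository.example.org/files/1/report.pdf</dc:identifier>
          <dc:identifier>http://repository.example.org/record/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

pdf_urls, token = parse_list_records(SAMPLE_PAGE)
print(pdf_urls, token)
print(next_request_url("http://repository.example.org/oai", token))
```

A real harvest would fetch each URL, parse the page, and repeat until the token is None. The "added filters" mentioned for aau.dk would sit where the sketch simply tests for a ".pdf" suffix.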
Technical solutions 1
o Focused NAS harvesting of universities, regions, hospitals, city governments and other public institutions, e.g. Statstidende.dk. For example, by harvesting aau.dk we found about 166,000 PDF files: the so-called "gray/dark e-publication area", i.e. teaching materials, brochures and instructions mixed up with published journal articles, e-books and a lot of duplicates. Metadata and files are in the same harvest and are stored in the netarchive.
o SMS-books: metadata and SMS-books are fetched using the smspress.dk API in ONIX format, with some new add-on extensions for SMS-books. netarchive.dk paid for the software development at Smspress. Metadata and SMS-books are stored outside the netarchive.
o Museum Tusculanum: OAI-PMH harvesting using the NAS OAI extractor module (metadata and PDF/EPUB files are included in the same harvest and stored in the netarchive). netarchive.dk paid for the software development at Museum Tusculanum.
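Because the focused harvest of aau.dk returned a lot of duplicates, one plausible first step in sorting the "gray/dark" material is to group files by content checksum. The sketch below is an illustration only and not part of NAS; the function name, the sample file names and the choice of SHA-256 are assumptions.

```python
import hashlib
from collections import defaultdict


def group_duplicates(documents):
    """Group document names by the SHA-256 digest of their bytes.

    `documents` is an iterable of (name, content_bytes) pairs. Only groups
    with more than one member (byte-identical duplicates) are returned.
    """
    by_digest = defaultdict(list)
    for name, content in documents:
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [names for names in by_digest.values() if len(names) > 1]


# Hypothetical harvested files, reduced to tiny byte strings.
sample = [
    ("thesis_v1.pdf", b"%PDF-1.4 same bytes"),
    ("copy_of_thesis.pdf", b"%PDF-1.4 same bytes"),
    ("brochure.pdf", b"%PDF-1.4 other bytes"),
]
print(group_duplicates(sample))
```

This only catches byte-identical copies. The same article re-exported with a different cover page would need a content-based comparison instead.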
Technical solutions 2
o Publizon:
a) Metadata about e-books and e-sound-books is extracted from the Publizon API and stored outside the netarchive.
b) E-book files are harvested from ftp://ftp.pubhub.dk using a NAS FTP order.xml and stored in the netarchive:
    <!-- Heritrix processor for archiving FTP data behind a password (FR 1896).
         The username/password values are for ftp.pubhub.dk -->
    <newObject name="FTP" class="org.archive.crawler.fetcher.FetchFTP">
      <boolean name="enabled">true</boolean>
      <newObject name="FTP#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
        <map name="rules">
        </map>
      </newObject>
      <string name="username">XXXXXXX</string>
      <string name="password">XXXXXXX</string>
      <boolean name="extract-from-dirs">true</boolean>
      <boolean name="extract_parent">false</boolean>
      <long name="max-length-bytes">0</long>
      <integer name="fetch-bandwidth">0</integer>
      <integer name="timeout-seconds">1200</integer>
    </newObject>
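Since a fragment like the FetchFTP processor above is usually pasted by hand into an order.xml template, a quick well-formedness check can catch quoting slips (for instance a typographic quote in place of a straight one) before a harvest job is started. The following is a minimal sketch, not a NAS or Heritrix tool; the helper name is hypothetical and the fragment is shortened for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the processor fragment; credentials stay masked.
FETCH_FTP_FRAGMENT = """
<newObject name="FTP" class="org.archive.crawler.fetcher.FetchFTP">
  <boolean name="enabled">true</boolean>
  <string name="username">XXXXXXX</string>
  <string name="password">XXXXXXX</string>
  <integer name="timeout-seconds">1200</integer>
</newObject>
"""


def read_settings(fragment):
    """Parse the fragment and return {setting name: text value}.

    ET.fromstring raises xml.etree.ElementTree.ParseError when the
    fragment is not well-formed XML.
    """
    root = ET.fromstring(fragment)
    simple_types = {"boolean", "string", "integer", "long"}
    return {
        child.get("name"): (child.text or "").strip()
        for child in root
        if child.tag in simple_types
    }


settings = read_settings(FETCH_FTP_FRAGMENT)
print(settings["enabled"], settings["timeout-seconds"])
```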
Technical solutions 3
o Publizon (continued):
c) E-sound files are harvested from ftp://ftp.pubhub.dk using wget and stored outside the netarchive. Here is the wget command (the patterns are quoted so that the local shell does not expand them):
    wget -m -X '/*/Splitted/' -A '*.mp3' ftp://XXXX:XXXXXX@ftp.pubhub.dk
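The effect of the two wget filters can be sketched in a few lines: -A keeps only files whose name matches the accept pattern, and -X skips the listed directories. The function below is an approximation for illustration, not wget's actual matching code. It assumes that a directory is excluded when it, or one of its parent directories, matches the pattern component by component. The remote paths are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename, dirname


def would_keep(remote_path, accept="*.mp3", exclude_dir="/*/Splitted/"):
    """Approximate the -A / -X filters for one remote file path."""
    # -A: the file name must match the accept pattern (case-sensitive).
    if not fnmatchcase(basename(remote_path), accept):
        return False
    # -X: compare the leading directory components with the pattern.
    dir_parts = dirname(remote_path).strip("/").split("/")
    pattern_parts = exclude_dir.strip("/").split("/")
    if len(dir_parts) >= len(pattern_parts) and all(
        fnmatchcase(part, pat) for part, pat in zip(dir_parts, pattern_parts)
    ):
        return False
    return True


for path in (
    "/title-1/book.mp3",
    "/title-1/Splitted/part01.mp3",
    "/title-1/cover.jpg",
):
    print(path, would_keep(path))
```

Under these assumptions only the complete MP3 file of each title is mirrored, while the pre-split parts and non-audio files are skipped.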
And not to forget: the growing number of standalone deliveries
o We get a growing number of emails with links to e-publications or attached files together with some information.
  - The links are mostly harvested and stored in the netarchive.
  - The attached publications and metadata are stored outside the netarchive (about 300-400 folders).