2.) Kristine Hanna, Director of Web Archiving Services for the Internet Archive, spoke second about their web harvester, Heritrix. Kristine first introduced the Internet Archive as a nonprofit founded in 1996 and dedicated to providing widespread and permanent access to digital information. It is built on open source software and open source principles, and it continually works to provide new and better access for the end user. Heritrix, developed by the Internet Archive in 2003, is an open source web harvester used by libraries and archives around the world to archive born-digital content. The Internet Archive works with partners on Heritrix and improves its services through these partnerships.
They also do community web archiving, collecting material about a specific disaster (e.g. Hurricane Katrina, tsunamis) or event so that the information created is permanently preserved for access. The Library of Congress has been very supportive of topic and event crawls (e.g. national elections, Darfur, .gov), and there have been a number of crawls through other national libraries (e.g. France, Denmark). They have taken a global snapshot of the web with “Around the world in 2 Billion Pages” and have begun to crawl YouTube, starting with the most popular content on the main page, with more thematic crawls planned for the future. They also worked on a project in which students at several selected high schools chose the sites to be archived.
Archive-It automates Heritrix within a web application for institutions with more limited resources, giving them more control over the frequency and scope of crawls. It is more selective, and partners can choose to archive anywhere from one site to as many as they want. These services are changing the definitions of what records are and who is responsible for them. They try to capture “it all” and narrow the search from there. They are interested in LOCKSS and have a terabyte limit.
Future projects include one called Crisis, Tragedy, and Prevention, which archives sites associated with events like Virginia Tech, and a K-12 program similar to their high school program. One problem they face is that crawlers cannot necessarily capture dynamic features such as drop-down menus. Another issue was that they eventually had to allow access to some material, even public information, to be restricted because it is out of date (for example, health websites). They do place banners on their sites to notify users that the material is archival, but this doesn’t always prevent users from relying on the information.
Kristine’s PowerPoint for this presentation/discussion is available through the “Exchange Session Schedule w/ Presentation Descriptions” page at:
http://www.bpexchange.org/2008/presentations_chron.php
Facilitator - Emiley Jensen
