Can’t Find Your 404?

March 13, 2009
6:30 pm to 7:30 pm

Missing web pages (pages that return the “404 Page Not Found” error) are part of the browsing experience. So too are pages whose owners failed to renew their domain and whose old URIs now have unexpected content. As the web grows and users multiply, the missing page problem becomes more common. Michael L. Nelson, an associate professor of computer science at Old Dominion University, is working to solve this problem and offers his thoughts during Frito Friday on March 13.

Users who encounter a missing or unexpected page may turn to search engines to find either the same page at a new location or a similar, “good enough” page that satisfies their information needs, but this can be laborious. To address this need, Michael L. Nelson and researchers at Old Dominion University are developing a semi-automated framework that helps users first discover the topic of the missing page, and then locate the same or a similar page at a new URI.

He is investigating a number of techniques to discover the “aboutness” of an unknown web page. If the page is in the Internet Archive’s Wayback Machine or in a search engine cache, the user may be satisfied with the old copy. If an old copy is insufficient, either the page’s title or a generated lexical signature can serve as a query to a search engine to find the resource. (A lexical signature is a 5-7 word “abstract” of a document that is suitable for use as a search engine query.) Titles and lexical signatures are comparably reliable, with both achieving over 60% success. The combination of titles and lexical signatures yields at least 75% success. Michael has also investigated forming lexical signatures from link neighborhoods and from del.icio.us tags, but at this point neither method performs well.
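To make the idea of a lexical signature concrete, here is a minimal Python sketch. It assumes the signature is formed by ranking a page's terms by TF-IDF, which is one common way to build such signatures; the announcement does not say which weighting the ODU framework uses. The function names, the stop-word list, and the small reference corpus are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "at", "by", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def lexical_signature(document, corpus, size=5):
    """Return the `size` terms of `document` with the highest TF-IDF score.

    `corpus` is a list of other documents, used only to estimate how
    common each term is (its document frequency).
    """
    term_counts = Counter(tokenize(document))
    corpus_tokens = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus) + 1  # the document itself counts as one

    def score(term):
        # Document frequency includes the document being summarized.
        doc_freq = 1 + sum(term in tokens for tokens in corpus_tokens)
        return term_counts[term] * math.log(n_docs / doc_freq)

    # Highest score first; ties are broken alphabetically so results are stable.
    ranked = sorted(term_counts, key=lambda t: (-score(t), t))
    return ranked[:size]
```

Joining the returned terms with spaces, for example `" ".join(lexical_signature(page_text, corpus))`, gives a string that could be submitted as a search engine query. Terms that appear in every reference document score zero and sink to the bottom, which is why frequent but uninformative words tend to stay out of the signature.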

Michael is an associate professor of computer science at Old Dominion University. Prior to joining ODU, he worked at NASA Langley Research Center from 1991-2002. He is a co-editor of the OAI-PMH and OAI-ORE specifications and is a 2007 recipient of an NSF CAREER award. He has developed many digital libraries, including the NASA Technical Report Server. His research interests include repository-object interaction and alternative approaches to digital preservation. More information about Dr. Nelson can be found here.