Indexing of the Web is done by Microsoft, Google, Yahoo
- Reject low-value automated content
- Ignore Web-accessible data
- No access to restricted content
Large search engines have many data centers around the world (clusters of commodity PCs, servers)
- crawling
- indexing
- query processing
- snippet generation
- link-graph computations
- result caching
- insertion of advertising content
400 terabytes of info to crawl
Crawling mechanism = queue of URLs, beginning with "seed"
Issues in crawling:
- Speed - use of internal parallelism
- Politeness - not bombarding a single server with requests
- Excluded content - respect the site's robots.txt file
- Duplicate content - identify identical content with different URLs
- Continuous crawling - priority queue based on what is current and changing
- Spam rejection - manual AND automated analysis
Crawlers are highly complex and need to adapt
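The queue-of-URLs mechanism above can be sketched as follows. This is a toy, single-threaded version with names of my own choosing (`crawl`, `fetch`, `extract_links`); real crawlers add the parallelism, politeness delays, robots.txt checks, and spam filtering listed above.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Toy crawler: breadth-first walk over a URL queue, starting from seeds.

    `fetch(url)` returns page text (or None on failure); `extract_links(text)`
    returns the URLs found in that text. Both are supplied by the caller.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)      # duplicate-URL check
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        text = fetch(url)
        if text is None:       # fetch failed or disallowed
            continue
        pages[url] = text
        for link in extract_links(text):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

For example, with a tiny in-memory "web" where each page's text is just a space-separated list of outgoing links, `crawl(["a"], web.get, str.split)` discovers every page reachable from the seed.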
Web Search Engines: Part 2
Indexing algorithms use inverted files (concatenation of posting lists for each distinct term)
-Scans text for indexable terms and assigns each a term number
-Inverts by sorting into term-number order
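The scan-then-invert idea can be illustrated with a minimal in-memory inverted file mapping each distinct term to a sorted posting list of document numbers (my own sketch; production indexers work in sorted batches on disk and add compression):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> sorted posting list of doc numbers from a list of texts."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():   # crude tokenizer
            postings[term].add(doc_id)
    # "Inversion": emit each term's postings in document-number order
    return {term: sorted(ids) for term, ids in postings.items()}
```

Here `build_inverted_index(["the cat", "the dog"])` yields `{"the": [0, 1], "cat": [0], "dog": [1]}`.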
Issues with indexing:
- Scaling up - to be efficient
- Term lookup - all languages plus new terms
- Compression - takes less storage
- Phrases - precompute posting lists or create sublists
- Anchor text - link text provides info on destination
- Link popularity score - derived from frequency of incoming links
- Query-independent score - based on link popularity, URL brevity, spam score, frequency of clicks
Average query length = 2.3 words
-Need more than a simple query processor
Ways to speed things up:
- Skipping - irrelevant postings
- Early termination - once postings left are of little value
- Clever assignment of document numbers - decreasing query-independent score
- Caching - reduces cost of answering queries
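The core operation these speedups accelerate is intersecting posting lists for an AND query. A plain merge-style intersection looks like this (a sketch under my own naming; real engines add skip pointers to jump over runs of irrelevant postings, and can terminate early when documents are numbered by decreasing query-independent score):

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists (the heart of an AND query)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # skip pointers would let us leap ahead here
        else:
            j += 1
    return out
```

For example, `intersect([1, 3, 5, 7], [2, 3, 7, 9])` returns `[3, 7]`.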
These two articles provided an interesting view of how search engines work. I didn't realize that they actually index the pages that they crawl, but it makes sense. It's funny to think about the fact that computer scientists used to think this wouldn't be possible, back when there was only a fraction of the information on the Internet that we have today. I can't believe the amount of information that a search engine has to deal with, and while they may not be perfect yet, I can see how these techniques and algorithms make things fast and efficient and fairly accurate.
White Paper: The Deep Web: Surfacing Hidden Value
Deep Web = buried too far down (on dynamically generated sites) for standard search engines to find it
-need to be static and linked to other sites to be found
Deep Web content = in searchable databases, only produces results in response to a search
-7,500 terabytes of info
BrightPlanet = issues dozens of direct queries simultaneously using multi-threaded technology
- Search engines: either author submits his/her site or engine "crawls" docs by moving between hyperlinks
- Google: crawls and indexes based on popularity of sites
-- If search engines depend on linkages, they'll never get to the deep Web
Factors in deep Web development:
- Database technology
- Commercialization through directories and e-commerce
- Dynamic serving of web pages
BrightPlanet = directed query engine, gets at deep Web
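The directed-query idea (one query fanned out to many searchable databases at once) can be sketched with a thread pool. This is only an illustration under assumed names (`directed_query`, and stand-in search functions in place of real form-based deep-Web databases), not BrightPlanet's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def directed_query(query, sources):
    """Send one query to many searchable databases in parallel, merge results.

    `sources` maps a source name to a search function standing in for a
    form-based deep-Web database.
    """
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        return {name: f.result() for name, f in futures.items()}
```

Each source is queried concurrently, so total latency is roughly that of the slowest database rather than the sum of all of them.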
Deep Web = 10x greater amount of content than rest of Web
Deep Web site qualifications: 43,348 URLs
Database types in deep Web:
- Topic databases
- Internal site
- Publications
- Shopping/auction
- Classifieds
- Portals
- Library
- Yellow and white pages
- Calculators
- Jobs
- Message or chat
- General search
Deep Web content:
- Deep Web docs are 27% smaller than surface Web docs
- Deep Web sites are much larger than surface Web sites
- Deep Web sites have about 50% more traffic than surface Web sites
- 97.4% of deep Web is publicly available
- Deep Web may be higher quality than surface Web
- Deep Web growing faster than surface Web
Needs to be a way to search info in deep Web = BrightPlanet?
This article was very interesting because I had previously heard a bit about the deep Web, but what I had read previously made the deep Web sound like a sinister place where most of the content was illegal and strange. Now I understand that the deep Web is mostly databases related to different organizations that are "buried" because there are no links that connect them to the rest of the Web, so conventional search engines cannot find them in the usual way. I hope that we keep developing our ability to access and search the deep Web because it sounds like there is a lot of information there that could be useful to people.
Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting
Open Archives Initiative Protocol for Metadata Harvesting = OAI-PMH
- Federates access to diverse e-print archives through metadata harvesting and aggregation
- Released in 2001, used by content management systems
- Mission: "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content"
- Uses XML, HTTP, and Dublin Core standards
- Data providers or repositories provide metadata
- Service providers or harvesters harvest the metadata
- Can provide access to invisible/deep Web
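Because OAI-PMH is plain HTTP plus XML, a harvesting request is just a URL with a `verb` parameter, and the response can be parsed with a standard XML library. A minimal sketch (the base URL `http://example.org/oai` is a hypothetical repository; `ListRecords` and the Dublin Core namespace are from the protocol itself):

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"   # Dublin Core element namespace

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords harvesting request (OAI-PMH uses plain HTTP GET)."""
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix})

def titles_from_response(xml_text):
    """Pull Dublin Core titles out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter(DC_NS + "title")]
```

So a harvester (service provider) only needs to issue `list_records_url("http://example.org/oai")`, fetch the XML, and extract the metadata fields it wants.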
Notable community- or domain-specific services:
Open Language Archive Community
Sheet Music Consortium
National Science Digital Library
Comprehensive, searchable registry of OAI repositories
-more informative, searchable, and complete than in the past
-machine processing option
Future work of OAI registry:
- Enhance descriptions of repositories for search
- Provide automated maintenance of registry
- Delegate creation/maintenance of collections
- Improve view of search results
Extensible Repository Resource Locators = ERRoLs, "cool URLs," lead to content and services relating to an OAI repository
-simple mechanism to access OAI data
Challenges for OAI community:
- Metadata variation
- Metadata formats
- OAI data provider implementation practices
- Communication issues
Future directions: best practices, static repository gateway, Mod_oai Project, OAI-rights, controlled vocabularies and OAI, SRW/U-to-OAI gateway to the ERRoL service
If I understood this article correctly, this seems like another project attempting to get to the useful data that is in the deep Web, just like the BrightPlanet engine mentioned in the last article. This sounds like an open collaboration, however, between the people who have the data and the people who want to access it or give access to others. This seems like a good strategy and will be useful to people who want to use the deep Web data that is currently unavailable. There's still a lot of this I don't understand, but I hope that I'm generally correct in the overall idea that this article is presenting.