Thursday, March 29, 2012

Week 12 Reading Notes

Web Search Engines: Part 1

Indexing of the Web is done by Microsoft, Google, Yahoo
  • Reject low-value automated content (spam)
  • Ignore non-Web-accessible data
  • No access to restricted content
Large search engines have many data centers around the world (clusters of commodity PCs, servers)
  • crawling
  • indexing
  • query processing
  • snippet generation
  • link-graph computations
  • result caching
  • insertion of advertising content
400 terabytes of info to crawl

Crawling mechanism = queue of URLs, beginning with "seed"

Issues in crawling:
  1. Speed - use of internal parallelism
  2. Politeness - not bombarding any one server with requests
  3. Excluded content - communicate with robots.txt file
  4. Duplicate content - identify identical content with different URLs
  5. Continuous crawling - priority queue based on what is current and changing
  6. Spam rejection - manual AND automated analysis
Crawlers are highly complex and need to adapt
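
The queue-based mechanism above can be sketched in a few lines of Python. The toy PAGES dictionary stands in for real HTTP fetching, and a production crawler would also handle robots.txt, politeness delays, and duplicate detection; only the frontier/visited-set logic is shown:

```python
from collections import deque

# Toy "web": URL -> (content, outgoing links). A real crawler would
# fetch these pages over HTTP; these URLs are invented for illustration.
PAGES = {
    "http://a.example/": ("home", ["http://a.example/x", "http://b.example/"]),
    "http://a.example/x": ("x", []),
    "http://b.example/": ("b-home", ["http://a.example/"]),
}

def crawl(seeds):
    queue = deque(seeds)   # the frontier, initialized with "seed" URLs
    seen = set(seeds)      # never enqueue the same URL twice
    order = []
    while queue:
        url = queue.popleft()
        if url not in PAGES:          # unreachable or missing page
            continue
        _content, links = PAGES[url]
        order.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Starting from the seed `http://a.example/`, the crawl visits pages in breadth-first order and never revisits a URL.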

Web Search Engines: Part 2

Indexing algorithms use inverted files (concatenation of posting lists for each distinct term)
-Scans text for indexable terms and assigns each distinct term a number
-Inverts by sorting the postings into term-number order
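
As a rough sketch (Python, not any engine's actual implementation), inverting a tiny collection looks like this: each document is scanned for distinct terms, and every term ends up with a posting list of the documents that contain it, in document order:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: list of strings; returns term -> list of doc ids containing it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):   # one posting per (term, doc)
            postings[term].append(doc_id)
    # posting lists are in doc-id order because documents are scanned in order
    return dict(postings)

index = build_inverted_index(["the cat sat", "the dog sat", "cat and dog"])
```

Here `index["cat"]` is `[0, 2]`: the term appears in documents 0 and 2.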

Issues with indexing:
  1. Scaling up - to be efficient
  2. Term lookup - all languages plus new terms
  3. Compression - takes less storage
  4. Phrases - precompute posting lists or create sublists
  5. Anchor text - link text provides info on destination
  6. Link popularity score - derived from frequency of incoming links
  7. Query-independent score - based on link popularity, URL brevity, spam score, frequency of clicks
Average query length = 2.3 words
-Need more than a simple query processor
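
Real query-independent scores (e.g., PageRank) are far more elaborate, but the simplest form of link popularity — counting a page's incoming links — can be sketched as:

```python
def link_popularity(link_graph):
    """link_graph: {page: [pages it links to]} -> {page: incoming-link count}."""
    counts = {page: 0 for page in link_graph}
    for src, targets in link_graph.items():
        for dst in targets:
            counts[dst] = counts.get(dst, 0) + 1
    return counts

# Invented three-page graph: "c" is linked to by both "a" and "b"
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = link_popularity(graph)
```

Page "c" scores highest (2 incoming links), so a query-independent ranking would favor it.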

Ways to speed things up:
  • Skipping - jump over irrelevant postings
  • Early termination - stop once the remaining postings are of little value
  • Clever assignment of document numbers - in order of decreasing query-independent score
  • Caching - reduces the cost of answering repeated queries
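
Skipping can be illustrated with a sketch that intersects two sorted posting lists, binary-searching ahead in the longer list instead of scanning every posting (a simplification of the skip-pointer structures real engines use):

```python
from bisect import bisect_left

def intersect_with_skipping(short, long):
    """Intersect two sorted posting lists, skipping ahead in the longer one."""
    result = []
    pos = 0
    for doc in short:
        # "skip" past irrelevant postings rather than scanning one by one
        pos = bisect_left(long, doc, pos)
        if pos < len(long) and long[pos] == doc:
            result.append(doc)
    return result

# A rare term's short list against a common term's long list
hits = intersect_with_skipping([3, 9, 40], list(range(0, 100, 3)))
```

Only documents 3 and 9 match; document 40 is not in the longer list, and most of that list is never examined.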
These two articles provided an interesting view of how search engines work. I didn't realize that they actually index the pages that they crawl, but it makes sense. It's funny to think about the fact that computer scientists used to think this wouldn't be possible, back when there was only a fraction of the information on the Internet that we have today. I can't believe the amount of information that a search engine has to deal with, and while they may not be perfect yet, I can see how these techniques and algorithms make things fast and efficient and fairly accurate.


White Paper: The Deep Web: Surfacing Hidden Value

Deep Web = buried too far down (on dynamically generated sites) for standard search engines to find it
-need to be static and linked to other sites to be found

Deep Web content = in searchable databases, only produces results in response to a search
-7,500 terabytes of info

BrightPlanet = makes dozens of direct queries simultaneously with multi-thread technology
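
The fan-out of a directed query engine can be sketched with Python threads. The SOURCES dictionary here is an invented stand-in for real searchable deep-Web databases (real queries would be HTTP requests to each one):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for searchable databases; the names and results are invented.
SOURCES = {
    "patents_db": lambda q: [f"{q}: patent-1"],
    "journals_db": lambda q: [f"{q}: article-7"],
    "stats_db": lambda q: [],
}

def directed_query(query):
    """Issue the same query to every source in parallel and collect results."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in SOURCES.items()}
        return {name: f.result() for name, f in futures.items()}

results = directed_query("solar")
```

All sources are queried simultaneously, so total latency is roughly that of the slowest source rather than the sum of all of them.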

  • Search engines: either author submits his/her site or engine "crawls" docs by moving between hyperlinks
  • Google: crawls and indexes based on popularity of sites
  • If search engines depend on linkages, they'll never get to the deep Web
Factors in deep Web development:
  1. Database technology
  2. Commercialization through directories and e-commerce
  3. Dynamic serving of web pages
BrightPlanet = directed query engine, gets at deep Web

Deep Web = 10x greater amount of content than rest of Web

Deep Web site qualification: 43,348 candidate URLs analyzed

Database types in deep Web:
  1. Topic databases
  2. Internal site
  3. Publications
  4. Shopping/auction
  5. Classifieds
  6. Portals
  7. Library
  8. Yellow and white pages
  9. Calculators
  10. Jobs
  11. Message or chat
  12. General search
Deep Web content:
  • Deep Web docs are 27% smaller than surface Web docs
  • Deep Web sites are much larger than surface Web sites
  • Deep Web sites have about 50% more traffic than surface Web sites
  • 97.4% of deep Web is publicly available
  • Deep Web may be higher quality than surface Web
  • Deep Web growing faster than surface Web
There needs to be a way to search info in the deep Web = BrightPlanet?

This article was very interesting because I had previously heard a bit about the deep Web, but what I had read previously made the deep Web sound like a sinister place where most of the content was illegal and strange. Now I understand that the deep Web is mostly databases related to different organizations that are "buried" because there are no links that connect them to the rest of the Web, so conventional search engines cannot find them in the usual way. I hope that we keep developing our ability to access and search the deep Web because it sounds like there is a lot of information there that could be useful to people.


Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting

Open Archives Initiative Protocol for Metadata Harvesting = OAI-PMH
  • Federates access to diverse e-print archives through metadata harvesting and aggregation
  • Released in 2001, used by content management systems
  • Mission: "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content"
  • Uses XML, HTTP, and Dublin Core standards
  • Data providers or repositories provide metadata
  • Service providers or harvesters harvest the metadata
  • Can provide access to invisible/deep Web
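
OAI-PMH requests are ordinary HTTP GETs carrying a verb parameter (Identify, ListRecords, GetRecord, etc.). A minimal sketch of how a harvester builds such a request — the repository URL is hypothetical, and no network call is made here:

```python
from urllib.parse import urlencode

def oai_request(base_url, verb, **kwargs):
    """Build an OAI-PMH request URL; the protocol is plain HTTP GET."""
    params = {"verb": verb, **kwargs}
    return base_url + "?" + urlencode(params)

# A harvester asking a (hypothetical) repository for Dublin Core records:
url = oai_request("http://repo.example/oai", "ListRecords", metadataPrefix="oai_dc")
```

A real harvester would fetch this URL, parse the XML response, and follow resumption tokens for large result sets.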
Notable community- or domain-specific services:
Open Language Archive Community
Sheet Music Consortium
National Science Digital Library

Comprehensive, searchable registry of OAI repositories
-more informative, searchable, and complete than in the past
-machine processing option

Future work of OAI registry:
  • Enhance descriptions of repositories for search
  • Provide automated maintenance of registry
  • Delegate creation/maintenance of collections
  • Improve view of search results
Extensible Repository Resource Locators = ERRoLs, "cool URLs," lead to content and services relating to an OAI repository
-simple mechanism to access OAI data

Challenges for OAI community:
  1. Metadata variation
  2. Metadata formats
  3. OAI data provider implementation practices
  4. Communication issues
Future directions: best practices, static repository gateway, Mod_oai Project, OAI-rights, controlled vocabularies and OAI, SRW/U-to-OAI gateway to the ERRoL service

If I understood this article correctly, this seems like another project attempting to get to the useful data that is in the deep Web, just like the BrightPlanet mentioned in the last article. This sounds like it is an open collaboration, however, between the people who have the data and the people who want to access it or give access to others. This seems like a good strategy and will be useful to people who want to use the deep Web data that is currently unavailable. There's still a lot of this I don't understand, but I hope that I'm generally correct in the overall idea that this article is presenting.

Lab 11



Google Scholar query:
~virtual reference "digital libraries"
[also specified articles published between 2008 and 2012]

Google Scholar screenshot:



Web of Knowledge query:
Topic=(virtual reference) OR Topic=(digital libraries) AND Year Published=(2008-2012)

Web of Knowledge screenshot:

Wednesday, March 28, 2012

Week 11 Reading Notes

Digital Libraries: Challenges and Influential Work

Current info environment includes: "full-text repositories maintained by commercial and professional society publishers; preprint servers and Open Archive Initiative (OAI) provider sites; specialized Abstracting and Indexing (A & I) services; publisher and vendor vertical portals; local, regional, and national online catalogs; Web search and metasearch engines; local e-resource registries and digital content databases; campus institutional repository systems; and learning management systems"

Need more than access for digital library work - need federated search

History of digital libraries:
  • Digital Libraries Initiative (DLI-1), 1994
  • DLI-2, 1998
  • University-led projects
  • Development strongly influenced by evolution of Internet
  • Search interoperability and federated searching
Federation solutions: aggregated search or broadcast searching against remote resources
-Google, Google Scholar, OAI = aggregated/harvested
-Ex Libris Metalib, Endeavor Encompass, and WebFeat = broadcast search
-can be complementary

Metadata searching vs. full-text searching?

This article was a good, brief introduction to the issues surrounding federated search and why such a mechanism is necessary. I can't imagine how complicated it is to try to design a search that will encompass all of the different resources available online. It seems like it would be impossible to design something that would work with all the different systems that exist, but I also see the need for it in order to provide the best possible digital library services.


Dewey Meet Turing: Librarians, Computer Scientists, and the Digital Libraries Initiative

DLI led to development of Google, as well as CareMedia and many others

Computer scientists: expected their research to impact daily lives
Librarians: expected grant money and impact on scholarship

Expected to be collaboration between computer scientists and librarians, but World Wide Web got in the way
-variety of media, larger collection, different access methods
-blurred consumers/producers of info
-split up collections over the world and under different owners

Computer scientists embraced changes Web created
Librarians felt threat to their traditional practice

Problems for librarians:
  • Loss of cohesive "collections"
  • High prices of journal publishers
  • Copyright issues
  • Dead links
Librarians expected more collection development
Computer scientists feel librarians too nitpicky about metadata
-However, core function of librarianship remains
-Notion of collections is reemerging (hubs)
-Opportunities for direct connections between librarians and scholarly authors

This article provided an interesting account of the tensions between librarians and computer scientists involved in the DLI. I can understand how these two professions planned to work together to create digital libraries, but that the Internet changed everything, as it has in so many areas. I can see how computer scientists and librarians have different perspectives and goals, but I also am glad that the author sees hope for the future of these professions working together and also of the practice of collection development.


Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age

Libraries taking more active role in promoting scholarship and scholarly communication

Supporting this strategy:
  • Lower online storage costs
  • Open archives metadata harvesting
  • Free, publicly accessible journal articles
MIT and DSpace institutional repository system

Institutional repository = set of services university offers for management and dissemination of digital materials created by institution and community members
-Preservation
-Organization
-Access/distribution

Contains:
  • Intellectual works by faculty and students
  • Documentation of activities of institution
  • Experimental and observational data
Scholarly publishing = specific example of scholarly communication

Authorship in digital medium
-traditional journal articles or new forms

Institutional repositories can help scholars with system administration activities and content curation
-problem with preservation

Traditional publishing can be supplemented with new datasets and analysis tools

Institutional repositories can:
  • enhance access
  • encourage new forms of scholarly communication
  • maintain stewardship of data
  • preserve supplemental info
  • curate records of institutional activity
Potential dangers:
  • Institutions could take control instead of scholars
  • Weighed down with policy
  • Lack of institutional commitment
  • Technical problems
Need infrastructure standards in: preservable formats, identifiers, and rights documentation and management

Future developments:
Consortial or cluster institutional repositories
Curatorial and policy control
Federating institutional repositories
Community or public repositories

I like the way that this article outlines the opportunities and responsibilities of an institutional repository. It seems to me that every institution such as a university should have such a repository in order to organize and preserve digital information that could be important in the future. It would be against an institution's mission to lose some of its vital records and/or intellectual work and have to reinvent the wheel all the time or have a limited knowledge of past activities. It will be interesting to see what happens in the future of institutional repositories and if the author of this article is correct.


Sunday, March 18, 2012

Week 9 Lab

URL: http://www.pitt.edu/~tgs11/lab9.html

Week 10 Reading Notes

Introduction to XML (IBM)

XML = Extensible Markup Language, can create own tags, machine can read it

XML based on SGML
  1. Tags = text between brackets
  2. Elements = starting tag, ending tag, everything in between
  3. Attributes = name-value pair inside starting tag
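
These three building blocks — tags, elements, attributes — are easy to see with Python's standard-library parser (the <note> document here is just an invented example):

```python
import xml.etree.ElementTree as ET

doc = '<note priority="high"><to>Alice</to><body>Meet at noon</body></note>'
root = ET.fromstring(doc)

tag = root.tag                   # element name: "note"
priority = root.get("priority")  # attribute = name-value pair in the start tag
body = root.find("body").text    # text between <body>'s start and end tags
```

The `note` element contains everything from its start tag to its end tag, including the two child elements.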
XML can:
  • simplify data interchange
  • enable smart code
  • enable smart searches

3 kinds of XML documents:

  1. Invalid docs = don't follow XML syntax rules or the rules defined in a DTD/schema
  2. Valid docs = follows both XML and DTD rules
  3. Well-formed docs = follow XML syntax rules but don't have DTD rules
Need a single root element
Elements can't overlap
End tags required
Elements are case sensitive
Attributes must have quoted values
XML declarations
Also: comments, processing instructions, entities
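
Any conforming parser enforces these well-formedness rules. A quick Python check (standard library only) shows overlapping elements being rejected:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<a><b>text</b></a>")    # properly nested
bad = is_well_formed("<a><b>text</a></b>")   # overlapping elements: rejected
```

Note that well-formedness is purely syntactic; validity against a DTD or schema is a separate, stricter check.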

Use namespaces to specify tags

DTD = document type definition, specifies basic structure of XML doc
-which elements must appear, and in what order
-whether elements must contain text
-uses certain symbols to indicate how often elements may occur

DTD can:
  • define which attributes are required
  • define default values for attributes
  • list all of valid values for given attribute
XML schemas:
use XML syntax
support datatypes
are extensible
have more expressive power

Programming interfaces:
  1. Document Object Model
  2. Simple API for XML (SAX)
  3. JDOM
  4. Java API for XML Parsing (JAXP)
XML Standards = determined by the W3C
-XML schema: primer, doc structures, data types
-XSL, XSLT, XPath = formatting standards
-XLink and XPointer = linking and referencing standards

Web services: SOAP, WSDL, UDDI

This was a good overview of XML. I have also learned about XML in other classes, and I can see how it would be very useful and could lead to the goal of the semantic web. I can also see how it could be complicated, though, and that standardization still needs to be clarified. There also seems to be a need to get all organizations from all over the world to agree on these standards, and that can be a difficult compromise to reach.


A survey of XML standards: Part 1

Core XML technologies that are standards

XML
XML 1.0 (2nd ed.) = builds on Unicode
XML 1.1 = first revision
-Recommended intros/tutorials
-References

Catalogs
XML Catalogs = governed by RFC 2396: Uniform Resource Identifiers, RFC 2141: Uniform Resource Names
-entity
-entity catalog
-system identifiers
-URIs
-URNs
-public identifiers
OASIS Open Catalog
-Recommended intros/tutorials

XML Namespaces
Namespaces in XML 1.0
-XHTML
Namespaces in XML 1.1
-Resource Directory Description Language (RDDL)
-RDF
-TAG
-XLink
-Recommended intros/tutorials
-References

XML Base
XML Base
-Recommended intros/tutorials

XInclude
XML Inclusions (XInclude) 1.0
-Recommended intros/tutorials

XML Infoset
XML Information Set
-information items
-Recommended intros/tutorials

Canonical XML (c14n)
Canonical XML Version 1.0
-Exclusive XML Canonicalization Version 1.0

XPath
XML Path Language (XPath) 1.0
-XSLT
-W3C XML schema
-Recommended intros/tutorials

XPointer
XPointer Framework
-xpointer() scheme
-element() scheme
-xmlns() scheme
-FIXptr
-Recommended intros/tutorials

XLink
XML Linking Language (XLink) 1.0
-HLink
-simple links
-extended links
-linkbases
-Recommended intros/tutorials
-References

RELAX NG
RELAX NG
-XML schema
RELAX NG Compact Syntax
-Document Schema Definition Languages (DSDL)
-Recommended intros/tutorials
-References

W3C XML schema
XML Schema Part 1: Structures
XML Schema Part 2: Datatypes
-Recommended intros/tutorials
-References

Schematron
Schematron Assertion Language 1.5
-Recommended intros/tutorials
-References

Standards made by:
  • W3C
  • International Organization for Standardization (ISO)
  • Organization for the Advancement of Structured Information Standards (OASIS)
  • Internet Engineering Taskforce (IETF)
  • XML community
This article has a lot of information on XML standards, and I think it would be very useful if I needed to get more in-depth with XML and/or its standards. The tutorials and references seem very useful, but I would need to go back to the page to get the names and links of each one. This would be a good reference for the future.


XML Schema Tutorial (w3schools.com)

XML schema = describes structure of XML document
= is XML-based alternative to DTD
= also XML Schema Definition (XSD)

Need to know:
  • HTML/XHTML
  • XML and XML Namespaces
  • Basic understanding of DTD

XML Schema defines:

  1. elements that can appear in a document
  2. attributes that can appear in a document
  3. which elements are child elements
  4. order of child elements
  5. number of child elements
  6. whether an element is empty or can include text
  7. data types for elements and attributes
  8. default and fixed values for elements and attributes
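
As a conceptual illustration only — not a real XSD processor — a hand-rolled checker can enforce a few of these constraints (child element names, their order, and a crude data-type check). The SCHEMA table and element names below are invented:

```python
import xml.etree.ElementTree as ET

# Stand-in for a schema: required children, in order, each with a type check.
SCHEMA = [("to", str.isalpha), ("heading", str.isalpha), ("qty", str.isdigit)]

def validates(xml_text):
    """Check child names, order, count, and simple 'data types' against SCHEMA."""
    root = ET.fromstring(xml_text)
    children = list(root)
    if len(children) != len(SCHEMA):
        return False
    for child, (name, type_check) in zip(children, SCHEMA):
        if child.tag != name or not type_check(child.text or ""):
            return False
    return True

good = validates("<note><to>Alice</to><heading>Hello</heading><qty>3</qty></note>")
bad = validates("<note><heading>Hello</heading><to>Alice</to><qty>x</qty></note>")
```

A real XML Schema expresses the same kinds of rules declaratively, in XML syntax, and a validating parser applies them automatically.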

XML Schema is W3C recommendation

If data types supported, it is easier to:

  • describe allowable document content
  • validate the correctness of data
  • work with data from a database
  • define data facets (restrictions on data)
  • define data patterns (data formats)
  • convert data between different data types

XML schemas use XML syntax because: don't need to learn new language, can use XML editor and parser, can manipulate with XML DOM, can transform with XSLT

XML schemas secure data communication
-are extensible
-well-formed is not enough

Well-formed:
  • it must begin with the XML declaration
  • it must have one unique root element
  • start-tags must have matching end-tags
  • elements are case sensitive
  • all elements must be closed
  • all elements must be properly nested
  • all attribute values must be quoted
  • entities must be used for special characters

XML documents can have reference to DTD or XML Schema

In the tutorial's example, the note element is a complex type; the other elements are simple types

< "schema" > element is root of every XML schema
-may contain some attributes
-doc can reference XML schema
-can specify default namespace
-can use schemaLocation attribute
XSD simple elements, attributes, and restrictions/facets
-simple elements contain only text
XSD complex types = contains other elements/attributes
-empty elements have no content
-can contain only elements
-can contain text only (possibly with attributes)
-can be mixed text and other
-order, occurrence, and group indicators
-any or anyAttribute
-substitution

Data Types
  1. string
  2. date
  3. numeric
  4. misc.
  • Schema references
I really like these W3C tutorials, and this one gave me a lot of good information about XML schemas. I would be able to go back and get more deep information about the topic, but these notes will help me remember what is covered in the tutorial and some basic definitions. I think XML schemas could be very useful, and I don't totally understand them still, but it would be a good topic of investigation for the future.

Monday, March 5, 2012

Week 9 Reading Notes

HTML5 Tutorial - w3schools

HTML5 = new standard for HTML, cooperation between the World Wide Web Consortium (W3C) and the Web Hypertext Application Technology Working Group (WHATWG)

Rules for HTML5:
  • New features should be based on HTML, CSS, DOM, and JavaScript
  • Reduce the need for external plugins (like Flash)
  • Better error handling
  • More markup to replace scripting
  • HTML5 should be device independent
  • The development process should be visible to the public
Basic document skeleton: <!DOCTYPE html>, <html>, <head>, <body>

New features:

  • The canvas element for 2D drawing
  • The video and audio elements for media playback (also source, embed, track)
  • Support for local storage
  • New content-specific elements, like article, footer, header, nav, section (and more)
  • New form controls, like calendar, date, time, email, url, search (datalist, keygen, output)
  • Removed: acronym, applet, basefont, big, center, dir, font, frame, frameset, noframes, strike, tt, u
Defines a new element which specifies a standard way to embed a video/movie on a web page: the video element
-Supported by certain browsers
-Also has methods, properties, and events

Defines a new element which specifies a standard way to embed an audio file on a web page: the audio element
-Supported by certain browsers
-Control attribute adds audio controls, like play, pause, and volume

Drag and drop is part of the standard, and any element can be draggable

Canvas element used to draw graphics, on the fly, on web page (usually JavaScript)
-only a container for graphics
-several methods for drawing paths, boxes, circles, characters, and adding images

SVG=
  • Stands for Scalable Vector Graphics
  • Used to define vector-based graphics for the Web
  • Defines the graphics in XML format
  • Graphics do NOT lose any quality if they are zoomed or resized
  • Every element and every attribute in SVG files can be animated
  • W3C recommendation
SVG advantages:
  • Images can be created and edited with any text editor
  • Images can be searched, indexed, scripted, and compressed
  • Images are scalable
  • Images can be printed with high quality at any resolution
  • Images are zoomable (and the image can be zoomed without degradation)
Geolocation API is used to get the geographical position of a user
-position not available unless user approves it
-getCurrentPosition

Web pages can store data locally within the user's browser
-stored in key/value pairs, and a web page can only access data stored by itself

Application cache = a web application is cached, and accessible without an internet connection
Advantages:
  1. Offline browsing - users can use the application when they're offline
  2. Speed - cached resources load faster
  3. Reduced server load - the browser will only download updated/changed resources from the server
Web Worker = a script that runs in the background, independently of other scripts, so the page stays responsive (without it, the page is unresponsive until the script finishes)
Server-sent Event = web page automatically gets updates from a server (onopen, onmessage, onerror)

HTML5 Forms, Reference, Tags

It's very interesting to see the updates to HTML in HTML5. It really seems like this new standard takes into account the way that the internet works today and will create more flexibility for HTML developers in the present and future. It does seem overwhelming since I don't feel like I have a grasp of the previous standard yet, but hopefully I can learn what I need to about HTML5.



HTML5 - Wikipedia page

HTML5 = language for structuring and presenting content, originally proposed by Opera Software
=Fifth revision, still in development in March 2012
=response to the observation that the HTML and XHTML in common use on the World Wide Web are a mixture of features introduced by various specifications

Many new features, like video, audio, canvas elements
Designed for multimedia graphical content

APIs and DOM are fundamental part
No longer based on SGML

New APIs:
  • canvas element for immediate mode 2D drawing. See Canvas 2D API Specification 1.0 specification
  • Timed media playback
  • Offline Web Applications
  • Document editing
  • Drag and drop
  • Cross document messaging
  • Browser history management
  • Mime Type and protocol handler registration
  • Microdata
  • Web Storage, a key-value pair storage framework that provides behaviour similar to Cookies but with larger storage capacity and improved API
XHTML5 = XML serialization of HTML5

HTML5 = flexible in handling incorrect syntax, new error handling

Differences between old and new:
  1. New parsing rules
  2. Inline SVG and MathML
  3. Many new elements
  4. New types of form controls
  5. New attributes
  6. Global attributes
  7. Deprecated elements dropped
This Wikipedia article gave a good background on what HTML5 is, although it did repeat some of the previous article from w3schools. Some of the terms are more complicated than I am able to understand, but the differences that exist between the previous standard and this new one are clear and fairly easy to understand. It's interesting to learn about the future of HTML as it's being developed.



XHTML - w3schools

XHTML=
  • stands for EXtensible HyperText Markup Language
  • almost identical to HTML 4.01
  • stricter and cleaner version of HTML
  • HTML defined as an XML application
  • W3C Recommendation of January 2000
  • supported by all major browsers
XML = markup language where documents must be marked up correctly and "well-formed"

Important differences from HTML:
  1. XHTML elements must be properly nested
  2. XHTML elements must always be closed
  3. XHTML elements must be in lowercase
  4. XHTML documents must have one root element
More syntax rules:
  • Attribute names must be in lower case
  • Attribute values must be quoted
  • Attribute minimization is forbidden
  • The XHTML DTD defines mandatory elements
<!DOCTYPE> is mandatory; not an XHTML tag, but an instruction to the web browser about what version of the markup language the page is written in
-then head and body

This article gave me a good introduction to XHTML, although now I feel a bit overwhelmed with all the different markup languages that we've covered in our readings. I know that they all have differences that set them apart from the others, but I think that for someone who doesn't have experience with them, they all start to look the same. I think that maybe if I started using them I would have a better understanding of the differences between them. As I understand it, XHTML is like a combination of XML and HTML.