Wednesday, May 28, 2008

Other side of the web world

I was searching the web for a book called "Nicole follows the sun". I had read this book some 10 years back. I don't remember the author, but I remember that it was part of a series, and I just wanted to know if the series is still being published. It was a kind of porn novel in which the lead character, Nicole, a free-spirited nymph, globe-trots and experiences sex with no strings attached.


I was searching the web to see if I could get a PDF or an online title from that series, since I had lost my copy of the book.

Finally I was able to find the titles in the series, but unfortunately neither the books nor much information about the author (Morgan St. Michel) was available on the web. I was a bit disappointed.


But this article is not about Nicole or Morgan St. Michel. While searching the web through Google (or should I say "googling"?), I started wondering about the results I was getting and the results I was NOT getting.


Normally we google for whatever information we want from the net. But where does Google get that information from? And if Google can get everything that is on the net, why do certain pages not show up while others do?


To illustrate this, let's follow a simple procedure: google your own full name and check the results. Certain pages or sites where you have registered might show up, whereas others may not.


Why?


Does that mean there is a web that normal search engines do not reach? How big is that web? Searching along these lines, I stumbled upon certain facts that I thought I could share.

The Web & the Net

Basically, the words net and web are used interchangeably. Though there is no difference between the two for the end user, technically speaking they ARE different.

To put it in layman's terms:
What we call the internet, or simply the net, is a collection of computers across the globe, interconnected by some connection method (the details are technical, so I am omitting that part).

What we call the World Wide Web, or simply the web or even www, is the collection of websites residing on those computers.

To be precise, a computer on the internet might have more than one website running on it. In that case you would be accessing the sites by two different names, while in the background both names connect to the same computer.
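
To make the "two names, one computer" idea concrete, here is a tiny sketch in Python. Everything in it is made up for illustration: the site names, the address 192.0.2.10 and the page texts are my own assumptions, and a real setup involves DNS servers and web server configuration that I am leaving out.

```python
# Hypothetical name table: two site names point at the same machine address.
NAME_TO_ADDRESS = {
    "www.example-blog.test": "192.0.2.10",
    "www.example-shop.test": "192.0.2.10",
}

# Hypothetical machine: it picks the website based on the name that was asked for.
SITES_ON_MACHINE = {
    "192.0.2.10": {
        "www.example-blog.test": "Blog home page",
        "www.example-shop.test": "Shop home page",
    }
}

def fetch(host_name):
    """Return (machine address, page served) for a site name."""
    address = NAME_TO_ADDRESS[host_name]
    return address, SITES_ON_MACHINE[address][host_name]

blog = fetch("www.example-blog.test")
shop = fetch("www.example-shop.test")
print(blog)
print(shop)
```

Both lookups land on the same address, yet each name gets its own website back.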

Surface Web & Deep Web

The web that we normally access is called the visible web or surface web. This is where search engines / web crawlers like Google retrieve their information from. Search engines construct a database of the Web by using programs called spiders or Web crawlers that begin with a list of known Web pages. The spider gets a copy of each page and indexes it, storing useful information that will let the page be quickly retrieved again later. Any hyperlinks to new pages are added to the list of pages to be crawled. Eventually all reachable pages are indexed. The collection of reachable pages defines the surface Web.

For various reasons, some pages cannot be reached by the spider. These 'invisible' pages are referred to as the deep Web.
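
The crawling procedure described above can be sketched in a few lines of Python. This is only a toy under my own assumptions: the "web" here is a small hand-written dictionary of page names (LINKS), not real pages, and a real spider would download and parse HTML instead.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["article"],
    "article": [],
    "orphan": ["home"],  # no other page links TO this one
}

def crawl(seed_pages):
    """Follow links from the seed pages and return every page reached."""
    index = set()
    to_visit = deque(seed_pages)
    while to_visit:
        page = to_visit.popleft()
        if page in index or page not in LINKS:
            continue  # already indexed, or a link to a page that does not exist
        index.add(page)
        to_visit.extend(LINKS[page])  # newly found links join the crawl list
    return index

surface = crawl(["home"])
deep = set(LINKS) - surface
print(sorted(surface))  # ['about', 'article', 'home', 'news']
print(sorted(deep))     # ['orphan']
```

The page called "orphan" exists, but since nothing links to it the spider never finds it. In this toy world, that page is the deep Web.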

The deep Web (also called Deepnet, the invisible Web, or the hidden Web) refers to World Wide Web content that is not part of the surface Web, that is, content not indexed by search engines. It is estimated that the deep Web is several orders of magnitude larger than the surface Web.

In 2000, it was estimated that the deep Web contained approximately 7,500 terabytes of data and 550 billion individual documents. Other estimates, based on extrapolations from a study done at University of California, Berkeley, show that the deep Web consists of about 91,000 terabytes. By contrast, the surface Web (which is easily reached by search engines) is only about 167 terabytes. The Library of Congress contains about 11 terabytes.
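
Just to get a feel for these numbers, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above (91,000, 167 and 11 terabytes); the variable names are mine, and the results are only as good as those estimates.

```python
import math

# Figures quoted above, in terabytes.
deep_web_tb = 91_000
surface_web_tb = 167
library_of_congress_tb = 11

deep_vs_surface = deep_web_tb / surface_web_tb
surface_vs_library = surface_web_tb / library_of_congress_tb

print(f"Deep Web is about {deep_vs_surface:.0f} times the surface Web")
print(f"That is about {math.log10(deep_vs_surface):.1f} orders of magnitude")
print(f"Surface Web is about {surface_vs_library:.0f} times the Library of Congress")
```

By these particular figures the deep Web comes out at a few hundred times the size of the surface Web.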

Deep Web resources may be classified into one or more of the following categories:

Dynamic content – dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
Unlinked content – pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
Private Web – sites that require registration and login (password-protected resources).
Contextual Web – pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
Limited access content – sites that limit access to their pages in a technical way, prohibiting search engines from browsing them and creating cached copies.
Scripted content – pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from Web servers via Flash or AJAX solutions.
Non-HTML/text content – textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
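
As one example of the "limited access content" category above, a site can publish a robots.txt file asking crawlers to stay out of certain folders. The sketch below uses Python's standard urllib.robotparser module to check such rules. The robots.txt text, the crawler name "ExampleBot" and the example.com URLs are assumptions I made up for the demonstration, and robots.txt is only one of several ways a site can limit access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # load the rules from text, no network needed

public_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
print(public_ok)   # True
print(private_ok)  # False
```

A well-behaved spider skips the second page, so it never makes it into the search index.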

Accessing the Deep Web

Since a large amount of useful data and information resides in the deep Web, search engines have begun exploring alternative methods to crawl the deep Web. Google’s Sitemap Protocol and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms allow Web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web.
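
To show what "advertising URLs" can look like, here is a small Python sketch that builds a minimal sitemap file listing pages a crawler might not otherwise find. The helper name build_sitemap and the example.com URLs are my own inventions, and real sitemaps can carry extra optional fields that I have left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal sitemap XML string listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # <loc> holds the page address
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages that no other page links to.
sitemap_xml = build_sitemap([
    "http://www.example.com/catalog?item=42",
    "http://www.example.com/archive/2008/05/",
])
print(sitemap_xml)
```

The web server publishes this file, and a search engine that reads it learns about the two pages without having to follow any link to them.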

Federated search by subject category or vertical is an alternative mechanism to crawling the deep Web. Traditional engines have difficulty crawling and indexing deep Web pages and their content, but deep Web search engines like CloserLookSearch, Science.gov and Northern Light create specialty engines by topic to search the deep Web. Because these engines are narrow in their data focus, they are built to access specified deep Web content by topic. These engines can search dynamic or password-protected databases that are otherwise closed to search engines.
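
The idea of federated search, sending one query to several topic-specific sources and merging the answers, can be sketched roughly as below. The two "databases" are just short Python lists I made up, and the function names are hypothetical. A real engine would query live databases through their own search forms.

```python
# Hypothetical topic databases standing in for specialty sources.
SOURCES = {
    "science": ["Deep sea vents study", "Web crawler survey"],
    "news": ["Local web festival", "Election results"],
}

def search_source(records, query):
    """Ask one source: return its records containing the query text."""
    return [record for record in records if query.lower() in record.lower()]

def federated_search(query, topics):
    """Send the same query to each chosen topic source and merge the hits."""
    results = []
    for topic in topics:
        for hit in search_source(SOURCES[topic], query):
            results.append((topic, hit))
    return results

hits = federated_search("web", ["science", "news"])
print(hits)
```

Each source is searched on the spot, at query time, so nothing has to be crawled and indexed in advance.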

Dark Net

The terms dark internet or dark address refer to any or all the unreachable network hosts on the Internet.

A Darknet is a set of network connections using protocols other than HTTP but still on the public Internet, established in a closed and secretive way between trusted parties only, usually for the purposes of peer-to-peer file sharing.

Though some people refer to dark internet hosts as "dark webs" or "the dark web", the terminology is quite incorrect, as a web refers to a collection of interconnected documents, while an internet refers to a collection of computers.
