How Many Pages Are There On The WWW?

cartoon of earth and a computer screen both filled with numbers symbolizing the number of pages on the world wide web.

 

More Pages Than You Can Count?

No one knows the exact number of pages freely available on the web. New, unique, publicly accessible pages (aka the public web) are created every second. Since there is no central counting house or even a standard way of creating web pages, we can only make an educated guess at the number of web pages on the Internet. Whatever the true figure, the amount of information on both the public web and the invisible web is enormous, so the careful researcher should always use more than a single search engine and should investigate the hidden resources of the invisible web. Several credible studies have tackled the counting problem.

Two Estimates

Cyveillance.com, a business intelligence gathering firm, attempted to count the pages on the web and published the results in a white paper called Sizing the Internet. Using their proprietary Net SapienT Technology to survey the extent and growth of the web, they estimated that there were 2 billion unique, publicly accessible pages on the Internet in July of 2000. They also found that 7.3 million unique new pages were going on the net each day. In the same paper, Cyveillance predicted that there would be 4 billion publicly available pages on the Internet by early 2001. Cyveillance emphasized that their technology was able to estimate growth, an improvement over just taking a static snapshot. Their results indicated that the Internet was growing dynamically and had yet to peak. Cyveillance's most recent statistics indicate that about 6 billion web pages are on the public web.
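To get a feel for the Cyveillance growth figures, here is a minimal Python sketch. It assumes the daily growth rate stays constant at 7.3 million pages, which is our simplification and not part of the Cyveillance model (their exact methods are not public). The function names are made up for illustration.

```python
# Figures quoted from the Cyveillance estimate above.
PAGES_JULY_2000 = 2_000_000_000   # unique public pages in July 2000
NEW_PAGES_PER_DAY = 7_300_000     # unique new pages added each day


def projected_pages(days_elapsed):
    """Linear projection: assumes the daily growth figure never changes."""
    return PAGES_JULY_2000 + NEW_PAGES_PER_DAY * days_elapsed


def days_to_reach(target_pages):
    """Days needed to reach target_pages at the constant daily rate."""
    return (target_pages - PAGES_JULY_2000) / NEW_PAGES_PER_DAY


if __name__ == "__main__":
    # Pages expected about six months (180 days) after July 2000.
    print(projected_pages(180))
    # Days needed to double from 2 billion to 4 billion pages.
    print(round(days_to_reach(4_000_000_000)))
```

At a constant rate, doubling from 2 billion to 4 billion pages takes roughly nine months. A forecast of 4 billion pages by early 2001 therefore seems to assume that the daily rate itself keeps rising.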

A more conservative estimate is offered by the OCLC (Online Computer Library Center, Office of Research) in a recent report called Trends in the Evolution of the Public Web. This report sums up the results of a survey that has been conducted annually since 1998. Because OCLC uses a different sampling methodology from Cyveillance.com, their numbers differ significantly. According to the results of the Web Characterization Project's most recent survey, the public web (those pages freely available to everyone) contained 3,080,000 Web sites, or 35 percent of the Web as a whole. Public sites accounted for approximately 1.4 billion Web pages. The average size of a public web site was 441 pages. (Statistics current as of June 2002.)
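The OCLC figures can be checked with simple arithmetic. The short Python sketch below multiplies the number of public sites by the average site size, and uses the 35 percent share to work out how many Web sites of all kinds that implies. The variable names are ours.

```python
# Figures quoted from the OCLC survey above (June 2002).
PUBLIC_SITES = 3_080_000      # public Web sites
AVG_PAGES_PER_SITE = 441      # average pages per public site
PUBLIC_SHARE = 0.35           # public sites as a share of all Web sites

# Total public pages: sites times average pages per site.
public_pages = PUBLIC_SITES * AVG_PAGES_PER_SITE

# Implied number of Web sites of every kind, public or not.
all_sites = PUBLIC_SITES / PUBLIC_SHARE

if __name__ == "__main__":
    print(f"{public_pages:,} public pages")
    print(f"{round(all_sites):,} Web sites in total")
```

The product comes to about 1.36 billion pages, which rounds to the 1.4 billion reported, and the 35 percent share implies roughly 8.8 million Web sites overall.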

It is also important to understand what we mean by the phrase 'publicly accessible web pages'. The OCLC attempted to count public web pages housed on a public website. They provide this definition: "A public Web site offers to all Web users free, unrestricted access to a significant portion of its content." Many pages on the Internet remain out of reach unless you are willing to pay.


 

Graphic image: Multiarmed woman getting information from multiple sources.

The Importance of Multiple Sources!

Google, the largest of the commercial search engines, currently claims about 3.3 billion indexed pages. By some estimates, this means that Google searches about half the available web pages. This reality makes the argument for using multiple search engines even stronger. Search engines index different parts of the web. True, there is significant crossover; however, each engine is finding new pages every second. To get the most available relevant information you should use several search engines. Just as you'd get three estimates for a car repair, you should use at least three different search engines. You'll be searching far more of the potential 6 billion pages of available information if you use Google, HotBot, and Teoma. Clearly, depending on a single source for information will limit your results.
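The idea that overlapping indexes still add up to wider coverage can be shown with a toy Python sketch. The three "engines" and their page numbers below are invented for illustration and do not describe any real search engine. Only the last calculation uses figures from the text (3.3 billion indexed pages out of a potential 6 billion).

```python
# Hypothetical toy indexes: each set holds the page IDs one engine has found.
indexes = {
    "engine_a": {1, 2, 3, 4, 5, 6},
    "engine_b": {4, 5, 6, 7, 8},
    "engine_c": {1, 6, 8, 9},
}
TOTAL_PAGES = 10  # size of the toy "web"


def coverage(pages):
    """Fraction of the toy web contained in a set of page IDs."""
    return len(pages) / TOTAL_PAGES


# Best coverage any single engine achieves on its own.
best_single = max(coverage(pages) for pages in indexes.values())

# Coverage when the three indexes are combined (set union removes overlap).
combined = coverage(set().union(*indexes.values()))

# Share of the web one engine reaches, using the figures from the text.
google_share = 3_300_000_000 / 6_000_000_000

if __name__ == "__main__":
    print(best_single, combined, round(google_share, 2))
```

In this toy case the best single engine reaches 60 percent of the pages, while the three together reach 90 percent, even though several pages appear in more than one index.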

 

 

Image: startled man with a bank check with numbers so large they run off the page.

Can you estimate the size of the Invisible Web?

Indeed, most pages of information are beyond the reach of popular search engines. These pages make up the 'invisible web'. Bright Planet.com estimates that the pages hidden from commercial search engines are 400 to 550 times as numerous as the pages those engines can reach, which would make the invisible web as large as 3.5 trillion pages. Other experts feel the Bright Planet estimates are inflated, but still maintain that the invisible web is from two to fifty times larger than the visible web. By a conservative estimate, we can guess that there are 50 to 100 billion pages of information on the invisible web.
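The multipliers above can be turned into page counts with a few lines of Python. This sketch assumes a visible web of 6 billion pages, the upper figure used elsewhere in this module. That baseline is our assumption, and the sources quoted may have started from different numbers.

```python
# Assumed size of the visible web (the module's upper estimate).
VISIBLE_PAGES = 6_000_000_000


def invisible_range(low_multiple, high_multiple, visible=VISIBLE_PAGES):
    """Return (low, high) invisible-web page counts for a pair of multipliers."""
    return visible * low_multiple, visible * high_multiple


# Bright Planet: 400 to 550 times the visible web.
bright_planet = invisible_range(400, 550)

# Other experts: 2 to 50 times the visible web.
other_experts = invisible_range(2, 50)

if __name__ == "__main__":
    print(bright_planet)
    print(other_experts)
```

Under this assumption the Bright Planet multipliers give 2.4 to 3.3 trillion pages, in the neighborhood of the 3.5 trillion quoted, and the more cautious multipliers give 12 to 300 billion pages, a range that contains the 50 to 100 billion guess.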

Invisible web pages are hidden in password-protected systems, intentionally excluded from robotic search engines, or dynamically generated by online databases at a user's request. Invisible web information may be highly relevant to your search needs, and can be found if you know where to look. (See the IMSA Micro Module: Invisible Web, for more information on this issue.)  

What Does it all Mean?

Clearly the Internet will continue to change. The more web pages of relevant information a researcher can access, the better. Competition between rival search engines, improvement in indexing and retrieval technology, and the ever-increasing number of pages available on the Internet mean that searching the net will continue to take both specialized knowledge and persistence. The best approach is to use multiple search engines to search the public web, and to become more aware of invisible web resources.


FAQs

cartoon of computer with the words infinite info on the screen

How many pages are there on the WWW?

While the exact number of web pages isn't known, an educated guess (as of 2003) would be between 3 and 6 billion pages. These are pages publicly accessible to search engines. Recently the Internet Archive, which is attempting to store archival copies of all Internet pages, announced that its Wayback Machine had 33 billion pages. Of these, 11 billion are now keyword searchable.

It is also estimated that about 7 million new pages go online each day.

How reliable are these webpage statistics?

The most current Internet statistics are closely guarded and expensive business information. Our estimates are based on information reflecting the Internet from 2000 to 2003. Additionally, there are no common methods for counting the number of pages on a website. Some count by automatically 'pinging' an IP address. If the address replies, it is considered valid. This kind of count does not distinguish between duplicate pages of information. The OCLC 2002 study estimated that there were 3,080,000 public websites, averaging 441 pages each. The OCLC harvested a representative sample of websites, and used this sample as the basis for making inferences about the Internet as a whole. The Cyveillance.com study considered 350 million links over a 4-month period in the year 2000. From this sample they built their model and made their predictions. Cyveillance claims their methodology improves reliability, but keeps the exact mechanics of their methods secret. We are left with a best guess scenario that lacks statistical reliability, but does give us a sense of scope when considering the free public web.
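The general idea of inferring a total from a sample can be sketched in Python as follows. This is only an illustration of sample-based extrapolation, not the OCLC's actual procedure. The five page counts are invented, and were chosen so that their average matches the 441 pages per site quoted above.

```python
from statistics import mean


def estimate_total_pages(sample_page_counts, estimated_site_count):
    """Extrapolate: average pages per sampled site times the estimated site count."""
    return round(mean(sample_page_counts) * estimated_site_count)


# Hypothetical page counts from five sampled sites (illustrative only).
sample = [12, 40, 3, 1500, 650]

if __name__ == "__main__":
    # 3,080,000 is the OCLC estimate of public websites quoted above.
    print(estimate_total_pages(sample, 3_080_000))
```

Notice how much the two large sites pull the average up. With a small or unrepresentative sample, a few unusually large sites can swing the final estimate, which is one reason these statistics should be read as rough guides.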


Can search engines reach all of the pages on the web?

The content of the Internet is constantly changing. Search engines continually crawl the web, indexing publicly available pages. New pages are added and old pages are deleted or updated every day. Just when a new or revised webpage will show up in a search engine varies from system to system. The time gap may be anywhere from a few minutes to months. While the commercial search engines update their indexes regularly to reflect the new pages they find, no search engine claims to visit all of the pages available on the web. (See "Search Engine Sizes" by Danny Sullivan, http://searchenginewatch.com/reports/article.php/2156481#current.)

Billions of Textual Documents Indexed (as of Sept 2, 2003)

Bar graph: Google 3.3 billion, AllTheWeb 3.2 billion, Inktomi 3 billion, Teoma 1.5 billion, AltaVista 1 billion.

KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.

Copyright SearchEngineWatch.com 2003

Many pages are also hidden from search engines. These pages, collectively named the 'hidden' or 'invisible' web, might be generated on demand (assembled by database query) or published on password-protected systems. Some pages are intentionally tagged for robotic exclusion by their authors. This means that page authors enter special HTML robot exclusion codes that tell the 'crawlers' of search engines to skip a page and leave it out of the search engine index. Finally, PDF and multimedia files are not indexed by all search engines.
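The HTML robot exclusion codes mentioned above are commonly written as a robots META tag in the head of a page. The Python sketch below shows how a well-behaved crawler might check for one before indexing a page. It is a simplified illustration using the standard library's html.parser module. The class name RobotsMetaParser and the function may_index are our own, and real crawlers apply further rules (such as robots.txt files) that are not shown here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def may_index(html):
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" not in parser.directives and "none" not in parser.directives


if __name__ == "__main__":
    hidden = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    public = "<html><head><title>Open page</title></head></html>"
    print(may_index(hidden), may_index(public))
```

A page carrying the noindex directive would be skipped by a crawler that honors the tag, which is how such a page ends up on the invisible web even though anyone with the address can read it.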

How many pages are beyond my reach when using one of the popular search engines?

Estimates of hidden web content vary widely. Bright Planet estimated the hidden web to be up to 550 times the size of the public web. A more conservative guess would be that 50 billion to 100 billion pages are on the hidden web. To learn more, see the IMSA Micro Module: The Invisible Web.

Why would anyone want to search all of the pages on the web?

Consider the importance of a comprehensive search if you are checking for plagiarism, verifying citations, or researching uncommon or unusual topics. The more comprehensive your sources of data, the better your research. The more pages of information you search, the more likely it is you will find crucial information about your topic. There are no guarantees, but when it comes to searching, the larger the database of relevant pages, the more likely it is that you'll get solid responses to your queries.


Authored by Dennis O'Connor 2003-2004