- This page describes a method used by some editors on Articles for Deletion to approximate notability, and sometimes in discussions over article naming. It discusses the method's proper use, limits and restrictions.
Here are some ways to use Google ([1]), Alexa ([2]), Yahoo! ([3]) and Clusty ([4]) to check articles and other information.
Types of Google tests
On Wikipedia, a Google Test is any use of Google or other search engines as references. Several very distinct kinds of information can be gleaned by this method. It should be stressed that none of these applications is conclusive evidence, but simply a first-pass heuristic or rule of thumb.
- Unencyclopedic or spurious topics. Some topics introduced to Wikipedia articles don't belong here. Some of these can be detected by running a Google search on a relevant phrase and counting the number of search results. This technique works reasonably well for weeding out hoaxes, fictions, and personal theories and hypotheses. It can also be used to ascertain whether a topic is of sufficiently broad interest to merit inclusion in the wiki, though this application is highly subject to bias (see below). See Wikipedia:What Wikipedia is not for a more comprehensive list of unencyclopedic topics.
- Copyrighted material. Large pieces of poorly wikified text, submitted to the wiki all at once, particularly by a new or anonymous user, are often copy-and-pasted from outside sources. Some of these are submitted in violation of copyright. (See also Wikipedia:Spotting possible copyright violations, Wikipedia:Copyrights.) A copy-and-paste operation from an online source can often be detected by running searches for excerpts.
- Idiosyncratic usage. The English language often has multiple terms for a single concept, particularly given regional dialects. A series of searches for different forms of a name gives some approximation of their relative popularity. For a quick comparison of relative usage try googlefight, e.g. comparing deoxyribose nucleic acid and deoxyribonucleic acid. Note that there are cases where this Google test can be overruled, such as when an international standard has been set, as in the case of aluminium.
- Related sites. If an article is of high quality (see Wikipedia:Featured articles), Google may be used to look for sites that might take an interest in it and be convinced to link to it.
- Research. Of course, search engines are good for finding sources of further information.
Techniques
The Google Web search is not the only Google search. In performing a Google test, consider searching Groups (USENET newsgroups). This is a significantly different sample and represents, for the most part, conversations in English conducted by people who are not deliberately trying to sell products or reach a mass audience. Other things being equal, a Groups search will typically return very roughly 1/5 as many hits as a Web search. Because Groups and Web searches have very different "systemic biases," hit numbers are not directly comparable. Nevertheless, Groups searches are particularly helpful in identifying entities whose Web presence may have been artificially inflated by promotional techniques; it is suspicious if a phrase gets, say, 100,000 Web hits but only 20 Groups hits.
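The Web-versus-Groups comparison above can be sketched as a simple ratio check. This is only an illustration: the function names and the threshold of 1,000 are assumptions chosen for the example, not an established rule, and the hit counts must be read off the search pages by hand.

```python
def web_to_groups_ratio(web_hits, groups_hits):
    """Return the number of Web hits per Groups hit (infinity if no Groups hits)."""
    if groups_hits == 0:
        return float("inf")
    return web_hits / groups_hits


def looks_promoted(web_hits, groups_hits, threshold=1000):
    """Flag a phrase whose Web presence dwarfs its Groups presence.

    The threshold is a hypothetical value; a typical ratio is very roughly 5.
    """
    if web_hits == 0:
        return False  # nothing to inflate
    return web_to_groups_ratio(web_hits, groups_hits) >= threshold


# The example from the text: 100,000 Web hits but only 20 Groups hits.
print(looks_promoted(100_000, 20))   # ratio 5000, flagged
print(looks_promoted(100_000, 20_000))  # ratio 5, not flagged
```

A flag from a check like this is only a prompt for closer inspection, since the two searches have different systemic biases.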
USENET postings are date-stamped and have been archived for over twenty years, making them more useful than Web searches as a record of recent history. Using a Groups "advanced search," it is possible to restrict a search by date, which can help in identifying how recent the widespread use of a term is.
Google News searches can assess whether something is currently newsworthy. One characteristic of Google News is that whereas it is easy and inexpensive to create websites or post to USENET, it is harder to convince a Google news source to run a story. Thus Google News, in comparison to Web or Groups, is less susceptible to manipulation by self-promoters. Note that Google News indexes many "news" sources that reflect specific points of view, and many news sources that are only of local interest.
Depending on the subject, advanced search functions may be useful. For example, adding "site:gov" or "site:edu" will restrict your search to U.S. government sites or U.S. college and university sites.
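As a minimal sketch of the "site:" restriction, the snippet below builds a search URL for an exact phrase limited to one domain suffix. The helper name `build_search_url` is hypothetical, and the URL layout (`https://www.google.com/search?q=...`) is assumed to be the ordinary public search address; nothing here queries Google automatically.

```python
from urllib.parse import urlencode


def build_search_url(phrase, site=None):
    """Return a search URL for an exact phrase, optionally restricted with site:."""
    query = f'"{phrase}"'          # quotation marks request an exact-phrase match
    if site:
        query += f" site:{site}"   # e.g. site:gov or site:edu
    return "https://www.google.com/search?" + urlencode({"q": query})


print(build_search_url("USS Constitution", site="gov"))
```

The resulting address can be pasted into a browser; the same query string can equally be typed directly into the search box.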
Other tools that may be useful for research include Google Scholar, which searches academic literature.
Google Book Search can be valuable. As part of the world of print, Google Book Search has a pattern of coverage that is in closer accord with traditional encyclopedia content than the Web, taken as a whole, is; if it has systemic bias, it is a very different systemic bias from Google Web searches. Multiple hits on an exact phrase in Google Book Search provide convincing evidence for the real use of the phrase or concept. Google Book Search can locate print-published testimony to the importance of a person, event, or concept. It can also be used to replace an unsourced "common knowledge" fact with a print-sourced version of the same fact. Amazon.com's "Search Inside The Book" also can be used.
Alexa test
Although Wikipedia is not a web directory, we can have articles about web sites if they meet the same criteria for encyclopedic interest as other articles.
If you're interested in writing a Wikipedia article about a particular web site, go to Alexa (http://www.alexa.com) and type in the URL. The traffic rank may help you decide whether a site is important enough. Most would agree that we should certainly have articles on top 100 sites and possibly on top 1,000 sites. For a site not in the top 100,000, most would agree that popularity alone would not suffice to justify its inclusion in Wikipedia. The range in between is a grey area where opinions differ. These figures should not be seen as 'cutoffs': a site may be notable for reasons other than its Alexa popularity, so the rank is only one of several factors to consider.
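The rough bands described above can be written down as a small lookup. This is a sketch only: the function name is hypothetical, the rank must be looked up manually, and treating everything between 1,000 and 100,000 as the grey area is one reading of the text rather than a fixed rule.

```python
def alexa_rank_guidance(rank):
    """Map an Alexa traffic rank to the rough guidance given in the text.

    rank is a positive integer, or None if the site has no rank.
    The bands are rules of thumb, not cutoffs.
    """
    if rank is None or rank > 100_000:
        return "popularity alone does not justify an article"
    if rank <= 100:
        return "most would agree an article is warranted"
    if rank <= 1_000:
        return "an article is possibly warranted"
    return "grey area: opinions differ"


for example_rank in (50, 750, 40_000, 250_000, None):
    print(example_rank, "->", alexa_rank_guidance(example_rank))
```

Whatever band a site falls in, other grounds for notability still need to be weighed separately.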
For some websites (e.g., microsoft.com) in the top thousand, a redirect to a broader article may be appropriate: in that case, Microsoft. (This is somewhat controversial.)
Also note that the Alexa rating includes significant bias, due to various factors. For example, the Alexa software is only available for Microsoft Windows and Microsoft Internet Explorer, and requires installation. So, for instance, a website exclusively devoted to an Apple Macintosh related topic might not have an Alexa ranking that accurately represents its true traffic. At the opposite extreme, some webmasters install the Alexa toolbar for the sole purpose of improving their own rankings by visiting their own web site with it. The Alexa toolbar's user base is small enough that one frequent visitor can have a noticeable effect on overall results. [5]
In addition, many users refuse to install the Alexa toolbar, as they feel the recording of which websites they visit constitutes spyware.
Google bias
When using Google to test for importance or existence, bear in mind that this will be biased in favor of modern subjects of interest to people from developed countries with Internet access, so it should be used with some judgment. For example, a current popular-music group from the United States will probably need many thousands of Google hits before most Wikipedians consider it worthy of inclusion. A similarly important group in a country with less Internet presence will have many fewer hits, if any. An important musician of the 14th century might not show up on Google at all.
Q. What is the minimum number of matches you should see if a term is not made up? (3? 27? 81?)
A. Perhaps a few hundred, but this depends on several things:
- The article's point of view: If narrow, fewer references are required. Try to categorize the point of view (whether it is NPOV or otherwise); e.g., notice the difference between Ontology and Ontology (computer science).
- The subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet neologism, it may be on 100 pages and might still not be considered 'existing' for Wikipedia's purposes.
- The type of sites you find: Pay attention to how open the sites are about accepting submissions. The Urban Dictionary, for example, accepts submissions freely. This is especially important if you suspect an author is self-promoting, or is promoting an idiosyncratic viewpoint. A single Internet user can submit the same ideas to message boards and open-submission sites all over the Internet.
- Duration the term has existed on Wikipedia: Sometimes when an article is created, the term initially doesn't exist anywhere else on the Internet, and the Google test may help determine that it's not widely used. After an article has existed on Wikipedia for an extended time, the term (e.g. the article title, or other jargon in the article) will be copied to many Wikipedia mirror sites, many unofficial mirrors, as well as "scraper" web sites which return results for any search term their web crawler encounters. Over time it may become harder to determine which hits originated from Wikipedia, and which hits reflect independent usage of the term. Wikipedia is one of the most frequently used sources of information on the web, and it's important that Wikipedia not directly or indirectly rely on itself as a source.
Further judgment: the Google test checks popular usage, not correctness. For example, a search for the incorrect Charles Windsor gives 10 times more results than the correct Charles Mountbatten-Windsor.
Also, some topics may not be on the Web because of low Internet use in certain areas and cultures of the world.
Search results from Google are highly biased towards popular culture. The article Scientists Use Google To Measure Fame vs. Merit, for example, points out that Barry Williams ("Greg Brady" from the Brady Bunch) has 45% more Google hits than Albert Einstein (2,400,000 vs. 1,660,000).
Especially when trying to determine the frequency of use of diacritic vs. non-diacritic versions of a word, the internet (and therefore Google) is extremely biased towards the non-diacritic versions. This is often more an example of laziness and cluelessness of those who created the webpages than a real test of usage. For example, spelling the weather phenomenon El Niño as 'El Nino' is just plain wrong (it doesn't rhyme with keno, vino, or Zeno). When Spanish words that have the ñ letter get naturalized into English the ñ often gets converted to "ny" (as when cañon became canyon), but "El Niño" is rarely spelled "El Ninyo" (and that spelling is more likely not on an English-language website). Yet despite the fact that the spelling should be El Niño, a Google test shows that there are more web pages with "El Nino" than "El Niño" (8,830,000 vs. 7,970,000 as of September 2005). Much better criteria for deciding upon the use of the diacritic vs. non-diacritic versions of a word would be the entries in dictionaries, other encyclopedias, and style guides.
Note that other Google searches, particularly Google Books, have a different systemic bias from Google Web searches and give an interesting cross-check and a somewhat independent view.
Urban legend bias
As was mentioned above, Google checks popular usage, not correctness. Just because a particular set of facts is repeated hundreds of times in Google search results does not make it correct. For example, there is an urban legend about the USS Constitution, which has the ship setting sail in 1779, and there are hundreds of sites which repeat this information [6]. In fact, the ship was not even launched until 1797. Similar "juicy" stories, repeated pretty much verbatim from source to source and webpage to webpage, routinely skew any data that might be obtainable via search engine.
Non-applicable in some cases, such as pornography
The simple Google test by number of hits is not applicable to people or titles within a number of internet-based businesses, most notably pornography. This is because an entire sub-industry has appeared with the sole purpose of increasing the number of Google hits certain subjects receive. They achieve this by use of a number of techniques, including multiple mirror sites, and spamming of notice boards and Wikipedia. Also, pornographic actors tend to appear in production-line quantities of entirely non-notable films. It is therefore necessary, as per Wikipedia:criteria for inclusion of biographies, for the researcher to prove that the actor or actress has established notoriety. This usually requires finding journalistic coverage, independent biographies or extensive fan clubs.
Validity of the Google test
Given that the results of a Google test are interpreted subjectively, its implementation is not always consistent. This reflects the nature of the test being used on a case by case basis.
In some cases, articles have been kept with Google hit counts as low as 15 and some claim that this undermines the validity of the Google test in its entirety. However, in fact, this reflects on the rather uneven and subjective nature of the Wikipedia:Articles for deletion process more than on the usefulness of the Google test. The Google test has always been and very likely always will remain an imperfect tool used to produce a general gauge of notability. It is not and should never be considered definitive.
Major factors which may depress the Google hit count include subjects from countries where the Internet is not prevalent, and topics of a historical nature that have not yet been well documented on the Internet. In other cases, it is completely speculative as to why one subject merits inclusion with a hit count below 100 while other such articles are frequently deleted.
Also note that the number of hits that Google reports is (sometimes or perhaps always; the details are secret) an estimate, not an exact figure. The reported number has little meaning until one navigates to the last page of the results, since it is only then that Google applies all criteria to a query (such as eliminating duplicates and applying spam control). Often the hit count is cut by a factor of 10 (or much more) after doing this. Jumping to the end of the results (or as far as is practical) also reveals whether the hits are actually related to the intended meaning of the search term. Queries are further improved by setting the results per page to the maximum value (which reduces duplicate results) and by excluding any domain of a biased party. For instance, "JoesRockBand.com" should be excluded when searching for references to "Joe's Rock Band". For longer-lasting articles, excluding the term "wikipedia" itself may be needed to avoid counting all the mirrors and language versions of a Wikipedia article. In fact, the AFD discussion itself, once archived and indexed by Google, may add to the Google hit count used the next time the item is discussed. Finally, some human labor has to be involved: a manageable sample of the sites found must be opened individually to verify the relevance of the hits.
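The exclusions described above can be sketched as a query builder. The helper name `refine_query` is hypothetical; the `-site:` and `-term` forms are the usual search-box exclusion syntax, and the output is meant to be pasted into the search box by hand.

```python
def refine_query(phrase, exclude_domains=(), exclude_terms=()):
    """Build an exact-phrase query that excludes given domains and terms."""
    parts = [f'"{phrase}"']
    # Drop pages hosted by interested parties, e.g. the subject's own site.
    parts += [f"-site:{domain}" for domain in exclude_domains]
    # Drop pages containing a given word, e.g. Wikipedia mirrors.
    parts += [f"-{term}" for term in exclude_terms]
    return " ".join(parts)


print(refine_query("Joe's Rock Band",
                   exclude_domains=["JoesRockBand.com"],
                   exclude_terms=["wikipedia"]))
```

Excluding the word "wikipedia" will also drop independent pages that merely mention Wikipedia, so the refined count is still only an approximation.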
On "unique" results
For search terms that return many results, Google uses a process that eliminates results which are "very similar" to other results listed. This seems to be accomplished by eliminating pages which are near-exact duplicates and by limiting the number of pages that can be returned from any given domain. For example, a search on "Taco Bell" will only give a couple of pages from tacobell.com even though many in that domain will certainly match. Further, Google's list of unique results is constructed by first selecting the top 1000 results and then eliminating duplicates without replacement. Hence the list of unique results will always contain fewer than 1000 results regardless of how many webpages actually matched the search terms. For example, from the about 742 million pages related to "Microsoft", Google presently returns 552 "unique" results (as of Jan 9, 2006[7]). Because of this, caution must be used in judging the relative importance of subjects having well over 1000 hits. Once the unique count goes over a few hundred, it becomes difficult or impossible to determine just how high it should be. If the unique count is extremely low (such as a few dozen), that may be a sign that there really is only a small number of unique hits. Doing a site-specific search may help determine whether most of the hits are coming from a single web site. A single web site can account for hundreds of thousands of hits.
By going to the end of a result list it is possible to click a link asking Google to "search with omitted results included" so as to see how many results have been removed; however, in no case will Google ever allow you to see more than 1000 results.
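To make the effect of "top 1000 first, then remove duplicates without replacement" concrete, here is a small simulation. It is not Google's actual algorithm, whose details are not public: the per-domain cap of 2, the use of a content fingerprint to stand for "very similar" pages, and the made-up `example` domains are all assumptions for illustration.

```python
from collections import Counter


def unique_results(ranked, top_n=1000, per_domain=2):
    """Simulate the 'unique results' list.

    ranked is a list of (domain, fingerprint) pairs in ranking order.
    Only the first top_n entries are considered; dropped entries are
    never replaced by lower-ranked ones.
    """
    seen_fingerprints = set()
    pages_per_domain = Counter()
    kept = []
    for domain, fingerprint in ranked[:top_n]:
        if fingerprint in seen_fingerprints:
            continue  # near-duplicate of a page already listed
        if pages_per_domain[domain] >= per_domain:
            continue  # this domain has reached its cap
        seen_fingerprints.add(fingerprint)
        pages_per_domain[domain] += 1
        kept.append((domain, fingerprint))
    return kept


# 5000 matching pages spread over 400 made-up domains, all with distinct content.
ranked = [(f"site{i % 400}.example", i) for i in range(5000)]
print(len(unique_results(ranked)))  # 800
```

Although 5000 pages match in this toy case, the unique list holds only 800, which is why a unique count in the hundreds says little about the true number of matching pages.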
Search engine limitations
Much, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.
The estimated size of the World Wide Web is at least 2 billion pages, but a much deeper (and larger) Web, estimated at over 500 billion pages, exists within databases whose contents the search engines do not index. These dynamic web pages are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The United States Patent and Trademark Office website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.
Foreign languages and non-Latin scripts
Claims for the non-notability of a topic are occasionally made based on few Google hits, where a considerably larger number of hits would have resulted from searching in the correct script or for various transcriptions. An Arabic name, for instance, needs to be searched for in the original script, which is easily done with Google, provided one knows what to search for, but one also has to take into account that e.g. English, French and German webpages will likely transcribe the name using different conventions.
In addition, different forms of a name used in the original language must be searched for. A Russian personal name has to be searched for both including and excluding the patronymic, and any search for names and other words in strongly inflected languages should take into account that arriving at the total number of hits may require searching for forms with varying case-endings or other grammatical variations not obvious for someone who does not know the language.
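As a sketch of the variant searches described above, the snippet below lists exact-phrase queries for a name with and without the patronymic, in each spelling supplied. The function name is hypothetical, and the spellings must be provided by someone who knows the language; the code does not transliterate or inflect anything itself.

```python
def name_queries(forms):
    """Return exact-phrase queries for each (given, patronymic, family) form.

    Each form yields a query without the patronymic and, if a patronymic
    is supplied, a second query that includes it.
    """
    queries = []
    for given, patronymic, family in forms:
        queries.append(f'"{given} {family}"')
        if patronymic:
            queries.append(f'"{given} {patronymic} {family}"')
    return queries


forms = [
    ("Fyodor", "Mikhailovich", "Dostoevsky"),   # one English transcription
    ("Фёдор", "Михайлович", "Достоевский"),     # original Cyrillic
]
for query in name_queries(forms):
    print(query)
```

Further transcriptions and inflected case forms can be added to the list as extra tuples; hit counts from the separate queries overlap, so they should not simply be summed.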
Doing a search like this requires a certain linguistic competence which not every individual Wikipedian possesses, but the Wikipedia community as a whole includes many bilingual and multilingual people. It is important for nominators and voters on AFD at least to be aware of their own limitations, and not to present a small number of Google hits for, say, a Serbian poet as conclusive without pointing out the limited validity of a preliminary search using only one particular transcribed form of the name.
See also
- Wikipedia:List of ways to verify notability of articles
- Meta:Mirror filter, a way to filter sites from Google search to remove sites which mirror Wikimedia content