
This how-to describes a tool that is used on Votes for Deletion and sometimes in discussions over naming articles. It covers the tool's proper use, limits and restrictions.

Here are some ways to use Google, Alexa and Yahoo! to check articles and other information.

Types of Google tests

On Wikipedia, a Google Test is any use of Google or other search engines as references. Several very distinct kinds of information can be gleaned by this method. It should be stressed that none of these applications is conclusive evidence, but simply a first-pass heuristic or rule of thumb.

Unencyclopedic or spurious topics. Some topics introduced to Wikipedia articles don't belong here. Some of these can be detected by running a Google search on a relevant phrase and counting the number of search results. This technique works reasonably well for weeding out hoaxes, fictions, and personal theories and hypotheses. It can also be used to ascertain whether a topic is of sufficiently broad interest to merit inclusion in the wiki, though this application is highly subject to bias (see below). See Wikipedia:What Wikipedia is not for a more comprehensive list of unencyclopedic topics.

Copyrighted material. Large pieces of poorly wikified text, submitted to the wiki all at once, particularly by a new or anonymous user, are often copy-and-pasted from outside sources. Some of these are submitted in violation of copyright. (See also Wikipedia:Spotting possible copyright violations, Wikipedia:Copyrights.) A copy-and-paste operation from an online source can often be detected by running searches for excerpts.
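The excerpt check can be sketched in a few lines of Python. This is a minimal illustration, not an established tool: the function name `excerpt_queries` and its parameters are hypothetical, and the quoted phrases it returns are meant to be pasted into a search engine by hand.

```python
import re


def excerpt_queries(text, words_per_excerpt=8, max_queries=3):
    """Build quoted-phrase queries from evenly spaced excerpts of `text`.

    Hypothetical helper: the excerpt length and query count are
    illustrative defaults, not recommended values.
    """
    # Drop double quotes so they cannot break the quoted-phrase syntax.
    words = re.findall(r"\S+", text.replace('"', " "))
    if not words:
        return []
    if len(words) <= words_per_excerpt:
        return ['"' + " ".join(words) + '"']
    # Spread the excerpt start positions across the whole text.
    last_start = len(words) - words_per_excerpt
    step = max(1, last_start // max(1, max_queries - 1))
    starts = list(range(0, last_start + 1, step))[:max_queries]
    return ['"' + " ".join(words[s:s + words_per_excerpt]) + '"' for s in starts]
```

Taking excerpts from the beginning, middle and end of a submission makes it more likely that at least one of them survives any light editing of the pasted text.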

Idiosyncratic usage. The English language often has multiple terms for a single concept, particularly given regional dialects. A series of searches for different forms of a name reveals some approximation of their relative popularity. For a quick comparison of relative usage try googlefight, e.g. comparing deoxyribose nucleic acid and deoxyribonucleic acid. Note that there are cases where this googletest can be overruled, such as when an international standard has been set, as in the case of aluminium.
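Once hit counts for each form of a name have been collected by hand, their relative popularity is simple arithmetic. The sketch below assumes the counts are supplied by the user; the function name `usage_shares` is hypothetical, and the numbers in the usage note are placeholders rather than real search results.

```python
def usage_shares(hit_counts):
    """Return each term's share of the combined hit count.

    `hit_counts` maps a search term to the number of hits recorded
    for it. The shares are only a rough approximation of relative
    usage, since reported hit counts are themselves estimates.
    """
    total = sum(hit_counts.values())
    if total == 0:
        return {term: 0.0 for term in hit_counts}
    return {term: count / total for term, count in hit_counts.items()}
```

For example, `usage_shares({"form A": 300, "form B": 100})` gives form A three quarters of the combined hits. A lopsided share suggests the more common form, but it does not override an established standard.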

Related sites. If an article is of high quality (see Wikipedia:Featured articles), Google may be used to look for sites that might take an interest in it and be convinced to link to it.

Research. Of course, search engines are good for finding sources of further information.

Techniques

The Google Web search is not the only Google search. In performing a Google test, consider searching Groups (USENET newsgroups). This is a significantly different sample and represents, for the most part, conversations in English conducted by people who are not deliberately trying to sell products or reach a mass audience. Other things being equal, a "Groups" search will typically return very roughly 1/5 as many hits as a "Web" search. Because Groups and Web searches have very different "systemic biases," their hit numbers are not directly comparable. Nevertheless, Groups searches are particularly helpful in identifying entities whose Web presence may have been artificially inflated by promotional techniques; it is suspicious if a phrase gets, say, 100,000 Web hits but only 20 Groups hits.
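The Web-versus-Groups comparison can be written down as a rough check. This is only a sketch under stated assumptions: the 1/5 figure comes from the text above, while the `tolerance` factor of 100 and the function name `looks_inflated` are hypothetical choices made for illustration.

```python
def looks_inflated(web_hits, groups_hits, expected_ratio=0.2, tolerance=100):
    """Flag a phrase whose Groups hits fall far below the rough norm.

    expected_ratio: Groups hits as a fraction of Web hits (about 1/5).
    tolerance: how many times below that norm the Groups count must
    fall before the phrase is flagged (an assumed value).
    """
    if web_hits <= 0:
        return False
    # Flag when groups_hits < (expected_ratio * web_hits) / tolerance.
    return groups_hits * tolerance < web_hits * expected_ratio
```

With the figures from the text, `looks_inflated(100_000, 20)` is flagged, while a phrase with 100,000 Web hits and 20,000 Groups hits is not. A flag is a reason to look more closely, not a verdict.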

USENET postings are date-stamped and have been archived for over twenty years, making them more useful than Web searches as a record of recent history. Using a Groups "advanced search", it is possible to restrict a search by date, which can help in identifying how recent the widespread use of a term is.

Google News searches can help assess whether something is currently newsworthy. Whereas it is easy and inexpensive to create websites or post to USENET, it is harder to convince a Google News source to run a story. Thus Google News, in comparison to Web or Groups, is less susceptible to manipulation by self-promoters. Note, however, that Google News indexes many "news" sources that reflect specific points of view, and many news sources that are only of local interest.

Depending on the subject, advanced search functions may be useful. For example, adding "site:gov" or "site:edu" will restrict your search to U.S. government sites or U.S. college and university sites.
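A restricted search of this kind can be assembled as a URL. The sketch below is illustrative only: the function name `search_url` is hypothetical, and the base address and `q` parameter are assumptions about the search engine's ordinary query URL, which may change.

```python
from urllib.parse import urlencode


def search_url(phrase, site=None, base="https://www.google.com/search"):
    """Build a search URL for an exact phrase, optionally limited by site.

    `base` and the `q` parameter name are assumptions about the
    search engine's public query URL.
    """
    query = f'"{phrase}"'
    if site:
        # Append the site: restriction, e.g. site:edu or site:gov.
        query += f" site:{site}"
    return base + "?" + urlencode({"q": query})


print(search_url("Vive la rose", site="edu"))
# prints https://www.google.com/search?q=%22Vive+la+rose%22+site%3Aedu
```

Quoting the phrase keeps the words together, so the hit count reflects the exact name rather than pages that merely contain each word somewhere.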

Other tools that may be useful for research include Google Scholar, which searches academic literature, and Google Print, which searches the contents of books.

Alexa test

Although Wikipedia is not a web directory, we can have articles about web sites if they meet the same criteria for encyclopedic interest as other articles.

If you're interested in writing a Wikipedia article about a particular web site, just go to Alexa (http://www.alexa.com), and type in the URL. The traffic rank may help you decide whether a site is important enough. Most would agree that we should certainly have articles on top 100 sites, possibly have articles on top 1,000 sites. For a page not in the top 100,000, most would agree that popularity alone would not suffice to justify its inclusion in Wikipedia. The intermediate area is a grey area where opinions differ.
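The rule of thumb above can be summarized as a small lookup. This is a sketch of the thresholds as stated in the text, not an official policy; the function name `alexa_tier` and the wording of the returned labels are hypothetical, and the rank is assumed to have been read off the Alexa site by hand.

```python
def alexa_tier(rank):
    """Map a traffic rank (1 = most visited) to the rough guidance above."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    if rank <= 100:
        return "certainly merits an article"
    if rank <= 1_000:
        return "possibly merits an article"
    if rank <= 100_000:
        return "grey area: opinions differ"
    return "popularity alone does not justify inclusion"
```

A site outside the top 100,000 may still merit an article on other grounds; the last tier only says that traffic by itself is not enough.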

For some websites (e.g., microsoft.com) in the top thousand, a redirect to a broader article may be appropriate: in that case, Microsoft. (This is somewhat controversial.)

Also note that the Alexa rating includes significant bias, due to various factors. For example, the Alexa software is only available for Microsoft Windows and Microsoft Internet Explorer, and requires installation. So, for instance, a website exclusively devoted to an Apple Macintosh related topic might not have an Alexa ranking that accurately represents its true traffic activity.

See also Wikipedia:Web comics for some specific advice related to web comics.

Google bias

When using Google to test for importance or existence, bear in mind that this will be biased in favor of modern subjects of interest to people from developed countries with Internet access, so it should be used with some judgment. For example, a current popular-music group from the United States will probably need many thousands of Google hits before most Wikipedians consider it worthy of inclusion. A similarly important group in a country with less Internet presence will have many fewer hits, if any. An important musician of the 1300s might not show up on Google at all.

Q. What is the minimum number of matches you should see if a term is not made up? (3? 27? 81?)

A. A couple hundred perhaps! It depends on several things:

The article's point of view: If narrow, fewer references are required. Try to categorize the point of view (whether it is NPOV or otherwise); e.g., notice the difference between Ontology (philosophy) and Ontology (computer science).
The subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet neologism, it may be on 100 pages and might still not be considered 'existing' for Wikipedia's purposes.
The type of sites you find: Pay attention to how open the sites are about accepting submissions. The Urban Dictionary, for example, accepts submissions freely. This is especially important if you suspect an author is self-promoting, or is promoting an idiosyncratic viewpoint. A single Internet user can submit the same ideas to message boards and open-submission sites all over the Internet.

Further judgment: the Google test checks popular usage, not correctness. For example, a search for the incorrect Charles Windsor gives 10 times more results than the correct Charles Mountbatten-Windsor.

Also, some topics may not be on the Web because of low Internet use in certain areas and cultures of the world.

Validity of the Google test

Given that the results of a Google test are interpreted subjectively, its implementation is not always consistent. This reflects the nature of the test being used on a case by case basis.

In some cases, articles have been kept with Google hit counts as low as 15 and some claim that this undermines the validity of the Google test in its entirety. However, in fact, this reflects on the rather uneven and subjective nature of the Wikipedia:Votes for deletion process more than on the usefulness of the Google test. The Google test has always been and very likely always will remain an imperfect tool used to produce a general gauge of notability. It is not and should never be considered definitive.

Major factors that may depress a Google hit count include subjects from countries where the Internet is not prevalent and topics of a historical nature that have not yet been well documented on the Internet. In other cases, one can only speculate as to why one subject merits inclusion with a hit count below 100 while other such articles are frequently deleted.

Also note that the number of hits that Google reports is (sometimes or perhaps always; the details are secret) an estimate, not an exact figure.

Search engine limitations

As of 2000, the percentage of the World Wide Web indexed by select search engines was as follows:

Inktomi (Used by AOL and MSN): 50%
Fast Search and Transfer: 34%
AltaVista: 25%
Northern Light: 24%
Excite: 21%
Google: 20%
Go (Infoseek): 6%
Lycos: 6%

Due to the disparity in these statistics, different search engines may bring up widely different numbers of hits for a particular topic. Vive la rose, for instance, had about 1,478 MSN hits and 591 Google hits as of April 2005. Users may want to try a few different search engines before making a decision on a subject's notability.

The estimated size of the World Wide Web is at least 2 billion pages, but a much deeper (and larger) Web, estimated at over 500 billion pages, exists within databases whose contents the search engines do not index. These "dynamic" pages are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The United States Patent and Trademark Office website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.

Foreign languages and non-Latin scripts

Claims for the non-notability of a topic are occasionally made on the basis of few Google hits, where a considerably larger number of hits would have resulted from searching in the correct script or for various transcriptions. An Arabic name, for instance, needs to be searched for in the original script, which is easily done with Google, provided one knows what to search for. One also has to take into account that, for example, English, French and German webpages will likely transcribe the name using different conventions.

In addition, different forms of a name used in the original language must be searched for. A Russian personal name has to be searched for both including and excluding the patronymic, and any search for names and other words in strongly inflected languages should take into account that arriving at the total number of hits may require searching for forms with varying case-endings or other grammatical variations not obvious for someone who does not know the language.
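Enumerating the forms to search for can be done mechanically once someone who knows the language has listed the alternatives. The sketch below is a minimal illustration: the function name `name_variant_queries` is hypothetical, and the spellings in the usage note are example transcriptions supplied by hand, not an exhaustive list.

```python
from itertools import product


def name_variant_queries(parts):
    """Return quoted queries for every combination of name-part variants.

    `parts` is a list of lists, one per name part, each holding the
    alternative spellings of that part. An empty string in a list
    marks the part as optional (e.g. a patronymic).
    """
    queries = []
    for combo in product(*parts):
        name = " ".join(p for p in combo if p)
        query = f'"{name}"'
        # Skip empty names and duplicates while keeping first-seen order.
        if name and query not in queries:
            queries.append(query)
    return queries
```

For example, `name_variant_queries([["Fyodor", "Fedor"], ["Mikhailovich", ""], ["Dostoevsky", "Dostoyevsky"]])` yields eight queries, covering both transcriptions of each name with and without the patronymic. Inflected case forms and the original-script spelling would be added as further alternatives in the same way.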

Doing a search like this requires a linguistic competence that not every individual Wikipedian possesses, but the Wikipedia community as a whole includes many bilingual and multilingual people. Nominators and voters on VfD should at least be aware of their own limitations: rather than stating conclusively that, say, a Serbian poet gets only a small number of Google hits, they should point out the limited validity of a preliminary search that used only one particular transcribed form of the name.

