Scientific citations in Wikipedia
Is Wikipedia a reliable source for sources?
Seems so!
Abstract
The Internet-based encyclopædia Wikipedia has grown to
become one of the most visited Web sites on the Internet,
but critics have questioned the quality of entries. An
empirical study of Wikipedia found errors in a 2005 sample
of science entries. Biased coverage and lack of sources
are among the “Wikipedia risks.”
The study here describes a
simple assessment of these aspects by examining the
outbound links from Wikipedia articles to articles in
scientific journals with a comparison against journal
statistics from Journal Citation Reports such as impact
factors.
The results show an increasing use of structured
citation markup and good agreement with citation patterns
seen in the scientific literature though with a slight
tendency to cite articles in high-impact
journals such as Nature and Science. These results
increase confidence in Wikipedia as a reliable information
resource for science in general.
The study
This study went through the entire English
Wikipedia corpus (2.5 gigabytes, compressed!)
to identify structured scientific
citations and count them to see which journals were cited the
most.
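The counting step can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: it assumes that "structured citations" means Wikipedia's `{{cite journal}}` template with a `journal=` parameter, and the names `CITE_RE`, `JOURNAL_RE`, `count_journals` and the sample wikitext are made up for the example.

```python
import re
from collections import Counter

# Assumption: a structured citation is a one-line {{cite journal ...}}
# template without nested templates.
CITE_RE = re.compile(r"\{\{\s*cite journal\b([^{}\n]*)\}\}", re.IGNORECASE)
# The journal parameter runs up to the next "|" or the end of the template.
JOURNAL_RE = re.compile(r"\|\s*journal\s*=\s*([^|]*)", re.IGNORECASE)


def count_journals(wikitext):
    """Count how often each journal name appears in cite journal templates."""
    counts = Counter()
    for cite in CITE_RE.finditer(wikitext):
        match = JOURNAL_RE.search(cite.group(1))
        if not match:
            continue  # template lacks the journal field
        # Drop surrounding whitespace, italic quotes and link brackets.
        name = match.group(1).strip().strip("'[]").strip()
        if name:
            counts[name] += 1
    return counts


# Hypothetical wikitext, only for illustration.
sample = (
    "A.<ref>{{cite journal |title=T1 |journal=Nature |year=2005}}</ref>\n"
    "B.<ref>{{cite journal |journal=''Science'' |year=2006}}</ref>\n"
    "C.<ref>{{Cite journal |title=T2 |journal=[[Nature]]}}</ref>\n"
    "D.<ref>{{cite journal |title=No journal field}}</ref>\n"
)
counts = count_journals(sample)
print(counts.most_common())
```

On a real dump the same function would be applied page by page while streaming the compressed file, rather than to a single string.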
Some interesting points:
- The use of structured citations in Wikipedia has
increased.
That is, statements in Wikipedia articles are
increasingly supported by citations to trusted sources.
- The most cited scientific journals in Wikipedia are the
cross-disciplinary journals Nature and Science.
Most scientists would probably regard them as the two most
important scientific journals.
- The total scientific citation pattern in
Wikipedia is quite comparable to the total citation pattern seen
between journals, though there is some tendency for
Wikipedia contributors to cite high-impact journals,
such as Nature and Science, more than journals
that receive a lot of citations, such as Journal of Biological Chemistry.
- “Astro”-journals are often cited — more than
would be expected from statistics from Journal Citation Reports.
The
Astrophysical Journal was found to be the most cited
“Astro”-journal.
- Many citations also go to Australian botany journals, seemingly
because of the Banksia
Wikiproject that has made well-referenced articles for this
genus of plants with the beautiful
flowers.
A number of the articles for these plants have become so-called
“featured” articles on Wikipedia:
Coast
Banksia, Brown's Banksia (this Banksia is listed as endangered),
Heath-leaved
Banksia and Banksia epica.
- Computer journals are not cited as much as one (or
at least I)
would have expected.
Communications of the
ACM has been the most cited.
- Although there are other ways to reference newspaper
articles, some Wikipedia contributors have used the
mechanism that was investigated in the study and which is
mostly used for journals.
The newspaper that receives the most citations via this method
is The New York Times.
...And a Danish note:
- The “most” cited Danish journal is Ugeskrift for
Læger (as far as I can determine).
My present count (July 2007) for this
journal is just 5 citations! As far as I can determine,
structured citations are not yet (as of July 2007) used on
the Danish Wikipedia.
By far the most cited journals are in English.
To compare the Wikipedia citation numbers, the present de facto
standard for counting journal citations was used: the Journal
Citation Reports (JCR) from Thomson Scientific.
JCR is available on the web, but the company requires a paid
subscription to view the numbers.
Most recent analysis
The most recent analysis I have done is for the July 2007 Wikipedia
database dump.
The dump has now grown to a 2.9-gigabyte compressed file.
The image below shows the result in a scatter plot where the
Wikipedia citations to each journal are compared to a
combined number of total citations and impact factor from Journal
Citation Reports.
The upper right corner has Nature and Science, while
the journal shown as the leftmost dot is Australian
Systematic Botany.
Scatter plot of Wikipedia citations and Journal Citation
Reports.
Wikipedia data from July 2007.
A word of caution
Wikipedia is evolving and requires no strict formatting of
references. A citation may be formatted in a variety of ways; it
may be removed or reformatted.
Not all citations from Wikipedia were counted.
A great many citations use a free-hand format for the reference,
and in the study I did not attempt to count these.
I estimate that at least half (and probably more) of the references used the
free-hand format as of July 2007.
The numbers in the articles are only for “one-line”
citations. A structured citation may actually span multiple
lines, and these were not counted. The most recent analysis does count
them (shown in the scatter plot above).
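The difference between the “one-line” count and a count that includes multi-line citations can be illustrated with two patterns. This is only a sketch under the assumption that the structured citations are `{{cite journal}}` templates; the pattern names, the sample text and the author name in it are invented for the example and are not taken from the study.

```python
import re

# One-line pattern: the template body may not contain a newline.
ONE_LINE_RE = re.compile(r"\{\{\s*cite journal\b[^{}\n]*\}\}", re.IGNORECASE)
# Multi-line pattern: newlines are allowed inside the template body.
MULTI_LINE_RE = re.compile(r"\{\{\s*cite journal\b[^{}]*\}\}", re.IGNORECASE)

# Hypothetical wikitext with one compact and one spread-out citation.
wikitext = """A claim.<ref>{{cite journal |journal=Nature |year=2005}}</ref>
Another claim.<ref>{{cite journal
 | author = Doe J
 | journal = Science
 | year = 2006
}}</ref>"""

one_line = ONE_LINE_RE.findall(wikitext)
multi_line = MULTI_LINE_RE.findall(wikitext)
print(len(one_line), len(multi_line))
```

Neither pattern handles templates nested inside the citation; a fuller parser would need to track brace depth.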
A number of citations are not matched by my
algorithm, since the reference may be poorly formatted or lack
the essential journal information. These amount to only a minor
part: less than 5%.
I believe that these issues affect the numbers
equally, so that the relative numbers can be trusted.
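How journal names are matched is not described on this page. As a rough illustration of why some citations end up unmatched, the sketch below normalizes the journal field before looking it up in a list of known titles. The function `normalize_journal`, the normalization rules and the small title list are all assumptions made for the example.

```python
import re


def normalize_journal(name):
    """Reduce a journal field to a simple lowercase key for matching."""
    name = name.lower()
    name = re.sub(r"[\[\]']", "", name)          # wiki link and italic markup
    name = re.sub(r"^the\s+", "", name.strip())  # leading article
    name = re.sub(r"[^\w ]", " ", name)          # punctuation to spaces
    return " ".join(name.split())                # collapse whitespace


# Hypothetical list of known titles, normalized the same way.
known = {normalize_journal(t) for t in [
    "Nature", "Science", "The Astrophysical Journal", "Ugeskrift for Læger",
]}

# Hypothetical journal fields as they might appear in wikitext.
fields = ["''Nature''", "[[Science]]", "Astrophysical Journal", "Astrophys. J."]
unmatched = [f for f in fields if normalize_journal(f) not in known]
print(unmatched)
```

Abbreviated names such as the last field would need an additional abbreviation table, which is one way references can fall outside the matched set.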
References and Downloads
The study was published in the electronic open access journal “First Monday”, in the
August 2007 issue.
The PDF files from arXiv and my university department publication
database have higher-resolution images, but do not incorporate
the final edits that First Monday made.
Comment
Dansk omtale (Danish comments)
Author
Previous study by the author: Mining Posterior Cingulate
Newer study by the author:
Clustering scientific
citations in Wikipedia
$Id: Nielsen2007Scientific.html,v 1.18 2008/07/14 14:55:10 fn Exp $