The instances of templates in Wikipedia form an interesting data set of structured information. The so-called cite journal template is primarily used for citing articles in scientific journals. Citations that use the template can be extracted and analyzed: non-negative matrix factorization (NMF) is performed on an (article × journal) matrix, resulting in a soft clustering of Wikipedia articles and scientific journals, with each cluster more or less representing a scientific topic.
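As a rough illustration of the factorization step, the following sketch runs NMF on a tiny (article × journal) count matrix. The article and journal names, the counts, the number of clusters and the use of Lee-Seung multiplicative updates are all assumptions made for this example, not details of the original analysis.

```python
import numpy as np


def nmf(X, n_components, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative matrix X as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared Frobenius error;
    this choice of algorithm is an assumption of the sketch.
    """
    rng = np.random.default_rng(seed)
    n_articles, n_journals = X.shape
    W = rng.random((n_articles, n_components)) + eps
    H = rng.random((n_components, n_journals)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Made-up citation counts: rows are articles, columns are journals.
articles = ["Supernova", "Pulsar", "Insulin", "Aspirin"]
journals = ["Astrophysical Journal", "Astronomy & Astrophysics",
            "The Lancet", "BMJ"]
X = np.array([
    [3, 2, 0, 0],
    [1, 4, 0, 0],
    [0, 0, 5, 1],
    [0, 0, 2, 3],
], dtype=float)

W, H = nmf(X, n_components=2)

# W holds the article loadings and H the journal loadings per cluster.
# The clustering is soft; argmax only picks the strongest cluster.
article_cluster = W.argmax(axis=1)
journal_cluster = H.argmax(axis=0)

for name, cluster in zip(articles, article_cluster):
    print(f"{name}: cluster {cluster}")
```

Because both factors are non-negative, each row of `W` can be read as the degree to which an article belongs to each cluster, and each column of `H` plays the same role for a journal.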
Below is an image of the clusters from an analysis of a 2007 dump of Wikipedia. In this particular analysis the dominant clusters in the model are about astronomy, Einstein, medicine, intelligence, bacteria and human leukocyte antigen.
Cluster bush visualization of clusters in outbound scientific citations in Wikipedia. Each cluster is labeled with part of the titles of its representative Wikipedia articles.
Not all of the more than two million Wikipedia articles are analyzed with NMF, only those that include the Cite journal template, which amount to only tens of thousands of articles.
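The selection of articles can be sketched as follows: scan the wikitext of each page for Cite journal templates, read the `journal` field, and keep only pages with at least one such citation. The regular expression, the helper name `journal_names` and the sample pages are hypothetical. The pattern is a simplification that ignores nested templates and piped wiki links inside the citation.

```python
import re
from collections import Counter

# Simplified pattern: a cite journal template with no nested braces inside.
CITE_JOURNAL = re.compile(r"\{\{\s*cite journal\s*\|([^{}]*)\}\}",
                          re.IGNORECASE)


def journal_names(wikitext):
    """Return the journal field of every cite journal template found."""
    names = []
    for match in CITE_JOURNAL.finditer(wikitext):
        for field in match.group(1).split("|"):
            key, sep, value = field.partition("=")
            if sep and key.strip().lower() == "journal" and value.strip():
                names.append(value.strip())
    return names


# Made-up page texts standing in for a Wikipedia dump.
pages = {
    "Supernova": ("Text {{cite journal |author=A. Author |title=T "
                  "|journal=Astrophysical Journal |year=2001}} more."),
    "Plain article": "No citations here.",
}

# Journal counts per article; articles without the template are dropped.
counts = {title: Counter(journal_names(text)) for title, text in pages.items()}
cited = {title: c for title, c in counts.items() if c}
print(cited)
```

The per-article counters in `cited` are what would fill the rows of the (article × journal) matrix.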
Part of the question session of the presentation at Wikimania was filmed by the Bibliotheca Alexandrina. An MPEG movie (188MB) is available.
Finn Årup Nielsen is a senior researcher at the Department of Informatics and Mathematical Modelling at the Technical University of Denmark on a grant from the Lundbeckfonden to CIMBI. He is also attached to the Neurobiology Research Unit at the Copenhagen University Hospital Rigshospitalet. He contributes from time to time to the Danish and English language Wikipedias as the “fnielsen” user.