Page:The World Within Wikipedia: An Ecology of Mind.pdf/8

Information 2012, 3
236

similarity is the cosine between these two vectors. The inlink metric is modeled after Normalized Google Distance[1] and measures the extent to which the inlinks X of article x intersect the inlinks Y of article y. If the two sets are identical, X = Y, the metric is zero:
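As a hedged reconstruction, assuming the measure takes the form given by Milne & Witten[2], and writing W for the set of all Wikipedia articles (a symbol this section does not define), the inlink metric can be written as:

```latex
\[
\mathrm{inlink}(x, y) =
  \frac{\log\bigl(\max(|X|, |Y|)\bigr) - \log\bigl(|X \cap Y|\bigr)}
       {\log\bigl(|W|\bigr) - \log\bigl(\min(|X|, |Y|)\bigr)}
\]
```

When X = Y, both max(|X|, |Y|) and |X ∩ Y| equal |X|, so the numerator vanishes and the metric is zero, in agreement with the statement above.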



Inlink and outlink metrics are averaged to produce a composite score. Since each anchor defines a set of possible articles, the computations above produce a list of scored pairs of articles for a given pair of anchors. For example, the anchor bill links to bill-law and bill-beak, and the anchor board links to board-directors and board-game, yielding four similarity scores, one for each possible pairing. WLM selects a particular pair by applying the following heuristics. First, only articles that receive at least 1% of the anchor’s links are considered. Second, WLM accumulates the most related pairs (those within 40% of the most related pair) and selects from this list the most related pair. It is not clear from the discussion in Milne & Witten[2] whether this efficiency heuristic differs from simply selecting the most probable pair except in total search time.
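The selection heuristics above can be sketched in Python. This is a minimal illustration, not the Wikipedia Miner code: the function names, the dictionary layout, and the toy scores are all hypothetical, scores are assumed to be relatedness values where higher means more related, and "within 40%" is assumed to mean a score of at least 60% of the best score.

```python
from itertools import product


def composite(inlink_score, outlink_score):
    """Average the inlink and outlink scores into one composite score."""
    return (inlink_score + outlink_score) / 2.0


def select_sense_pair(senses_a, senses_b, relatedness,
                      min_share=0.01, window=0.40):
    """Pick one article pair for a pair of anchors.

    senses_a / senses_b map an article name to the share of the anchor's
    links that point to it (hypothetical layout). `relatedness` is any
    callable returning a composite score for two articles.
    """
    # Heuristic 1: keep articles receiving at least 1% of the anchor's links.
    cands_a = [art for art, share in senses_a.items() if share >= min_share]
    cands_b = [art for art, share in senses_b.items() if share >= min_share]

    scored = [((a, b), relatedness(a, b)) for a, b in product(cands_a, cands_b)]
    if not scored:
        return None

    # Heuristic 2: shortlist pairs within 40% of the best score (assumed to
    # mean score >= 0.6 * best), then take the most related pair from it.
    best = max(score for _, score in scored)
    shortlist = [item for item in scored if item[1] >= (1.0 - window) * best]
    return max(shortlist, key=lambda item: item[1])


# Toy data for the bill/board example; all numbers are invented.
toy_scores = {
    ("bill-law", "board-directors"): composite(0.9, 0.7),
    ("bill-law", "board-game"): composite(0.3, 0.1),
    ("bill-beak", "board-directors"): composite(0.1, 0.1),
    ("bill-beak", "board-game"): composite(0.05, 0.05),
}
bill = {"bill-law": 0.7, "bill-beak": 0.295, "bill-rare": 0.005}
board = {"board-directors": 0.6, "board-game": 0.4}

chosen = select_sense_pair(bill, board, lambda a, b: toy_scores[(a, b)])
```

Under this reading the shortlist always contains the top-scoring pair, so the final choice coincides with a plain maximum over the filtered pairs; the shortlist would only matter if the final step ranked by a different criterion.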


2.4. W3C3: Combined Model


In this section we present our combined model using implementations of the models described above. We call this model W3C3 because it combines information at the word-word, word-concept, and concept-concept levels. For each model except COALS, reference implementations were chosen that are freely available on the web.


To implement Wikipedia Miner’s WLM, we downloaded version 1.1 from SourceForge[3] and an XML dump of Wikipedia from October 2010[4]. ESA does not have a reference implementation provided by its creators. However, Gabrilovich recommends another implementation with specific settings to reproduce his results[5]. Following these instructions, we installed a specific build of Wikiprep-ESA[6] and used a preprocessed Wikipedia dump made available by Gabrilovich[7]. We created our own implementation of COALS and built a COALS-SVD-500 matrix from the same October 2010 XML dump of Wikipedia that was used for WLM above.


One intuition that motivates combining all three techniques into a single model is that each represents a different kind of meaning at a different level: word-word, word-concept, and concept-concept. This intuition was the basis for our simplistic unsupervised W3C3 model, which simply averages the relatedness scores given by these three techniques. Two relevant properties of the W3C3 model are worth noting. First, it has not been trained on any part of the data. Second, it has no parameters for combining the three constituent models; rather, their three outputs are simply averaged to yield a single output score.
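The averaging step is small enough to sketch directly. The function and argument names below are hypothetical, and the sketch assumes the three constituent scores have already been computed and lie on comparable scales.

```python
def w3c3(coals_score, esa_score, wlm_score):
    """Unweighted mean of the word-word (COALS), word-concept (ESA),
    and concept-concept (WLM) relatedness scores.

    No weights are fitted and nothing is trained: the three scores
    contribute equally.
    """
    return (coals_score + esa_score + wlm_score) / 3.0


# Invented scores for a single word pair.
example = w3c3(0.6, 0.3, 0.9)
```

Because the combination has no free parameters, no portion of an evaluation dataset has to be held out to tune it.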


3. Study 1: WordSimilarity-353


The WordSimilarity-353[8],[9] collection is a standard dataset widely used in semantic relatedness research[10],[11],[12],[13],[14],[15],[16],[17]. It was developed as a means of assessing similarity metrics by comparing their output to human ratings. WordSimilarity-353 contains 353 pairs of nouns and their corresponding judgments of semantic association. The nouns range in frequency from low (Arafat) to high (love)

  1. Manning, C.D.; Schütze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
  2. Milne, D.; Witten, I.H. An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, Chicago, IL, USA, 13–14 July 2008; AAAI Press: Chicago, IL, USA, 2008; pp. 25–30.
  3. Milne, D. Wikipedia Miner. 2010. Available online: http://wikipedia-miner.sourceforge.net (accessed on 21 February 2011).
  4. Wikipedia. Enwiki-20101011-pages-articles.xml. 2010. Available online: http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2 (accessed on 31 November 2010).
  5. Gabrilovich, E. Explicit Semantic Analysis (ESA). 2011. Available online: http://www.cs.technion.ac.il/~gabr/resources/code/esa/esa.html (accessed on 5 December 2010).
  6. Calli, C. Wikiprep-esa. 2011. Available online: https://github.com/faraday/wikiprep-esa/archives/c36cb9481f46e9edabda1663b7a3be8c1b205bd5 (accessed on 15 December 2010).
  7. Gabrilovich, E. Publicly available implementations. 2011. Available online: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/wikipedia-051105-preprocessed.tar.bz2 (accessed on 20 December 2010).
  8. Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; Ruppin, E. Placing search in context: The concept revisited. ACM Trans. Inf. Syst. 2002, 20, 116–131.
  9. Gabrilovich, E. The WordSimilarity-353 Test Collection. 2011. Available online: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ (accessed on 20 December 2010).
  10. Gabrilovich, E.; Markovitch, S. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 6–12 January 2007; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2007; pp. 1606–1611.
  11. Gabrilovich, E.; Markovitch, S. Wikipedia-based semantic interpretation for natural language processing. J. Artif. Int. Res. 2009, 34, 443–498.
  12. Milne, D.; Witten, I.H. An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, Chicago, IL, USA, 13–14 July 2008; AAAI Press: Chicago, IL, USA, 2008; pp. 25–30.
  13. Agirre, E.; Alfonseca, E.; Hall, K.; Kravalova, J.; Paşca, M.; Soroa, A. A Study on Similarity and Relatedness Using Distributional and WordNet-Based Approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL ’09); Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; pp. 19–27.
  14. Pirró, G.; Euzenat, J. A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness. In Proceedings of the 9th International Semantic Web Conference on the Semantic Web–Volume Part I; Springer-Verlag: Berlin, Germany, 2010; ISWC’10, pp. 615–630.
  15. Reisinger, J.; Mooney, R. A Mixture Model with Sharing for Lexical Semantics. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10); Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 1173–1182.
  16. Tsatsaronis, G.; Varlamis, I.; Vazirgiannis, M. Text relatedness based on a word thesaurus. J. Artif. Int. Res. 2010, 37, 1–40.
  17. Tsatsaronis, G.; Varlamis, I.; Vazirgiannis, M. Text relatedness based on a word thesaurus. J. Artif. Int. Res. 2010, 37, 1–40.