the columns of context words to the most frequent words only. Although the usefulness of these variations has been disputed in other frameworks[1], we nevertheless follow the original process for replication purposes. A third variant of COALS transforms the co-occurrence matrix using singular value decomposition (SVD)[2]. SVD is the key step in LSA and is used in COALS in a similar way: To eliminate noise in the matrix. In this variant, called COALS-SVD, the matrix A is first reduced to its most common 15,000 rows and 14,000 columns, forming the submatrix B. Phi-normalization as described above is applied to B, and finally B is factored by SVD into the product of three matrices:

B = UΣVᵀ

where U and V are orthonormal matrices, Σ = diag(σ₁, …, σₙ), and σ₁ ≥ … ≥ σₙ ≥ 0. The σᵢ are the singular values of the matrix B. The desired matrix for word-word comparisons is U, whose row vectors are the SVD-denoised versions of B's row vectors. Observe that right-multiplying B by VΣ⁻¹ yields U, since V is orthonormal (VᵀV = I):

BVΣ⁻¹ = UΣVᵀVΣ⁻¹ = U

By this identity, the full vocabulary of the original A matrix may be projected into the SVD solution for B, as long as the column dimensions of A and B match (e.g., 14,000). To do this, simply right-multiply A by VΣ⁻¹:

U_A = AVΣ⁻¹

U_A's row vectors are SVD-denoised versions of A's row vectors, defined by the SVD solution of B.
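
For concreteness, the following is a minimal NumPy sketch of these two steps. It is not from the paper: the matrix sizes, the random stand-in data, and the truncation level k are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes; the paper uses the 15,000 most common rows and
# 14,000 most common columns of A to form the submatrix B.
n_rows_A, n_rows_B, n_cols = 500, 150, 140
rng = np.random.default_rng(0)
A = rng.random((n_rows_A, n_cols))  # stand-in for the normalized co-occurrence matrix
B = A[:n_rows_B, :]                 # most frequent rows form the submatrix B

# Reduced SVD: B = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(B, full_matrices=False)

# Identity from the text: right-multiplying B by V Σ⁻¹ recovers U.
assert np.allclose(B @ Vt.T @ np.diag(1.0 / s), U)

# Project the full vocabulary of A into B's SVD solution: U_A = A V Σ⁻¹.
U_A = A @ Vt.T @ np.diag(1.0 / s)   # rows are denoised word vectors

# Denoising in practice keeps only the top k dimensions (k is an assumption here).
k = 50
U_A_k = A @ Vt[:k].T @ np.diag(1.0 / s[:k])
```

Using the reduced (economy) SVD keeps Σ invertible whenever B has full column rank; rows of U_A (or its truncated form) can then be compared directly as word vectors.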


2.2. Explicit Semantic Analysis


Explicit Semantic Analysis (ESA) uses the article structure of Wikipedia without considering the link structure[3][4]. The intuition behind ESA is that while traditional corpora are arranged in paragraphs, which might contain a mixture of latent topics, the topics in Wikipedia are explicit: Each article is a topic, or correspondingly a concept. ESA defines a vector for each term based on its occurrence in Wikipedia articles. Because the meaning representation of each word is defined by its co-occurrence with article concepts, ESA can be considered a word-concept level model. ESA vectors are based on frequency counts weighted by a variation of term frequency-inverse document frequency (tf-idf) [35]:

tfᵢⱼ = 1 + log υᵢⱼ if υᵢⱼ > 0, and tfᵢⱼ = 0 otherwise

where υᵢⱼ is the number of occurrences of term i in article j. Correspondingly:

idfᵢ = log( |A| / |{a ∈ A : tᵢ ∈ a}| )

where |A| is the total number of Wikipedia articles and the denominator is the number of articles in Wikipedia that contain a given term. An ESA vector for term i is defined by:

esa(tᵢ) = (tfidfᵢ₁, tfidfᵢ₂, …, tfidfᵢ|A|), where tfidfᵢⱼ = tfᵢⱼ · idfᵢ
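
As a concrete illustration, the sketch below builds ESA vectors over a toy set of "articles" using the tf and idf forms given above. The article texts, their names, and the helper esa_vector are invented for the example.

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia articles: in ESA, each article is a concept.
articles = {
    "Dog": "dog dog leash bark".split(),
    "Cat": "cat purr leash".split(),
    "Physics": "quantum quantum field".split(),
}

counts = {name: Counter(words) for name, words in articles.items()}  # υ_ij
num_articles = len(articles)                                         # |A|

def esa_vector(term):
    """Return the ESA concept vector of `term`: its tf-idf weight in each article."""
    df = sum(1 for c in counts.values() if c[term] > 0)  # articles containing term
    if df == 0:
        return {name: 0.0 for name in articles}
    idf = math.log(num_articles / df)
    return {
        name: ((1 + math.log(c[term])) * idf if c[term] > 0 else 0.0)
        for name, c in counts.items()
    }

print(esa_vector("leash"))  # nonzero on the Dog and Cat concepts, zero on Physics
```

Relatedness between two terms can then be measured by comparing their concept vectors, for example with cosine similarity.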

  1. Bullinaria, J.; Levy, J. Extracting semantic representations from word co-occurrence statistics: A computational study. Behav. Res. Methods 2007, 39, 510–526.
  2. Trefethen, L.N.; Bau, D., III. Numerical Linear Algebra; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1997.
  3. Gabrilovich, E.; Markovitch, S. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 6–12 January 2007; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2007; pp. 1606–1611.
  4. Gabrilovich, E.; Markovitch, S. Wikipedia-based semantic interpretation for natural language processing. J. Artif. Intell. Res. 2009, 34, 443–498.