Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation method for estimating the theoretical pointwise mutual information (PMI). Next, by building on the work of Levy & Goldberg (2014), we develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to Word2Vec and improves upon the truncation-based method proposed by Levy & Goldberg (2014). The proposed estimator also performs comparably to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set.
➤ Version 1 (2023-01-17) |
Neil Dey, Matthew Singer, Jonathan Williams and Srijan Sengupta (2023). Word Embeddings as Statistical Estimators. Researchers.One. https://researchers.one/articles/23.01.00006v1
© 2018–2025 Researchers.One