
Souvik Banerjee – Text embeddings

Souvik Banerjee, supervised by Dr. Manish Shrivastava, received his Master of Science – Dual Degree in Computational Linguistics (CL). Here’s a summary of his research work on text embeddings in Riemannian manifolds:

Unsupervised text embedding models are ubiquitous in Natural Language Processing. These models embed words, sentences, or documents as vectors in a Euclidean space, following the principle that semantically similar pieces of text have similar representations, i.e., they lie close to each other in the semantic space. Word2vec and GloVe are the most popular examples of such models. Both train word embeddings efficiently, and evaluation methods such as word similarity and word analogy tasks demonstrate their effectiveness. However, they have their fair share of drawbacks. For instance, there is little to no explanation of why word vector summation solves word analogy. These models also suffer from meaning conflation deficiency, where the multiple senses of a word are represented by a single vector, leading to inaccurate semantic modelling. This is also an issue for sentence and document embeddings, which span multiple words, phrases, and topics.

Existing unsupervised models that tackle meaning conflation deficiency perform sense representation, where each sense of a word is represented by an individual vector. This is done by modifying the Word2vec skip-gram algorithm and adding extra constraints on the vector embedding space. Instead of adding extra constraints to the Word2vec algorithm, we address the issue by exploiting the geometry of the embedding space. We propose three unsupervised text embedding models that embed texts in Riemannian manifolds by integrating the linguistic principles of Word2vec and GloVe with optimization tools from differential geometry. The first and second models use the same joint modelling framework but embed both words and documents in the Grassmannian manifold and a custom product manifold, respectively. Both produce quality document embeddings, as shown by their evaluation results on document clustering and document classification tasks. The third model embeds words in the Spectrahedron manifold, where each word is a matrix whose eigenvectors correspond to the multiple senses of that word.

Finally, we provide a Lie-group-theoretic understanding of the linear substructures in Word2vec and GloVe that solve word analogy. Lie groups are differentiable manifolds with a group structure, and some basic properties of these groups show that the linear substructures arise from the tangent space at the identity element of the group.
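To make the last point concrete, here is a minimal worked equation in standard Lie-theory notation; it is an illustrative sketch of the idea, not a formula quoted from the thesis. For an abelian Lie group G with Lie algebra g = T_e G (the tangent space at the identity e), the exponential map turns group composition into vector addition:

\exp : \mathfrak{g} \to G, \qquad \exp(u)\,\exp(v) = \exp(u + v) \quad (G \text{ abelian}).

Identifying word vectors with tangent vectors u = \log g at the identity, the analogy a : b :: c : d becomes the linear relation \log g_b - \log g_a = \log g_d - \log g_c, i.e. v_b - v_a = v_d - v_c, which is exactly the vector-offset structure observed in Word2vec and GloVe.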
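The vector-offset behaviour itself is easy to reproduce. Below is a minimal sketch using the gensim library and a pretrained GloVe model; the library, model name, and example words are illustrative assumptions for this summary, not artifacts of the thesis:

import gensim.downloader as api

# Load pretrained GloVe vectors (50-dimensional, Wikipedia + Gigaword).
model = api.load("glove-wiki-gigaword-50")

# Word analogy via vector arithmetic: king - man + woman ≈ queen.
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # top match is 'queen' on this pretrained model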
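On the optimization side, models embedded in Riemannian manifolds take gradient steps along the manifold rather than plain Euclidean updates. The NumPy sketch below shows what one such step on the Grassmannian looks like for a toy objective; the objective, dimensions, and step size are placeholders, not the thesis’s actual training loss:

import numpy as np

def tangent_project(X, G):
    # Project an ambient gradient G onto the tangent space of the
    # Grassmannian at X (an n x p matrix with orthonormal columns).
    return G - X @ (X.T @ G)

def retract(X, V):
    # QR-based retraction: map the tangent-space step back onto the manifold.
    Q, _ = np.linalg.qr(X + V)
    return Q

# One Riemannian gradient-descent step on the toy objective
# f(X) = -trace(X.T @ A @ X), whose minimizer is a leading subspace of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A + A.T                                 # symmetric matrix
X, _ = np.linalg.qr(rng.standard_normal((10, 3)))  # a point on Gr(10, 3)

euclid_grad = -2 * A @ X                    # Euclidean gradient of f
riem_grad = tangent_project(X, euclid_grad) # Riemannian gradient
X = retract(X, -0.1 * riem_grad)            # step size 0.1 is a placeholder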

July 2023
