July 2022
Faculty and students presented the paper "Journey to the center of the words: Word weighting scheme based on the geometry of word embeddings" at the 34th International Conference on Scientific and Statistical Database Management (SSDBM 2022), held in Copenhagen, Denmark, from 6 to 8 July. The research work, as explained by the authors Narendra Babu Unnam, Prof. P Krishna Reddy, Amit Pandey and Dr. Naresh Manwani:
A notable amount of work has been done in recent years on finding sentence embeddings using compositional models. This work has shown that one of the simplest and most effective ways to obtain sentence embeddings is to average off-the-shelf word embeddings trained on large corpora. Recent literature has introduced word weighting schemes, based on word frequency distributions, into the simple averaging model. Frequency-based weighted averaging models augmented with denoising steps have been shown to outperform many complex deep learning models. However, these frequency-based weighting schemes derive the word weights solely from raw counts and ignore the diversity of contexts in which the words occur. This paper proposes an alternative weighting scheme that captures contextual diversity in the word embedding space. The proposed weighting algorithm is simple, unsupervised, and non-parametric. Experimental results on semantic textual similarity tasks show that the proposed weighting method outperforms all the baseline models by significant margins and performs competitively with the current frequency-based state-of-the-art weighting approach. Furthermore, as the frequency distribution-based approaches and the proposed word embedding geometry-based weighting approach capture two different properties of words, we define hybrid weighting schemes that combine the two. We also empirically demonstrate that the hybrid weighting methods perform consistently better than the corresponding individual weighting schemes.
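To make the frequency-based weighted averaging mentioned above concrete, here is a minimal Python sketch. It is not the authors' code and does not implement the geometry-based weights proposed in the paper, whose details are not given in this summary. It assumes the widely used smooth inverse frequency form, weight = a / (a + p(w)), where p(w) is a word's relative frequency and `a` is a small constant, and it omits the denoising step. The function names, the toy corpus, and the two-dimensional embeddings are all hypothetical.

```python
from collections import Counter


def frequency_weights(corpus_tokens, a=1e-3):
    """Assumed smooth-inverse-frequency weights: a / (a + p(w))."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


def weighted_sentence_embedding(tokens, embeddings, weights):
    """Weighted average of the word vectors for the tokens that are known."""
    dim = len(next(iter(embeddings.values())))
    acc = [0.0] * dim
    used = 0
    for tok in tokens:
        if tok in embeddings and tok in weights:
            w = weights[tok]
            for i, value in enumerate(embeddings[tok]):
                acc[i] += w * value
            used += 1
    if used == 0:
        return acc  # no known words: return the zero vector
    return [x / used for x in acc]


# Toy data (hypothetical): "the" is frequent, "cat" and "sat" are rare.
corpus = ["the"] * 8 + ["cat", "sat"]
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}

weights = frequency_weights(corpus)
sentence_vec = weighted_sentence_embedding(["the", "cat"], embeddings, weights)
```

In this toy setup the rare word "cat" receives a larger weight than the frequent word "the", so the sentence vector leans toward the direction of "cat". Any other weighting scheme, including a geometry-based or hybrid one, could be plugged in by passing a different `weights` dictionary.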
Website: https://ssdbm.org/2022/