Word embedding dimensionality reduction using dynamic variance thresholding (DyVaT)

Avraham Treistman, Dror Mughaz, Ariel Stulman, Amit Dvir

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Natural language processing (NLP) provides a framework for large-scale text analysis. One common processing method uses vector space models (VSMs), which embed word attributes, called features, into high-dimensional vectors. Comprehensive VSMs are generated from sources such as the GoogleNews archive. A thesaurus, a collection of semantically related words, can be created for a particular root word using cosine similarity with a given VSM. Many methods have been developed to reduce the complexity of these models by retaining useful semantic information while discarding non-informative features. One such method, variance thresholding, retains high-variance features above a manually determined threshold, providing higher differentiation between words for classification purposes. Our research developed a dimension-reducing methodology called dynamic variance thresholding (DyVaT). DyVaT reduces the specificity of word embeddings by retaining low-variance features, allowing for a broader thesaurus that preserves semantic similarity. A dynamic variance threshold, determining which low-variance features are retained, is selected using the kneedle algorithm, improving on manual threshold selection. Our test case for examining the efficiency of DyVaT in creating a contextual thesaurus is the visual, auditory, and kinesthetic learning style context. We conclude that DyVaT is a valid method for generating loosely connected word collections with potential uses in NLP classification or clustering tasks.
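The core idea in the abstract can be sketched in a few lines: compute per-feature variance across the embedding matrix, sort the variances, and pick the cutoff at the knee of the sorted curve rather than by hand, keeping the low-variance features. The sketch below is a minimal illustration, not the authors' implementation; the function name `dyvat_select` is hypothetical, and the knee is found with a simplified kneedle-style heuristic (maximum distance between the normalized curve and its diagonal chord) instead of the full kneedle algorithm.

```python
import numpy as np

def dyvat_select(embeddings):
    """Illustrative sketch of dynamic variance thresholding (DyVaT):
    retain LOW-variance features, with the cutoff chosen at the knee
    of the ascending sorted-variance curve (kneedle-style heuristic).

    embeddings: array of shape (n_words, n_features).
    Returns the reduced matrix and the indices of retained features.
    """
    # Per-feature variance across the vocabulary.
    variances = embeddings.var(axis=0)
    order = np.argsort(variances)          # ascending: low variance first
    sorted_var = variances[order]

    # Kneedle-style knee: normalize the curve to [0, 1] on both axes and
    # take the point of maximum vertical distance below the diagonal
    # (appropriate for a convex, increasing curve).
    x = np.linspace(0.0, 1.0, sorted_var.size)
    y = (sorted_var - sorted_var.min()) / (np.ptp(sorted_var) + 1e-12)
    knee = int(np.argmax(x - y))

    keep = order[: knee + 1]               # low-variance features only
    return embeddings[:, keep], keep
```

A resulting reduced matrix can then feed the cosine-similarity thesaurus step described in the abstract: neighbors of a root word in the reduced space form a broader, loosely connected word collection than in the full space.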

Original language: English
Article number: 118157
Journal: Expert Systems with Applications
Volume: 208
DOIs
State: Published - 1 Dec 2022

Keywords

  • Dimensionality reduction
  • Feature selection
  • Learning styles
  • Machine learning
  • Natural language processing

