I chose to look at the topic of word embeddings, as they were, and still are, a popular method for the semantic analysis of texts. As output, word embeddings return the words in a corpus that are used in similar or interchangeable contexts. Last time, I looked at documents within the field of Slavic languages & literatures; however, it was difficult to find texts prominently used in that field, as Web of Science caters to article publications, typically within STEM. So, I decided to look at a topic within computer science that still relates to text (and sometimes literature). Also, since word embeddings are still relatively “new” within computer science (the most prominent algorithm, word2vec, was developed in 2013), there weren’t too many articles to download. In total, the search returned 847 articles.
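To make that output concrete, here is a minimal sketch of what a word embedding model returns, using the gensim library’s word2vec implementation. The toy corpus and the query word are made up for illustration; a real analysis would train on a much larger text collection.

```python
# A minimal sketch of word embedding output, using gensim's word2vec.
# The tiny corpus below is invented purely for illustration.
from gensim.models import Word2Vec

# Each "sentence" is a list of tokens; real input would be a tokenized corpus.
corpus = [
    ["the", "novel", "explores", "language", "and", "memory"],
    ["the", "poem", "explores", "language", "and", "loss"],
    ["neural", "networks", "learn", "vector", "representations", "of", "words"],
]

# vector_size and window are typical settings; min_count=1 keeps every word
# in this tiny example (a real corpus would use a higher threshold).
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, seed=42)

# The output: words ranked by how similar their contexts of use are.
print(model.wv.most_similar("novel", topn=3))
```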

I generated the first network by following the directions in the tutorial (binary counting, title + abstract as text data). Prominent topics were “language,” “recurrent neural network,” “convolutional neural network,” and “sentiment analysis.” I then switched to full counting to see how the clustered groups differed. With full counting, specific languages such as Spanish and Hindi appeared as nodes in the network, as did topics such as social media. It seems that binary counting gave a broader overview of general technical terms related to the algorithm, while full counting surfaced topics more specific to each article’s research area. (I tried to upload images, but the file size was too large, and attempts to compress the images didn’t work out.)
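As I understand VOSviewer’s two options, binary counting registers only whether a term appears in a document at all, while full counting tallies every occurrence. A small sketch of the difference (the documents and terms are hypothetical):

```python
# Sketch of binary vs. full counting, as I understand VOSviewer's options:
# binary counting records presence/absence per document; full counting
# tallies every occurrence. Documents here are invented for illustration.
from collections import Counter

docs = [
    "spanish word embeddings for spanish sentiment analysis",
    "hindi word embeddings with recurrent neural network",
    "sentiment analysis of social media using word embeddings",
]

binary_counts = Counter()
full_counts = Counter()
for doc in docs:
    tokens = doc.split()
    full_counts.update(tokens)          # every occurrence counts
    binary_counts.update(set(tokens))   # at most once per document

print(full_counts["spanish"])    # 2 -- both occurrences in the first doc
print(binary_counts["spanish"])  # 1 -- it appears in only one document
```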

My choice of data limits me to a specific research tool within a domain; moreover, it is limited to the term “word embeddings,” which refers only to the output of an algorithm, not to the algorithm itself. If I wanted to look at the algorithm word2vec specifically, the results could differ: searching “word2vec” as a topic on Web of Science returns 325 articles, as opposed to 847 for “word embeddings.” A network anchored on a specific algorithm might include more specific terms as nodes. I could also eliminate the most frequent, less content-specific terms from the data before generating the network, a common technique in CS for reducing noise (see the sketch below). Overall, this approach produces a more general model of an academic domain, which can be useful; however, I wouldn’t say the networks produced here speak to the full usage of word embeddings in research. That is probably due to the somewhat limited scope of articles in the database.
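Here is one hypothetical way that noise-reduction step could look: dropping terms that appear in nearly every document before building the network. The cutoff and the toy documents are my own invention.

```python
# Sketch of frequency-based filtering before building a term network:
# terms that occur in nearly every document connect to everything and
# add little cluster structure, so we drop them. The 0.8 cutoff and the
# documents below are arbitrary, for illustration only.
from collections import Counter

docs = [
    ["word", "embeddings", "for", "sentiment", "analysis"],
    ["word", "embeddings", "in", "recurrent", "neural", "networks"],
    ["evaluating", "word", "embeddings", "across", "languages"],
]

# Count how many documents each term appears in.
doc_freq = Counter(t for doc in docs for t in set(doc))

# Drop terms appearing in more than 80% of documents
# ("word" and "embeddings" here, which appear in all three).
cutoff = 0.8 * len(docs)
filtered = [[t for t in doc if doc_freq[t] <= cutoff] for doc in docs]
print(filtered)
```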
