Naming of Place in Russian Geography and Literature

World Historical Gazetteer

For this portion of the assignment, I chose to look at Saint Petersburg, as I know it refers to a place in both Florida and Russia. When I searched this term, I was not surprised to see dots appear in these two places on the map, but it was interesting to see the other places that share the name.

Along with Florida and Russia, “Saint Petersburg” references locations in Colorado, Pennsylvania, and South Dakota. Of the seven returned results, Florida is the only place with more than one, containing the variants “Saint Petersburg,” “Saint Petersburg Beach,” and “Port of Saint Petersburg.”

Apart from the various Saint Petersburgs around the globe, I wanted to look at this particular city because of how its name in Russia has changed over time. When I clicked on the returned result in Russia, there were five different attestations: “Saint Petersburg,” “Sankt-Peterburg,” “Leningrad,” “Petrograd,” and “St Petersburg.” “Sankt-Peterburg” is the transliteration of the Russian name for the city (Санкт-Петербург), while “Leningrad” and “Petrograd” are names given to the city in the twentieth century. “Grad” [град] is the Old Slavic form of “gorod” [город], meaning “city,” so “Leningrad” and “Petrograd” mean Lenin’s city and Peter’s (Peter the Great’s) city, respectively. Saint Petersburg was renamed Petrograd at the outbreak of the First World War in 1914, then renamed Leningrad following Lenin’s death in 1924. Seeing that Санкт-Петербург appeared as a listed variant of “Saint Petersburg,” I then searched the city’s name in Cyrillic. Unsurprisingly, the only returned result was in Russia, as opposed to the seven results returned with the English search term.

Overall, I found the interface easy to use and interesting; however, the one question I have relates to the numbers that appear over each green dot in the “temporal attestations” view. I’m unsure what these numbers refer to, and there isn’t a link on them, as there is on other reference numbers that appear throughout the interface.

Recogito 

Originally, I wanted to look at a Russian text in the original language, so I used an excerpt of Dostoevsky’s The Double [Двойник] to test whether that would be possible. When I tagged the protagonist’s name (Yakov Petrovich Golyadkin [Яков Петрович Голядкин]), the interface recognized another appearance of the term and tagged it as a name as well. However, one occurrence of the name was not recognized: the last name in the genitive case (Golyadkina [Голядкина]). I’m unsure of exactly how the technology works, but it seems that patterns are matched through exact matches of character strings rather than through coreference resolution, where terms that differ in spelling yet reference the same entity (such as a character’s name and nickname) can be recognized as one. I also wondered whether each tag has some way of marking that it is a repeated reference to a single entity: does each tagged occurrence of “Yakov Petrovich Golyadkin” internally recognize that it references the same named entity (perhaps through an ID number)?

Because of these case changes, I decided to look at a Russian text in English instead, as place names also change through cases. For example, the sentence “I live in Saint Petersburg” would be “Я живу в Санкт-Петербурге,” with the name of the city in the prepositional case, differing from the nominative form, Санкт-Петербург. In place-dense texts, this would present problems as characters move to and from cities, as well as attribute things to cities, changing the case of the word. I wanted to make sure that city names differing only in case would not be recognized as distinct entities; since this problem isn’t present in English, I went with an English translation instead.
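
To illustrate why exact string matching misses these case forms, here is a minimal sketch using pymorphy2, a widely used Russian morphological analyzer (I don’t know what Recogito actually uses internally, so this is purely illustrative):

```python
# Minimal sketch: exact string matching vs. lemmatization for Russian case forms.
# pymorphy2 is one widely used Russian morphological analyzer; I don't know what
# Recogito uses internally, so this is only an illustration.
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# The same city in nominative, prepositional, and genitive case, as in
# "Я живу в Санкт-Петербурге" vs. the dictionary form.
forms = ["Петербург", "Петербурге", "Петербурга"]

# Exact matching treats each inflected form as a distinct string:
print(len(set(forms)))  # 3 "distinct" place names

# Lemmatization first collapses each form to a shared dictionary form:
lemmas = {morph.parse(form)[0].normal_form for form in forms}
print(lemmas)  # expected: {'петербург'} -- one place
```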

For the sake of tagging repeatable place names that can be recognized as such through consistent spelling, I looked at Tolstoy’s War & Peace [Война и мир], a fictional account of the lives of three central families during the Napoleonic Wars (1803–1815). I chose this novel in particular for its long passages describing battles and battlefields, as well as the movement of forces across space and place. For the tagging, I looked at a chapter describing the events following the Battle of Borodino, tagging the names of people and places.

I then looked at the “Summary” pie chart, which stated that there were 38 annotations: 14 people and 24 places. It would be interesting to see a breakdown of this information: which places are referenced most often, and which people? This also relates to my question about IDs for the tags: are repeated references recognized as distinct, or as related to the same named entity? In the document, there are 6 distinct place names, which occur a total of 24 times. There is a distinction to be made between 24 place occurrences and 24 places; I would be interested to see a count of the occurrences of each distinct place, separate from the more general “places that have been counted in the document” view. However, I think this is an interesting tool that is well designed and accessible, although I want to know more about the underlying technology.
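
To make the occurrences-versus-places distinction concrete, here is a minimal sketch, with hypothetical tags standing in for an export of my 24 place annotations:

```python
# Sketch: place *occurrences* vs. distinct *places*.
# The tag list is hypothetical, standing in for an export of the annotations.
from collections import Counter

place_tags = ["Moscow", "Borodino", "Moscow", "Mozhaysk", "Borodino", "Moscow"]

counts = Counter(place_tags)
print(len(place_tags))       # total occurrences (6 here; 24 in my document)
print(len(counts))           # distinct places (3 here; 6 in my document)
print(counts.most_common())  # the per-place breakdown I'd like to see
```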

Word Embeddings

I chose to look at the topic of word embeddings, as it was and still is a popular method for the semantic analysis of texts. As output, word embeddings return words in a corpus that are used in similar or interchangeable contexts. Last time, I looked at documents within the field of Slavic languages & literatures; however, it was difficult to find texts prominently used in the field, as Web of Science caters to article publications, typically within STEM. So, I decided to look at a topic within computer science that still relates to text (and sometimes literature). Also, since word embeddings are still relatively “new” within computer science (the most prominent algorithm, word2vec, was developed in 2013), there weren’t too many articles I needed to download. In total, the search returned 847 articles.
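
As a quick illustration of that output, here is a toy sketch using gensim’s implementation of word2vec; the corpus is far too small for meaningful vectors, so this only shows the shape of the result:

```python
# Toy sketch of what word embeddings "return": words used in similar contexts.
# Assumes gensim >= 4.0; the corpus is far too small to learn anything real.
from gensim.models import Word2Vec

sentences = [
    ["the", "model", "learns", "word", "vectors"],
    ["the", "model", "learns", "token", "vectors"],
    ["embeddings", "place", "similar", "words", "near", "each", "other"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

# Words appearing in interchangeable contexts should receive similar vectors:
print(model.wv.most_similar("word", topn=3))
```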

The first network I generated followed the directions in the tutorial (binary count, title and abstract as text data). Prominent topics were “language,” “recurrent neural network,” “convolutional neural network,” and “sentiment analysis.” I then changed to full count to see the difference in clustered groups. With full count, specific languages such as Spanish and Hindi were included as nodes in the network, as were topics such as social media. It seems that binary count gave a broader overview of general technical terms related to the algorithm, while full count included topics more specific to each article’s research area. (I tried to upload images, but the file size was too large, and attempts to compress the images didn’t really work out.)
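
My understanding of the difference between the two counting methods, as a sketch (this mirrors the general idea, not the tool’s actual implementation):

```python
# Sketch of binary vs. full counting of a term across documents.
# This reflects my understanding of the two options, not the tool's code.
docs = [
    "word embeddings for sentiment analysis of word embeddings",
    "recurrent neural network language model",
    "word embeddings in Hindi and Spanish social media",
]
term = "word embeddings"

binary_count = sum(1 for d in docs if term in d)  # docs containing the term
full_count = sum(d.count(term) for d in docs)     # every occurrence counted

print(binary_count)  # 2
print(full_count)    # 3
```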

My choice of data limits me to a specific research tool within a domain; moreover, it is limited to the term “word embeddings,” which refers only to the output of an algorithm, not the algorithm itself. If I wanted to look at the algorithm word2vec specifically, results could differ, as searching “word2vec” as a topic on Web of Science returns 325 articles, as opposed to the results for “word embeddings.” A network anchored on a specific algorithm might include more specific terms as nodes. I could also eliminate the most frequent, least content-specific terms from the data before generating the network (a common technique in CS to reduce noise). Overall, this approach produces a more general picture of an academic domain, which can be useful; however, I wouldn’t say that the networks produced speak to the full usage of word embeddings in research. This is probably due to the somewhat limited scope of articles in the database.
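
A minimal sketch of that noise-reduction step, with hypothetical term frequencies and an arbitrary cutoff:

```python
# Sketch: dropping the most frequent, least content-specific terms
# before building a network -- a common noise-reduction step.
# The terms and counts are made up for illustration.
from collections import Counter

terms = ["model"] * 40 + ["method"] * 35 + ["sentiment analysis"] * 8 + ["skip-gram"] * 5

freq = Counter(terms)
top_k = 2  # arbitrary cutoff for "most frequent" terms
stoplist = {t for t, _ in freq.most_common(top_k)}

kept = [t for t in terms if t not in stoplist]
print(sorted(set(kept)))  # ['sentiment analysis', 'skip-gram']
```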

Boris Eikhenbaum’s “How Gogol’s Overcoat Was Made”

I chose to look at citations of Boris Eikhenbaum’s “How Gogol’s Overcoat Was Made,” as it is used in many courses and articles to discuss Gogol’s narrative technique of skaz, from the Russian verb skazat’ (to tell), which aims to form a written language that mirrors oral storytelling. Because the work was originally written in Russian, two instances of the work appeared: “How Gogol’s Overcoat IS Made” and “How Gogol’s Overcoat WAS Made.” The first is translated incorrectly, as the verb “to make” is in the past tense in the original title; however, both reference the same work, so I included them both in the citation search. Overall, the total number of citations was 16, across 6 articles, with the work cited 1.23 times per year. The English translation was first made available in the 1960s, so it was surprising to see so few citations returned, as I have seen the work cited in many articles; this could be because specific translations are cited and delineated as separate entities.

Unsurprisingly, everyone who cited the article was a Slavist. Boris Eikhenbaum was a prominent Russian Formalist, and this work is uniformly used to teach Gogol’s prose; moreover, when writing on Gogol, it is very common to discuss the mimetic, speech-like qualities of his works, which Eikhenbaum’s essay foregrounds. When I performed the basic search to see all of Eikhenbaum’s works, I was a little confused, as the citation search had returned many more of his works. There might be some distinction I’m missing between the basic and citation search (do the citation results include works cited by articles indexed in Web of Science, while the basic search returns only those of Eikhenbaum’s works that are themselves in the database?). The basic search returned 7 of Eikhenbaum’s works, with an H-index of 1 and an average of 0.29 citations per item. I’m unsure which number would be better to report, as neither looks very appealing for evaluations or tenure.

When working on this analysis, I saw that it can have utility for certain fields; however, for my field of Slavic Languages and Literatures, the transliterations of authors’ names and translations of works’ titles made it harder to narrow the search down, as one person can have many different names in the database. For example, I originally wanted to look at Yuri Lotman’s work in Russian Formalism; however, his name appeared as Iuri Lotman, Iurii Lotman, Juri Lotman, Yuri Lotman, etc., so finding a way to aggregate the citations was made more difficult. Overall, though, I think this is a useful tool for seeing when works are cited and by whom, and if I had looked at influential computer science articles from the U.S., the search might have been more streamlined.
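
One rough way to aggregate such transliteration variants would be fuzzy string matching; here is a sketch using Python’s difflib (real bibliometric deduplication would need something more robust):

```python
# Sketch: grouping transliteration variants of one author's name.
# difflib's ratio is a crude similarity measure; actual deduplication of
# bibliographic records would need more careful handling.
from difflib import SequenceMatcher

variants = ["Iuri Lotman", "Iurii Lotman", "Juri Lotman", "Yuri Lotman"]
canonical = "Yuri Lotman"

for name in variants:
    score = SequenceMatcher(None, name.lower(), canonical.lower()).ratio()
    print(f"{name}: {score:.2f}")  # all variants score well above, say, 0.8
```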

Gender Inequality in Software Engineering

The dimension of gender inequality I was most interested in was the representation of women in the field of software engineering. This is a particularly important area for me, as I studied computer science in college and worked as a software engineer at an education software company during my third year. When I was hired, I was told, “congratulations, you’re number four!”, alluding to the fact that I was the fourth woman hired as a software engineer in the company’s more than ten years of existence. It was a local company, serving most school districts in Florida (and Texas, for some reason), so the scale at which it hires is much smaller than that of a traditional tech company like Google; however, the lack of representation of women in software continues at the largest scales as well.

For example, Google’s “Diversity” site (https://diversity.google/annual-report/) shows that the company has congratulated itself on diversifying its teams, yet there is still a disparity between the representation of women in tech and non-tech positions (22.9% tech, 47.9% non-tech, globally in 2018); this can probably be linked to the tendency of companies to tell women applying for tech positions that they can “work their way up” from a non-tech to a tech position, eventually being hired into the original tech position for which they applied. The disparity looks even worse when you look at the representation of other groups, particularly non-white women. Of all tech employees hired at Google (in the United States) in 2018, only 0.8% were Black+ women, 1.4% Latinx+ women, and 0.3% Native American+ women; the women with the most representation were Asian+ and White+ women, at 15.9% and 10.3%, respectively.

However, this is just one company, and the diversity statistics it releases are quite opaque: it would be highly valuable to know how long women stay in these tech positions once hired, as well as which positions they are actually filling; by keeping its data broad, Google is able to report greater diversity than might actually exist. I tried to find data specifically related to software engineering globally, but it seems to be hard to find, which makes sense: I’m sure most tech companies do not want the lack of representation on their software engineering teams to be made public in an easily accessible way. I did find some global data breaking down the representation of women with software engineering skills in different sectors of tech, found here. The data is taken from LinkedIn, which limits its scope, as it assumes that employment representation on LinkedIn parallels reality, which it likely does not; still, I found it interesting, as it was one of the only resources I could find that attempts to show the number of women in different employment sectors related to software engineering. Moreover, the data is from 2013, so it is a bit dated, especially considering the growth of tech jobs as a whole in the years since.

I have chosen to focus on software engineering in particular, as it is a job role found at most tech companies and, interestingly enough, was pioneered by women, when the task of punching code cards was seen as lesser than the work men were doing in hardware. I was not surprised to find a general lack of information, especially as representation is rarely spoken about in the field, and when it is, companies and management tend to get defensive when faced with explaining why there are more men named Matt than women among their software engineers. Many will argue that this is due to a lack of women in computer science departments, or of women who are even interested in software engineering (you do not need a computer science degree to be a software engineer), but this is related to further problems with the rhetoric of tech as a whole.

So, it might be interesting to also look at the representation of women in computer science departments, as well as whether these women, once they have graduated, stay in or leave the field. The data I could find relating to computer science was limited to the United States, where women earn 18% of computer science bachelor’s degrees. Computer science, software engineering, and tech more broadly can be hostile environments for women due to the existing composition of the fields, so even when a woman works in one of these spaces, how long she stays can depend heavily on the workplace experience (though this can be said of many fields). Overall, it seems hard to find gender disparity data for specific careers, though it would be highly useful to see the actual composition of tech jobs within companies, not just a broad view that allows an aggregate of positions to represent a higher, though still low, amount of gender diversity.

Measurements of Equality

I unfortunately missed class last week due to a cold, so I can only draw on information presented in the readings. Across all of the readings, I found the idea of equality as a performative measure most intriguing. In a political landscape in which equality is increasingly emphasized, there seems to be a trend toward focusing on the broad category of “women,” as opposed to the stratified nature of equality within the gender (i.e., if more women are advancing into positions of political power, yet aren’t using that power to advocate for the interests of other women, especially those with less privilege, is it really a step forward for gender equality?). I was unfamiliar with the statistical complexities of measuring gender inequality (particularly the motivations behind using one measure over another); however, coming from a data-intensive background, I was not surprised that these statistical measures can be heavily affected by data that misrepresents the actual composition of a region. If people were viewed less as mathematical objects and more as humans, the data might not ignore large sectors of the population in favor of presenting a region as more equal than others.

Although I had suspected that organizations purporting to advocate for equality only really advocate for the advancement of a select few, it was interesting to look at the economic, social, and political factors that influence not just the discussion around gender inequality, but also the policies put in place to increase measured equality. Overall, it might be better if organizations understood more of the complexity that the discussion surrounding gender entails and relied less on equality as a buzzword that makes them appear as allies to women while they continue to perpetuate inequality through their data and policy curation practices.

Emma’s Intro

Hello, my name is Emma, and I am a first-year in the Slavic PhD program. I graduated from New College of Florida in 2019 with a joint bachelor’s degree in Computer Science and Russian Language & Literature, so the work that I do lies at the intersection of these two areas, using natural language processing techniques to extract information from nineteenth-century Russian literature. My past work in this area quantified the semantic similarity between music and sexuality in Tolstoy’s The Kreutzer Sonata, using the word2vec algorithm to generate word embeddings (vector representations of tokens within a work) and then measuring the cosine similarity between these vectors: words whose vector representations have the smallest angle between them are the most similar and appear in similar contexts in a work.
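
The cosine similarity measure itself is simple to compute; here is a minimal numpy sketch, with made-up vectors standing in for trained embeddings:

```python
# Sketch: cosine similarity between two word vectors.
# The vectors are made up -- real ones would come from a trained word2vec model.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v: values near 1.0 mean similar direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

music = np.array([0.8, 0.1, 0.3])    # stand-in embedding for "music"
passion = np.array([0.7, 0.2, 0.4])  # stand-in embedding for a related word
print(cosine_similarity(music, passion))  # close to 1 -> similar contexts
```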

Currently, I’m working on an implementation of Named Entity Recognition (NER), a technique that takes a text as input and outputs a tagged version in which the names of characters, places, etc. are identified as such. NER models are typically trained on data such as news stories, tweets, and Wikipedia articles; however, the naming patterns that appear in text of this kind are distinct from those that occur in literature, which can take on a more nested form. The implementation of NER I’m working on is trained on a corpus of Russian literature in the original, so that when it is tested on other works of Russian literature, it can pick up on the syntactic forms distinct to text of this type.
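
For a sense of what NER output looks like, here is a sketch using spaCy’s small off-the-shelf English pipeline, which is trained on exactly the news-like data described above (my own implementation, trained on Russian literature, differs):

```python
# Sketch of what NER output looks like, using spaCy's small English pipeline.
# en_core_web_sm is trained on news-like text (the mismatch described above);
# my own implementation is trained on Russian literature instead.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

doc = nlp("Yakov Petrovich Golyadkin hurried along Nevsky Prospect in Petersburg.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, GPE -- if the model catches them
```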

I’m taking this course due to my general interest in the digital humanities, as well as a desire to further distance my methodology from what has become standard in the field of computer science, where quantitative rigor is given precedence over philological interaction with the source text. Similarly, my goal for the course is to familiarize myself with various critical standpoints in the digital humanities across fields, not just in work with text, to better understand the implications of digital work with other forms of data.