Digitization and Decontextualization

The conversations and readings of the past week, while revealing the ways in which the digitization of textual data and text analysis can be extremely helpful to historical research, nonetheless show the many limitations of text analysis and, as Lara Putnam noted in Sharon Leon’s “The Peril and Promise of Historians as Data Creators: Perspective, Structure, and the Problem of Representation,” the dangers present in the “decontextualization of data.” This is also reflected in Jo Guldi’s assessment of the analysis of keywords in the transcripts of parliamentary debates, which reveals only the dynamics of legislation within the British Isles, not the British Empire as a whole. Furthermore, as John Markoff noted concerning his study of French parish cahiers, the regional languages of France in 1789, as well as the different terms that could be employed for the same aristocratic privileges, rendered personal, specialized analysis indispensable to the study, even though it took on a digital medium. In light of these analyses, it is questionable whether human analysis can ever be replaced by digital text analysis, as certain elements of data could be overlooked if not considered within a larger base of information.
This is an issue that I come across in my own research, which primarily engages with documents written by French officials in New France for either the Governor-General or the Ministry of the Marine. While I was unaware of this when I first began my research, certain elements of a text can reveal important details for interpreting the documents. This is particularly the case with handwriting, as well as with documents that bear two different dates: one is the year in which the document was received by the Ministry of the Marine, whereas the other is likely the day on which it was written. The only way to distinguish the two dates, and to establish the correct date of the document, is through an analysis of handwriting. Furthermore, handwriting can also reveal which officials had a secretary. More importantly, personal handwriting can indicate that an official who would otherwise have had a secretary at his disposal was in a location where bringing one was impossible. This was the case with Charles le Moyne de Longueuil, who was the Governor of Montreal and simultaneously maintained a residence in the Onondaga Nation. An analysis of the handwriting in his letters to the Governor-General of New France or the Ministry of the Marine can therefore show where he was when he wrote them, as he would have had a secretary in Montreal.
In light of the readings and the conversations that followed in class, I wonder whether these nuances, which can be pivotal to the correct interpretation of a document, can be captured through digital text analysis alone. And if they were captured, would other information present in the document and its composition be overlooked? In this capacity, can text analysis become a reliable means of evaluating (or at least summarizing) archival sources without human intervention?

Online versus Physical Space: Are they really that different?

With the rise of text-based search queries and online databases, I wonder how text searches and their algorithms might create unpredictable new modes of research. In a physical archive, a researcher can use a finding aid and seek help from an archivist to create a clear game plan, then go through a collection with precision and meticulous form. With an online database, a historian will often plug in keywords and set limits on the search. The researcher is no longer digging through boxes and folders to uncover data; instead, they are opening curated data that an algorithm decided was most relevant. Does this negatively impact the discovery of new sources? There have been countless times when a researcher, while looking for another source, discovered new and exciting information within a folder or box. Because text-based searches pull up only content curated from your search, will these random yet important discoveries occur less often? Are historians and researchers missing out on important data by using online databases?

In some instances, I have found that this question goes both ways. I browse online databases often, usually examining newspaper articles. When I click on an individual newspaper article, the sidebar of the page displays an algorithmically produced list of suggested primary sources that the site thinks are related to my latest click. From these suggestions, I have uncovered important articles and primary sources that would not otherwise have appeared in a search. The system and UX recommended these links as if replicating an archivist in a physical space who takes an interest in my work and offers suggestions on where else to look. I often wonder whether I would have found these articles in a physical archive. They were not directly connected to what I was looking for, but after browsing through them, these articles offered valuable data and contextualization for the events I am researching. In rare cases, the suggested links turned out to be consequential, excellent finds that directly influence my argument. My biggest concern, however, is that these suggested links were based primarily on my keyword search and the sources I clicked on once I received my results. If I had used a different keyword, would these articles ever have been suggested to me? In short, the online archive is increasingly becoming a mainstay of historical research, and I believe it can bring a sense of the discovery that historians clamor for in physical spaces. Yet the inconsistency of these discoveries, and the inability to replicate them because of the algorithms involved, remains a problem for historians. I do question, though, whether this is any different from the data we historians never come across in a physical archive because of the subjectivity of where an archivist files and stratifies sources. How is it any different from valuable data randomly landing in boxes we would never look through because of the subjectivity of archivists and finding aids?
As such, online databases and physical archives offer vastly different experiences. But are the problems that come with them really so different?

How Might This Apply to Art History?

The theme of discussion and the readings for last week have brought about two questions for me. First, John Markoff’s cahiers case study illustrates a method for overcoming gaps in archival records, but how might this apply to ephemeral art objects that are no longer extant? Second, how might art historians who prioritize objects apply critical search, as outlined by Guldi, to their search for images and objects to support research?

Both of these questions emerge from a tension that I have been experiencing and a larger question that I have been asking myself throughout the semester: as an art historian, what is my “data”? It would be easy to set aside the visual and material objects that I am working with and say that my data is the primary and secondary sources that I use to frame my narrative or interpretation of the objects. But that answer seems, at the very least, incomplete, because I use those sources only as supporting evidence for what I am seeing in the objects themselves, and I often use other artworks alongside the textual sources to support my claims. The characteristics of the visual and material objects are as much my data as the textual sources I deal with.

This is a particularly challenging reality when the objects that would be useful for supporting my claims no longer exist because textual descriptions of these objects often cannot articulate precisely enough the formal characteristics of the art object to support visual claims. Ekphrasis is often useful for establishing the existence of an object, but knowledge of its existence is insufficient for many visual arguments. This contrasts with the case study presented by John Markoff in which knowledge of the existence of cahiers was enough to begin to make larger historical claims.

In my own research, I often come up against this when dealing with Eucharistic devotion in 16th- and 17th-century Rome. During the period, Eucharistic devotion often centered on what are described as elaborate and theatrical stages on which a monstrance was placed to display the Eucharist for worshippers. These stages, though often designed by leading artists of the period, were ephemeral: they were dismantled once the event was over. From the textual descriptions, I can make historical claims about these ephemeral works. I can, however, say very little art historically about them because I can’t see them. When drawings or prints illustrating these backdrops are extant, I can then begin to make visual arguments and hypothesize about how they might relate visually to other artworks.

The second question emerged for me from the Guldi reading because I often find it quite time-consuming to locate images (or, more generally, art objects) that support the arguments of my research. These objects (data) exist but often cannot be found with a quick search. Part of the reason, in my case, is that I deal largely with artwork in situ or in smaller architectural spaces, such as chapels. For my current research project, I am finding that I actually have to find interior pictures of each of the Baroque churches in Rome and virtually walk through them to find the data I need. It would be amazing to apply a sort of Guldian critical search method to find these objects and supporting material, but I can’t quite envision how that might work. I wonder if I might open this up to the class to see if others, especially those who deal with material culture, have thought about how critical search might be adapted for visual sources. Any thoughts?

Broadening data, iterative searching, and uncovering capta

I appreciated thinking about data from a different point of view—that of an historian. Specifically, I appreciated the way the readings pushed us to think about how things that might not normally be seen as “data” (like the way an historian combs through an archive, takes notes, organizes quotes, makes analytical leaps, and writes it all up) are indeed forms of data collection, aggregation, and dissemination. I find this important because people generally privilege certain forms of data over others. Numbers are more convincing than words because, as the thinking goes, numbers are “truth” and words are constructed. Opening up the definition of data shows that all of it is, in some sense, constructed. Making informed and careful choices about how data is collected, aggregated, and disseminated matters whether you are dealing with words or numbers.

I was recently writing up some findings from a study I did a few years back in which we looked at first-year students’ information behaviors. We showed students a variety of articles and asked them how credible the information found therein was. Students were very convinced by articles that had graphs, statistics, and any form of numerical underpinning—whether or not they were corroborated, well researched, or well written. This, to me, shows the alarming way in which young people (well…lots of people, not just young people) trust numbers without a great deal of criticality.

Another vein I enjoyed discussing was the different ways to envision searching environments. Indeed, as someone who comes from a library background, I was always trying to get students to adopt a more open, exploratory posture toward their information consumption and their research methods. The Guldi reading was especially interesting because it created a counter-narrative to so many student database searches I’ve seen over the years. These searches generally go something like this: a student comes in with an argument in mind and wants to find “data” (read: quotes) to support it. There is no discovery, browsing, or curiosity in this research method. Guldi’s critical search model, by contrast, sees the researcher interacting with research material in a much more iterative manner. Such discovery, I think, is very important for us to cultivate in today’s information age.

Finally, I’ve been thinking a great deal about Drucker’s concept of “capta” as I work on my final project. I am looking at the publicly available Covid-19 dataset hosted on Semantic Scholar and created by the Allen Institute for AI, and at how Kaggle has gamified the dataset in an effort to create NLP solutions for exploring such a large body of academic research. I’m trying to figure out how the data was aggregated and am having a very hard time doing so. To me, cutting the human hand out of all of this makes the data seem sterile, as if it simply exists. But the critical thinker in me knows that someone’s hand was there—and I want to find out what decisions that hand made when it put the set together. How was the data taken, molded, and created for data scientists to work with?

Scanning for Pleasure

I am thinking through Jo Guldi’s article about “critical search” and bringing in my memories from her talk here at Pitt in January, titled “A Distant Reading of Property: Topic Models, Divergence, Collocation, and Other Text-Mining Strategies to Understand a Modern Intellectual Revolution in the Archives,” which dove further into her research on British Parliamentary papers and tenant issues. For my research, I am reading the newspaper Lampião da Esquina, a monthly publication in Brazil from 1978 to 1981 produced for and by gay people. Grupo Dignidade, an NGO that advocates for LGBTQ Brazilians, scanned the individual editions of Lampião in Brazil (date unknown). I mention this to say that I do not have the physical copies of Lampião and did not scan them myself; I am working only with what I found online.

The corpus consists of 35 documents and, according to Voyant, contains just over 1.1 million words. The scanned PDFs were run through an OCR program, which allows me to search for keywords. Similar to Guldi’s search for the term “tenant” and its usage, I am interested in how the text of Lampião uses “pleasure” (prazer).[1] Performing a keyword search for prazer across the entire corpus lets me see how popular the term is over the span of the newspaper’s run, and which issues have a particularly high frequency. For example, running a keyword search in Adobe returns 304 instances of the word prazer. That count, however, includes only the instances the program can read; certainly some usages of prazer escape the search because of poor scanning, low resolution, or non-standard typefaces.
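This per-issue counting could also be scripted rather than done issue by issue in Adobe. The sketch below is a minimal, hypothetical illustration: it assumes the OCR layer of each edition has been exported to a plain-text file, and the directory name and helper functions are my own inventions, not part of any tool mentioned above.

```python
import re
from pathlib import Path

def count_term(text: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of a term in a text."""
    return len(re.findall(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE))

def term_frequency_by_issue(corpus_dir: str, term: str) -> dict:
    """Map each issue's file name to its raw count of the term."""
    counts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        counts[path.name] = count_term(path.read_text(encoding="utf-8"), term)
    return counts

# Hypothetical usage: which issues mention "prazer" most often?
# freqs = term_frequency_by_issue("lampiao_ocr/", "prazer")
# top_issues = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:5]
```

Note that, like the Adobe search, this only finds what the OCR managed to read; it cannot recover words lost to poor scans.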

How can I incorporate Guldi’s “critical search” in my research on gay identity and publications like Lampião? Regarding “seeding,” I came to Lampião after conducting a broad internet keyword search for “gay rights Brazil” (or something similar). Several results indicated that Lampião was the first nationally distributed publication and was foundational in establishing a national movement. Indeed, many monographs on the topic also argue for Lampião’s importance. I may be able to “broadly winnow” the corpus by identifying which editions engage most frequently with the term prazer. Later, through “guided reading,” I hope to begin to consider ways to make contributions to the field in general.

Conducting preliminary “critical searches” on prazer in Lampião has led me to further questions. Why was there such a large spike in the use of the word in late 1980? When is prazer invoked, in what context, and by whom? What do the writers mean by prazer? What about other similar words like desire (desejo), happiness (alegria), satisfaction (satisfação), or enjoyment (gozo)—why specifically prazer? How, if at all, do the publications of other contemporaneous social movements (like the Black consciousness movement, the labor/socialist movement, or environmentalists) use prazer? I anticipate that applying methods addressed in Guldi’s and others’ publications from the semester will help me identify key moments and actors for further research.
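One caution about that late-1980 spike: a longer issue will naturally contain more raw hits, so normalizing by each issue's length gives a fairer comparison across editions. A minimal sketch, assuming the OCR text of an issue is available as a plain string (the function name is my own):

```python
def relative_frequency(text: str, term: str) -> float:
    """Occurrences of a term per 1,000 words of text (case-insensitive,
    stripping common punctuation from word edges)."""
    words = text.split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip('.,;:!?"()').lower() == term.lower())
    return 1000 * hits / len(words)
```

If the normalized frequency still spikes in late 1980, the increase reflects a real shift in the newspaper's language rather than simply longer issues.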

[1] In November of 1978 Lampião da Esquina (Lamp on the Street Corner) introduced a new subtitle – “Lampião discusses the only topic still taboo in Brazil: pleasure.”

Archives, Data, and Historical Research

For this blog post, I’d like to follow Alison’s suggestion to think about how the creation and structuring of data relates to our final projects for this course, drawing on themes from our readings, our class discussion, and my work with Tropy for my final project.

An advantage of Tropy, as I see it, is that it allows researchers to organize their images of archival material and to pair those images with metadata describing the identifying features of each document. This organizing and pairing has been useful for my research, as my current method of storing images of archival material—simply using folders in File Explorer for large groups of images—has made accessing these images, and information about the archival material they represent, a labor-intensive task. Because it would be unwieldy, if not impossible, to include all of a document’s metadata in the image file’s name, I’ve kept this descriptive information as handwritten notes in a research notebook. As a result, my image files are organized in the order in which I took the photographs, which in turn reflects the order of the documents in individual box folders in the archive. I suppose I could create File Explorer folders to correspond to archival box folders, but the idea of separating my images this way, and making it even more difficult to navigate from a document in one folder to a document in another, hasn’t appealed to me.

While it has been satisfying to pair my images with documents’ metadata using Tropy, doing so has meant working with Tropy’s existing templates for structuring item-level metadata. One of Tropy’s three templates is Tropy Correspondence, which includes fields for a document’s recipient as well as its author. Of the items I have added to Tropy, nine are letters that were either composed and typed by a secretary or typed by a secretary from notes or dictation. Should the secretary be considered the author of these letters? Would the secretary’s potential authorship depend on whether they composed the letter themselves or typed it from notes left or dictation given? And how much of this information could be determined by a researcher removed from the circumstances in which an archival document was created? My approach to these letters has been to indicate in the item’s title that the letter was sent by the secretary, but to record the secretary’s employer as the author so that I can use the author field to sort items.
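To make that sorting convention concrete, here is a minimal sketch of the idea using invented names and plain records; this is not Tropy's actual data model, only an illustration of why putting the employer in the author field is useful for sorting.

```python
# Hypothetical item records mirroring the convention described above:
# the secretary appears in the title, the employer in the author field.
letters = [
    {"title": "Letter sent by secretary M. Jones", "author": "Director C. Adams",
     "recipient": "Regional Office", "date": "1920-11-17"},
    {"title": "Letter sent by secretary J. Smith", "author": "Director A. Brown",
     "recipient": "Board of Trustees", "date": "1921-03-04"},
]

# Sorting on the author field groups each employer's correspondence together,
# regardless of which secretary actually typed the letter.
by_author = sorted(letters, key=lambda item: item["author"])
```

Had the secretary been recorded as the author instead, the same sort would scatter one official's correspondence across several names.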

I’ve also been thinking about the question that John Markoff posed in relation to the cahiers project of how well an existing corpus of material in an archive or archives reflects or represents the total amount of material written historically. In some instances, the loss of material may be evident—for example, a letter may refer to a telegram received by the letter’s author, but that telegram may not be preserved alongside the letter in an archive. In other instances, researchers may have to work toward identifying documents that have not been preserved and make assumptions based on the information available in surviving documents. While this may prove frustrating for researchers, noting instances of absence and loss may prompt critical reflection on archives as constructed repositories of information and as resources for historical research.