Broadening data, iterative searching, and uncovering capta

I appreciated thinking about data from a different point of view—that of a historian. Specifically, I appreciated the way the readings pushed us to think about how things that might not normally be seen as “data” (like the way a historian combs through an archive, takes notes, organizes quotes, makes analytical leaps, and writes it all up) are indeed forms of data collection, aggregation, and dissemination. The reason I find this important is that people generally privilege certain forms of data over others. Numbers are more convincing than words because, as the thinking goes, numbers are “truth” and words are constructed. What opening up the definition of data does is show that all of this is, in some sense, constructed. Making informed and careful choices about how data is collected, aggregated, and disseminated matters whether you are dealing with words or numbers.

I was recently writing up some findings from a study I did a few years back in which we looked at first-year students’ information behaviors. We showed students a variety of articles and asked them how credible the information found therein was. Students were very convinced by articles that had graphs, statistics, and any form of numerical underpinning—whether or not they were corroborated, well researched, or well written. This, to me, shows the alarming way in which young people (well…lots of people, not just young people) trust numbers without a great deal of criticality.

Another vein I enjoyed discussing was the different ways to envision searching environments. Indeed, as someone who comes from a library background, I was always trying to get students to adopt a more open, exploratory posture in relation to their information consumption and their research methods. The Guldi reading was especially interesting because it created a counter-narrative to so many student database searches that I’ve seen over the years. These searches generally go something like this: a student comes in, they have an argument in mind, and they want to find “data” (read: quotes) that support it. There is no discovery, browsing, or curiosity in this research method. Guldi’s critical search model, by contrast, sees the researcher interacting with research material in a much more iterative manner. Such discovery, I think, is very important to cultivate in today’s information age.

Finally, I’ve been thinking a great deal about Drucker’s concept of “capta” as I work on my final project. I am looking at the publicly available COVID-19 Open Research Dataset (CORD-19), hosted on Semantic Scholar and created by the Allen Institute for AI, and at how Kaggle has gamified the dataset in an effort to create NLP solutions for exploring such a large body of academic research. I’m trying to figure out how the data was aggregated and am having a very hard time doing so. To me, cutting out the human hand in all of this makes the data seem sterile, like it just exists. But the critical thinker in me knows that someone’s hand was there—and I want to find out what decisions that hand made when it was putting together the set. How was the data taken, molded, and created for data scientists to work with?
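One place I’ve started looking for those decisions is the dataset’s own metadata file. Below is a minimal sketch of that digging, assuming the metadata.csv layout from the CORD-19 release I downloaded (column names like source_x and license are my assumption and may differ between versions):

    import pandas as pd

    # Load the metadata file distributed alongside the CORD-19 full text.
    meta = pd.read_csv("metadata.csv", low_memory=False)

    # Which sources did the curators aggregate from, and in what proportions?
    print(meta["source_x"].value_counts())

    # Licensing choices are another trace of the human hand in the set.
    print(meta["license"].value_counts())

    # Gaps are decisions too: how many records lack an abstract or a DOI?
    print(meta[["abstract", "doi"]].isna().sum())

Even these simple counts surface choices someone made about which publishers to include and under what terms.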

Online Labor and Covid-19

I’m musing today about the way that Covid-19 will ultimately affect the economy and how technology, and automation, will play a part in that. Part of what scares me about shutting down all “non-essential” business (though I completely agree that we should do this, because human life > money) is that what is deemed “non-essential” may very well not reappear, especially if we find technological workarounds, depending on how long this distancing lasts. Furthermore, with more and more people working from home (though we should add the caveat that there are class, gender, racial, and socioeconomic factors involved in who “gets” to stay home, and who “gets” to be safe), I ultimately wonder how this distancing will render more work devalued and hidden (I’m echoing Gray’s words here). How will employment models change as the economy changes? How will labor structures change as we decide what is essential, and who fills in the employment crevices between tech and humanity?

In terms of tech companies, Gillespie suggests that “platforms do not just mediate public discourse: they constitute it” (199). I’m thinking of the new Covid-19 button on Facebook that you can press to get curated “news” about the virus. Facebook does not make it obvious how such content is curated, who is in charge of that curation, and what the company gains from giving you access to this information. I’m thinking of a friend who recently told me that she didn’t watch the news anymore because Twitter told her all she needed to know about Covid-19. I’m thinking about Gillespie’s observation that “the public problems we now face are old information challenges paired with the affordances of the platforms they exploit: . . . misinformation buoyed by its algorithmically-calculated popularity” (200).

It seems that information about Covid-19 is spreading faster than the virus itself—with interesting labor implications following in its wake. I’ll talk about this more in class next week, but I’m thinking of switching my final project to something along these lines. I’ve recently become aware of Kaggle, a data science community that is running competitions for folks to come up with data models regarding Covid-19. They are making datasets available for anyone who wants to use machine learning to respond to their call—and offering financial incentives to do so. I wonder what they gain from such crowdsourcing (in the guise of “helping the community”). Indeed, deep learning may very well help to come up with “answers” to issues related to Covid-19. However, I genuinely wonder how such answers will be monetized and who will benefit from that monetization.

In any case, the Covid-19 pandemic is pulling the veneer off of many things that our society struggles with—socially, economically, and informationally. Consider the ways that misinformation has spread (can I take ibuprofen if I think I have coronavirus?) and the question of who even has access to information in the new home offices to which many of us are relegated.

A Convergence of Worlds

Curry writes that “against the background of this rereading of the concepts of space and place, much that occurs today turns out to be a matter of place, not space. In fact, the concept of space typically operates either metaphorically or reflectively” (680). Given Curry’s discussion of the terms, “places” seem to be temporal, geographical locales, and “spaces” the metaphors and social dimensions that surround them. The Recogito tool does interesting work bridging space and place, giving users the ability to create a placial awareness that connects people, places, events, and objects against the backdrop of a map. Stories, like the one I used for my foray into the tool (Eowyn Ivey’s The Snow Child), can be retold and reconceived spatially. We change the parameters of the story, and are therefore able to analyze it from a different point of view.

As I’ve reflected on these weeks, I’ve tried to unpack how mapping relates to the work I do in composition and rhetoric. Initially I thought they had very little to do with one another. However, Curry’s discussion of topos ended up surprising me. Indeed, the first time I heard of topos was not in relation to maps but in relation to writing and rhetoric—my current course of study. I used to teach my first-year writing students about this concept and how it related to classical rhetorical notions of argumentation. Topos (singular) and topoi (plural) are related to the Aristotelian rhetorical formulas of invention: the part of composing an argument where the rhetor (the speaker or writer) comes up with their thesis in relation to their subject of discussion. “Topoi,” therefore, have been conceived of as topics. So seeing the word “topos” used in a discussion of maps was a strange convergence of worlds for me. How, I had to ask myself, do mapping arguments and mapping places compare? And, just as important, how do they diverge?

“Indeed, if the word topos itself emerged after the invention of writing, it is nonetheless useful to try to rethink the topographic against the background of verbal activities that do not involve writing. I find telling the connection between the rhetorician’s use of ‘topics’ and the use in oratory of memory systems that rely upon the construction of a memory palace. It has long been recognized that while users of Western languages are, by and large, notoriously bad at holding lists of unrelated things in memory, when those things are embedded in a narrative or associated with symbols they become far easier to remember.” (683)

Curry gives us some insight into the relationship between space and argumentation here. He notes that topos owes much of its early history to writing. I find it equally telling that when we, teachers of writing, instruct students in how to compose scholarly pieces, we often ask them to “carve out a space in the conversation” or “survey the landscape of the literature.” We seem beholden to metaphors of mapping to conceive of the way we construct arguments, forward claims, and respond to counterarguments.

Curry also hearkens back to classical rhetoric’s investment in memory (as opposed to written language) and the way that arguments were conceived of (or “mapped”) in ways that would make sense orally. The ancient Greeks were wary of writing because they worried about how it would affect memory, which they greatly valued. Contemporary scholars of reading also note, interestingly, that as we read physical books our minds remember where on the page, and where in the book, certain things happen. We “map” stories by their places and spaces. This is something that reading online confuses for contemporary students, as we are unable to remember where things are the way we can with print texts. It messes with our memories, our brains.

So ultimately, even though I didn’t expect there would be much about my area of expertise represented in these mapping weeks, I’ve been very surprised! There is a great deal of convergence, actually.

Maps and Gazetteers

I feel a little bit out of my depth with these weeks since I have very little background working with maps or gazetteers, and the work I do in general isn’t especially interested in the idea of place or space in the way maps and gazetteers are. However, our discussion of information systems and how to organize information in discoverable ways is of great interest to me. The unique problems of mapping, with the cultural, historical, and geographical concerns they raise, are a helpful way for me to see just how complicated information systems can be.

After registering for the World Historical Gazetteer (WHG), I was able to upload the sample data from the tutorial, but had a hard time playing around with it. I imagine this was due to my own lack of understanding regarding what kinds of files these are. I also played around with inserting specific pieces of data into the tool, but didn’t know enough about labeling them to feel like I was doing it correctly. Even as someone very new to this, though, I can see how researchers would be able to use this tool to “locate” their work “placially.”

When exploring the tool at large, it did feel pretty Eurocentric. I tried to search for Seattle, but wasn’t able to explore a whole lot with what was in the current version. I can imagine folks wanting to do research about the Asian population there around WWII, for instance, and that could be a really cool dataset to include in future iterations. The Eurocentricity was discussed in class last time, and I know the dataset is expanding, so I wasn’t surprised by my findings. It is really engaging to look at different cities and the links in the data that you can find.


When exploring Recogito, I uploaded a text version of one of my favorite novels, Eowyn Ivey’s The Snow Child, which takes place primarily in Alaska but also in the eastern United States, and has characters from Eastern Europe. I felt a lot more adept at using this tool since I was working with a primary text file and could label people, places, and events. I could see how this would be a really interesting way to “map” texts and visualize them interactively. I have a background in literature, and I can imagine research questions this tool would help with.

Also, as someone who has used grounded theory to code texts before, I really liked how streamlined Recogito felt. It was clean and easy to use, and it made the coding process less confusing.


Reflecting on my experience with the WHG and Recogito, I’m reminded of a strand from the reading about Google Maps and other “involuntary” citizen contributions that seem “crowdsourced” but really just use loads of user data to make the applications run. When we work with these research tools (WHG and Recogito), our contributions are voluntary and for our own benefit as researchers. However, the implication of large-scale maps like Google Maps, which rely on data from users to predict traffic patterns, for instance, is both very convenient and very scary.

From the reading:

“Services such as Flickr; tweets, blogs, and other forms of citizen-contributed text can be georeferenced, and the tracks of individuals through space and time can be captured and uploaded in numerous ways. These last may be instances of involuntary citizen contributions, since the individual may or may not be aware of the tracking. While many services are careful to allow their users to opt out of being tracked or of having their locations captured in other ways, there are many exceptions. A vehicle equipped with an automated toll payment system, for example, is logged every time it passes through a toll-gate, and such records have apparently been subject to subpoena in litigation, as evidence of an individual’s location in space and time.”

(Naming Places, Ch. 2)

How do we conceive of space and place when we are being tracked? How does our spatial awareness change when we cannot “hide” or “get off the grid”? How does that change the landscapes of our lives?

Information Literacy in VOSviewer

For this exercise I chose the topic “information literacy,” which had over 3,000 hits in Web of Science.

I was expecting there to be quite a few hits for this, so was pleased by the number. Looking through the list, the results were varied in terms of related topics and disciplines.

Importing things into VOSviewer wasn’t too bad with the tutorial’s help. Here is the first network result:

And then when I started playing around, I liked the density mapping with its different colors. The heat-map effect gave a good sense of where to spend attention. The color-coded topic clusters were also a feature I found helpful in my viewing:

What can you learn from the bibliometric network you have created?

It’s always interesting to look at how things are related. Since this is a topic I’m fairly conversant in, I would have expected to see things like “library sessions” and “library” come up more than they did. I do see that “library user” is one of the nodes, and “higher education” and “course” made sense to me. Something that I didn’t expect, but that makes sense in hindsight, is the set of words associated with the kinds of studies done about information literacy: words like “survey,” “mean score,” “post-test,” “scale,” and “predictor.” Since this is coming from Web of Science, I can assume that the research methods skew empirical in the sample dataset, and those kinds of words and topics would make sense in association with studies on information literacy.

How does your choice of data limit your analysis?

Again, since this is Web of Science, we are getting a lot of empirical studies about information literacy. I’d guess that if you did this kind of analysis with the more widely circulated texts used by run-of-the-mill librarians, you’d get more case studies, anecdotes, and pieces about teaching one-shot library sessions. You’d find more on pedagogy and teaching practice, in other words. Instruction librarians tend to be practitioners, so lesson plans tend to trump empirical research. That isn’t to say the research isn’t out there, or being done, or being read. It just might not be as prominent as this dataset would make it seem.

Obviously, there is a very large node that says “information literacy instruction,” but one thing that is missing is the connection of these other ideas to it: one-shots, pedagogy, and the like.

How can you structure your data to change your analysis?

I’ve played around with a few things here. One thing that I really like is the set of exploratory features that allow you to zoom in on certain data points. So, since the question of pedagogy came up for me, I used the filters on the side to see what terms were associated with “library session” and found the following: “information literacy session,” “library instruction session,” and “session.” Then I can click “session” and see what this word co-occurs with, and what those relationships are. This is a great way of answering questions that arise from the initial overview with all of the terms. I felt like this part of the network was lost to me at first, but here I see that there are some articles about education, pedagogy, and library teaching in this dataset. I like seeing the relative frequency of each term, and how related terms are, as evidenced by their proximity.
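To demystify what that layout encodes, here is a minimal sketch of the term co-occurrence counting that underlies networks like these. The keyword sets are invented stand-ins for terms extracted from article records, not the actual Web of Science data:

    from collections import Counter
    from itertools import combinations

    # Hypothetical keyword sets, one per article record.
    articles = [
        {"information literacy", "library session", "survey"},
        {"information literacy", "higher education", "post-test"},
        {"library session", "pedagogy", "higher education"},
    ]

    # Count how often each pair of terms appears in the same record.
    pairs = Counter()
    for keywords in articles:
        for a, b in combinations(sorted(keywords), 2):
            pairs[(a, b)] += 1

    # Frequently co-occurring pairs are the ones a VOSviewer-style layout
    # pulls close together on the map.
    for (a, b), n in pairs.most_common(5):
        print(f"{a} <-> {b}: {n}")

Proximity on the map, in other words, is just counts like these rendered spatially.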

I’ll be honest, I’m trying to manipulate the data in other ways but am not seeing huge differences in my network output, so I’ll be excited to learn more about this in class.

Bartholomae’s “Inventing the University”

Bartholomae, David. “Inventing the University.” When a Writer Can’t Write: Studies in Writer’s Block and Other Composing Process Problems. Ed. Mike Rose. New York: Guilford, 1985. 134-65.

What can you learn about the number of citations to this article per year since it was published? As the graph shows, the number of times this article is referenced shoots up in the early 2010s and remains high through the twenty-teens. While its popularity seems to have waned in the last few years, it is much more cited now than it was in the first ten years after its publication. I posit this is because the field of rhetoric and composition has come into its own in recent years. There has been more interest in defining the boundaries of the field, and perhaps seminal pieces like this help with that work.
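For what it’s worth, a timeline like this is easy to rebuild yourself. A minimal sketch, assuming a CSV of citing records exported from Web of Science; the “Year” column name is my assumption and may need adjusting to match the real export:

    import pandas as pd
    import matplotlib.pyplot as plt

    # One row per record that cites the article; the "Year" column is assumed.
    citing = pd.read_csv("citing_records.csv")

    # Count citing records per publication year and chart the timeline.
    per_year = citing["Year"].value_counts().sort_index()
    per_year.plot(kind="bar", xlabel="Year", ylabel="Citations",
                  title="Citations per year to Bartholomae (1985)")
    plt.tight_layout()
    plt.show()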

What can you learn about who cites this article? What are their disciplinary identifications? The majority are from English, composition, writing, or rhetoric. However, I am seeing linguistics, TESOL (quite a few L2 publications, actually), and education scholars as well, plus a few from library science. The big hitters tend to be College English and other NCTE publications associated with writing studies.

Which of these numbers would you prefer to have used in evaluations for hiring and tenure? Why? I think I’d prefer the second number because it provides a more holistic view of a body of research, and a trajectory of academic communication, than just one article does. It is very interesting, however, to note that it took quite a while for these citations to spike. Could it be because of digitization and access to the information? Or did it simply take that long for folks to find Bartholomae’s work important? I don’t know. Tenure usually takes six to ten years, though, and if Bartholomae were assessed by his initial numbers, he would not look nearly as impressive as he does now. This has me thinking about the tenure process itself as problematic when considering “impact” and the like.

Is this kind of analysis appropriate for all academic fields? Why or why not? I think this is complicated. While metrics are needed for promotion decisions, these metrics only show who is citing your work, and there is bias in citations. People of color, for example, have historically been under-cited. Women, too, are cited less than men. Finally, I wonder how things like creative projects, DH projects, or other nontraditionally published kinds of scholarship can be accounted for here. In my field, the journal Kairos is an example of this nontraditional kind of scholarly work. I see that some of our readings deal with this for the week, so I look forward to seeing what they say about digital humanities and bibliometrics.

A final question I have (and I should know the answer to this given my library background) is how this compares or contrasts with the citation search in Google Scholar. The numbers for Bartholomae were different there than in Web of Science, so what is being indexed in each? Insofar as I’m aware, WoS skews toward the social sciences and hard sciences, no? Does this affect the number (it clearly does), and what does that say about our relying on such numbers, especially for humanists? Furthermore, for outward-facing humanities work, or scholarship that might be reported in non-academic settings, how does that fit into the mix? Being cited by journalists, for example, is a feat in and of itself, but it would not show up in these kinds of bibliometric ratings.

What does “Pursuing Parity” mean?

While exploring the dataset this week, I couldn’t (as someone trained in rhetoric and composition) get past the fact that countries with unequal access to power were termed “pursuing parity.” To me, that was a way of softening the blow that these countries were at the bottom of the spectrum by packaging it as a “work in progress.” This is where the politics of the situation gets involved. I imagine that since these countries opted into the study and made their data available, they are considered works in progress? I wonder what is gained by terming it this way as opposed to “unequal” or “no parity” or something like that.

With that in mind, I decided to compare a few of the countries at the bottom of the list: China, Saudi Arabia, Russia, and Brazil.


When I search by education rates, there is obviously some data missing. Brazil reports no data, and Saudi Arabia’s data is only partially there; Russia and China have full datasets. So how, I wonder, do you compile a ranked list from incomplete datasets?

Likewise, the gender wage gap indicator does not include data on any of the poorer-performing countries I listed above. With this in mind, I went back to read more about the methodology to see what I was missing. How did they account for working with incomplete data and then drawing up a list based on those data?

Behold! An answer to my question! It looks like there are procedures for when data is missing, and they are discussed in the limitations section of the methodology. While this information appears under “limitations,” it also serves as a justification for how data was dealt with when it could not be gathered. I note that they especially talk about the missing educational data. I am not knowledgeable enough in statistical methods to know what this means for their findings, but given the status of the project, and the partners working on it, I do find it comforting that the limitations are acknowledged. Maybe this goes back to why these underperforming countries are termed “pursuing parity”? How can you say something dire about them when you don’t have all the data, after all? (Though educated guesses might help us fill in the blanks here. Or is that just my Western point of view talking? It may very well be.)
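To make concrete why those handling choices matter, here is a toy illustration (my own invented numbers, not the report’s actual method) of how two common treatments of a missing indicator can reshape a composite ranking:

    import pandas as pd

    # Invented scores on two indicators; None marks missing data.
    df = pd.DataFrame(
        {"wage_gap": [0.60, 0.55, None, 0.70],
         "education": [0.90, None, 0.85, 0.95]},
        index=["Country A", "Country B", "Country C", "Country D"],
    )

    # Option 1: drop countries missing any indicator, then average.
    drop_rank = df.dropna().mean(axis=1).sort_values(ascending=False)

    # Option 2: fill gaps with each indicator's mean, then average.
    impute_rank = df.fillna(df.mean()).mean(axis=1).sort_values(ascending=False)

    print("Dropping incomplete rows:\n", drop_rank)
    print("Mean imputation:\n", impute_rank)

Under the first option, Countries B and C simply vanish from the list; under the second, they are ranked partly on numbers the analyst invented. Neither choice is neutral.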

I compare the two examples above to the indicator “# of female heads of state.” What this teaches me about data gathering and data availability is that researchers sometimes choose indicators they can get answers for with or without a country’s help. It’s easy enough for a researcher to compile a list of female heads of state and compare across nations that way; the countries themselves don’t need to provide the data. This suggests that these kinds of data might be privileged over, say, educational data that may or may not be collected by countries. Furthermore, who is to say whether the data collection methods in certain countries are reliable? There are social and political factors involved here too.


I’m going to add on to this blog post a bit to report on the data availability project that we were asked to execute.

I was interested in learning more about women in tech and STEM. I looked at the Statista database first (which, sadly, Pitt doesn’t subscribe to, but I have super shady ways of getting into the database. Librarian skills and all…). Statista is an interesting database because it’s marketed as a “global business database,” so the information therein is ostensibly there to help businesses make choices about products and hiring, or to help individuals who are interested in global trading markets and companies.

One of the reasons I do like Statista, though, is that oftentimes you can go back to the original dataset, look at the methodologies and reports, and decide whether you want to use the data, or what it’s “good” for.

I also found datasets through the National Science Foundation (nice, because you can download the set) and Pew Research (though that is US-based, so not cross-cultural).

In terms of my findings, I tend to be more interested (as a librarian) in how people access data and less in what they do with it. I understand that this project had to do with both of those things, though. Yet most of the time I spent on this project went to imagining how a student or a lay member of the public might actually get at this information. Mostly, when people Google questions like this, they’ll read a report that breaks down the data for them. Fewer people will actually go look at the data itself. Luckily, for this issue the datasets are publicly available (such as through the NSF). However, I found some of the more nuanced data through Statista: a database you have to subscribe to. So even where there is a lot of data (women and STEM is one of those areas), who has access to it?

Furthermore, how are people actually searching for these things? What keywords are they using? How does that allow them to find information, or does it hide information? Statista’s keyword search works differently from a traditional database’s. The Wikipedia article on women in STEM is well researched, but are folks going to the works cited list and clicking on those links? Those are the kinds of information behaviors I’m wondering about with the issue of data access.


Invisible Women

As I mentioned in class, I just finished Perez’s Invisible Women: Data Bias in a World Designed for Men (2019). In the book, Perez cites some of the kinds of research we read for last week. She talks about how problematic many research methods are, regarding everything from city planning to medical research. So many of these areas do not account for gender difference, and they therefore privilege male bodies and needs as “typical” and women as “atypical.” In some cases the gender data gap leads to daily inconvenience for women; in other cases, it kills them.

Perez also discusses the need to consider culture as part of this. She tells of companies that made new “clean” stoves to help women in the developing world avoid the smoke inhalation caused by cooking over open fires. However, the companies failed to take into account the actual needs of these women in their design. They designed stoves that took longer to cook food, so many women reverted to open-fire cooking. In a report, the companies then blamed the women’s lack of training, not the stove design (and the flawed data gathering behind it), as the problem that needed to be addressed before the women would start using their products.

This context, and the discussion last week, got me thinking about how data gathering often tries to sterilize the messiness of the world. Women can be hard to gather medical data on, for instance, because of hormonal fluctuations throughout the month. But not dealing with the messiness (the kind the academics in our case study were shown to always bring up) will continually leave someone out. While not all data can be collected perfectly, striving for better gender representation is surely a worthwhile goal; women are half of the population, after all. In other words: let’s keep being academics and complicating things!

(Note: sorry I’m missing page number citations here. I listened to the audiobook and it has since been returned to the library!)


Elise’s Intro

Hello all,

As I mentioned in class, I am a first-year PhD student in English composition, and I’m also doing the DSAM certificate. With my background as a composition teacher and a writing programs/instruction librarian, I am very interested in how students interact with information objects in writing contexts. This might include student source evaluation behaviors, research writing, source synthesis, or citation analysis in student research papers.

I found out about this seminar from my literary theory professor last semester, who helped me understand some of the overlap between cultural criticism and the work being done in the information sciences regarding data, algorithmic oppression, post-humanism, and media studies. When I was a librarian, I read Safiya Noble’s Algorithms of Oppression, and in some ways it was one of the reasons I applied to come back to grad school. While the ways she talked about information were not new to me, seeing a Black feminist lens applied so directly in this context was. I appreciated the way it shocked my thinking.

So with that in mind, some of my goals for this class are to see how different disciplines conceive of information systems. What are the different lenses through which they engage with, and unravel, information ecosystems? How might some of these methodologies inform the research I will eventually conduct for my dissertation?

-Elise