Intro to Digital Humanities – Day 4

I am a student of the “Introduction to Digital Humanities” course @edx.org.

Today, I learned “Why Data Matters” (lesson 1.4).
I consumed the following lessons:

  1. What is Data?
  2. What is Digital Scholarship?

Today’s discussion was: “In his essay, “How Not to Teach Digital Humanities,” Ryan Cordell writes, “…I have become increasingly convinced that DH will only be a revolutionary interdisciplinary movement if its various practitioners bring to it methods of distinct disciplines and take insights from it back to those disciplines.” (Debates in the Digital Humanities, 2016, 463)
Based on what you already know about the humanities and any categories of computing or digital technology, can you identify some of the benefits that digital tools of analysis provide to humanities research?
Based on what you already know, can you identify some of the drawbacks or risks that we should all keep in mind when considering how digital tools, methods, and sources shape our understanding of specific research questions?
Please write your own post and then comment on the posts of a few other learners in the course.”

My answer follows.

Title:
Benefits: automation, scale, self-reinforcing research platforms; possible drawbacks: quality, trust, IP questions

Body:
One of the easiest benefits to understand is one that digital tools of analysis bring to any area of research, not only the humanities: performing repetitive tasks over “arbitrary” quantities of material. Of course, the “repetitive” task must be coded, which is much easier said than done, and the “arbitrary” size is not so irrelevant: depending on many factors, software might not scale up as expected, and might end up requiring resources beyond the reach of the typical researcher. For example, software might work without issues on x sources, yet become unusable when that quantity is merely doubled, if its cost grows exponentially with the number of sources.
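To make the “doubling” remark concrete, here is a minimal sketch in Python. The three cost models are my own assumptions, chosen only for illustration; they do not describe any real tool. The sketch compares how much more work is needed when the number of sources doubles.

```python
# Hypothetical cost models: how many "steps" a tool needs for n sources.
# These are illustrative assumptions, not measurements of real software.
COST_MODELS = {
    "linear": lambda n: n,
    "quadratic": lambda n: n * n,
    "exponential": lambda n: 2 ** n,
}


def growth_when_doubled(n, model):
    """Return how many times more steps are needed for 2*n sources than for n."""
    cost = COST_MODELS[model]
    return cost(2 * n) // cost(n)


if __name__ == "__main__":
    # Going from 20 to 40 sources under each assumed model.
    for name in COST_MODELS:
        print(name, growth_when_doubled(20, name))
```

Under the linear model the work doubles and under the quadratic model it quadruples, but under the exponential model going from 20 to 40 sources multiplies the work by 2 to the power of 20, roughly a million.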

To clarify what I mean by “repetitive task”, I could pretend to be researching “what is the most painted fruit in Portuguese paintings of the 18th century?”. Imagine that I have at my disposal digital pictures of all those paintings. Now I “only” need a solution to automate fruit identification. Maybe someone has already trained a neural network for that. Then, such a digital tool will be able to build me a histogram of fruit occurrences; something like: apple: 756, pear: 567, etc. Doing this without computer assistance would be much harder. However, this observation assumes the availability of the files and the quality of the identification tool. If I had to write the tool alone, starting from zero, including all the fundamentals of the underlying Artificial Intelligence, I would surely get the job done faster by checking the pictures myself.
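Assuming the hard part (fruit identification) is already solved by someone else's tool, the counting itself is the easy part. The sketch below is only an illustration: the painting identifiers and labels are invented, and the dictionary stands in for the output of a hypothetical identification tool.

```python
from collections import Counter


def fruit_histogram(labels_per_painting):
    """Count in how many paintings each fruit appears.

    labels_per_painting maps a painting id to the list of fruit labels
    that a (hypothetical) identification tool reported for it.
    """
    counts = Counter()
    for fruits in labels_per_painting.values():
        # A fruit counts once per painting, however many times it is painted.
        counts.update(set(fruits))
    return counts


if __name__ == "__main__":
    # Invented sample data, standing in for the identification tool's output.
    sample = {
        "painting-001": ["apple", "pear", "apple"],
        "painting-002": ["apple"],
        "painting-003": [],
    }
    histogram = fruit_histogram(sample)
    print(histogram.most_common())
```

I chose to count presence per painting rather than every single fruit, because the research question is about how often a fruit is painted, not about how many apples sit in one bowl. Counting each individual fruit would be an equally valid, but different, question.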

This repetitive task involves a single identification problem. We can all imagine more challenging research questions, such as “can a link between fruits and social status be derived from Portuguese paintings of the 18th century?”. In this second example, the problem of identifying the presence of people in a picture and, harder than that, assigning each person a social status based on his/her clothes, or hair style, or both, or something else, seems incredibly difficult. Yet, it remains “repetitive” and, to yield credible results, it should be performed over a large quantity of sources.

It might be very complex software but, in the end, the tool I am imagining for the previous examples outputs simple metadata. For each picture, it produces three types of tags: the fruit tag, the person tag (e.g. “with person”/“no person”), and the social status tag (e.g. “noble”, “undetermined”). This data can be helpful to other projects, so one enormous benefit of digital tools is that they seed and contribute to future projects. Many alternative research questions can be formulated and assisted by data made available previously. It is a self-reinforcing virtuous mechanism: the more tools are made available, the more data becomes possible, the more questions get support, and the more results can potentially be harvested.
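The three tags could be stored as one small record per picture. The sketch below is an assumption about how such metadata might look; the class name, field names, and sample records are all invented for illustration. It also shows how a later project could reuse the same records to cross fruits with social status, without running any image analysis again.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PaintingTags:
    """Hypothetical metadata record produced by the imagined tool."""
    painting_id: str
    fruits: tuple       # fruit tags, e.g. ("apple", "pear")
    has_person: bool    # person tag: "with person" / "no person"
    social_status: str  # e.g. "noble", "undetermined"


def fruits_by_status(records):
    """Count, per social status, in how many paintings each fruit appears."""
    table = defaultdict(Counter)
    for record in records:
        # Paintings without people carry no social status information.
        if record.has_person:
            table[record.social_status].update(set(record.fruits))
    return table


if __name__ == "__main__":
    # Invented records, only to show the shape of the data.
    records = [
        PaintingTags("painting-001", ("apple", "pear"), True, "noble"),
        PaintingTags("painting-002", ("apple",), True, "undetermined"),
        PaintingTags("painting-003", ("pear",), False, "undetermined"),
    ]
    for status, counts in fruits_by_status(records).items():
        print(status, dict(counts))
```

Of course, a table like this only shows co-occurrence. Whether it supports a real link between fruits and social status still depends on the quality of the tags and on the size of the collection.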

I see three main classes of problems/drawbacks:

  • the digital tools themselves have associated intellectual property rights, such as copyright and/or specific licenses, and those can be hard to decipher;
  • the quality of the digital sources is relevant: one bad digital input can subvert results, hence the importance of museums digitizing their own collections, so that researchers are not pushed towards alternative digital materials;
  • tools can (and will) fail and, if no monitoring is performed, researchers risk having trusted what was not trustworthy.

I could mention the potential neglect of “originals” as one possible drawback of the availability and comfort of digital versions, but that is not intrinsic to the digital resources; it is rather one possible behavior of researchers.

I published my answer as a new post, at the following URL: https://courses.edx.org/courses/course-v1:HarvardX+DigHum_01+1T2019/discussion/forum/340689f62e7b18aeea76197f3df4ac342e5a5807/threads/5d84f63c8149fd0955002901