The Darker Secrets of the Digital Humanities

Another semester comes to an end, and for the first time ever I’ve spent more quality time with my computer than with a good old-fashioned book in order to complete my English class. Twitter, WordPress, and WordHoard have consumed my life and have completely flipped the world of Shakespeare around for me. I’ve never been a huge fan of the Bard, and I’m still not super interested in investigating him any further than I’m required to. Having the Internet and the various digital tools there to aid me definitely made this semester a lot more enjoyable than the fall semester, which was strictly reading Shakespeare’s works (with a hint of Twitter).

The way we chose to investigate Hamlet this semester was by strictly looking for answers to our own questions. The problem with this is that we eliminated anything we found that didn’t necessarily fit with our hypothesis; we also tended to eliminate things that we didn’t find interesting. In his blog entry “Avoiding Traps”, Scott B. Weingart (the scottbot irregular) discusses sampling bias, selection bias, data dredging, cherry picking, confirmation bias, p-values, positive results bias, the file drawer problem, and HARKing. I believe that all of the above are crucial both to understanding the digital humanities fully and to making sure we don’t make broad or incorrect assumptions about Shakespeare’s literature.



Beware of Biases

Weingart defines a selection bias as “an error in choosing the individuals or groups to take part in a scientific study”, and writes of sampling bias “that it undermined the external validity of a test (the ability of its results to be generalized to the rest of the population)”. To make both of those fit our classroom, we would treat our digital tools (WordHoard, WordSeer, Monk, TAPoR, and Voyeur) as the individuals taking part in our study, and the sampling bias would simply lie in the results we garner from them. As we learnt throughout the semester, some tools are simply not designed to analyze specific portions of the text. Some are better at looking at a specific scene and act (Phase 1), others are better at looking at whole scenes (Phase 2), and there are still some that, I get the sense, are not great at working in either phase and would be better suited to comparing the whole text to other works.

I worked with WordHoard for the entirety of the course, and personally I felt like it worked well during both phases. I was able to gain the information that I needed relatively quickly; however, I did notice that when I presented my findings to classmates who were not using WordHoard, they were confused by my findings and screenshots (I even tried kicking it old school and presenting my findings on sticky notes, as seen in my fourth blog post, to no avail). My findings fit perfectly into the concept of sampling bias: they’re unreadable to non-users of WordHoard, which makes it hard for them to reach a wide audience.


To Use All the Information, or to Not Use All the Information, That Is the Question

The Internet is filled with more information than one person will ever need. Through our work in the digital humanities we’re just expanding the information that is out there, and for me this is a terrifying idea. When I first started elementary school, which was only in 1997, we still did all our research with books; the Internet was still considered “new”. Now we live in a digital age where anything we want or need to know can be typed into nearly any device, and we’ll receive an answer in seconds or less. We must be wary of the answers we receive from the Internet, as a good portion of it is misleading or false. The Internet is full of “trolls” (which Urban Dictionary users define as “Someone who is purposefully posting on a forum/message board/site with the sole aim to irritate the regular members”); in a sense Hamlet could be considered the troll of his day.


So what do we do with all this information? Are we just adding fuel to the fire without even realizing it? Are our assumptions and conclusions trolling the digital humanities community and Shakespearean aficionados?

Weingart’s concern about data dredging resonates with me a great deal. For me, this was the most terrifying part of the process. Data dredging is the idea that, with all the information out there for us, it’s “tempting to find correlations between absolutely everything”. I fell victim to data dredging when I trusted Monk’s findings (HA, why did I ever trust Monk?). In my most recent blog post I talked about taking April’s results and testing them in my own tool. I guess Monk scoured its database and came up with the results below, but when I tested them in WordHoard it came up with zero results.

(April’s Results) 
(My Results)

Weingart was talking about human data dredging, but in the case of Monk versus WordHoard, I fell victim to Monk’s data dredging, which gave me false positives. Monk trolled me.
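
To see why dredging is so tempting, here is a rough Python sketch of the general idea. It is not how Monk works (I have no idea what Monk does internally); it simply fills 40 made-up lists with random numbers and checks every pair for a strong correlation. The function names, the list sizes, and the cut-off of 0.632 (roughly the value of r that would count as significant at the 5% level for ten data points) are all my own assumptions for illustration.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def count_spurious(n_series=40, length=10, threshold=0.632, seed=0):
    """Count pairs of pure-noise series whose correlation looks 'strong'.

    All values are invented random numbers, so every hit is a false positive.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(n_series)]
    pairs = list(combinations(series, 2))
    strong = sum(1 for a, b in pairs if abs(pearson(a, b)) > threshold)
    return strong, len(pairs)

strong, total = count_spurious()
print(f"{strong} of {total} random pairs look strongly correlated")
```

None of these lists has anything to do with any other, yet if you compare enough pairs, some will clear the bar by chance alone. That is the trap: go looking everywhere and you will always find something.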


Information Everywhere!

We all want to come off as intelligent individuals who know what they’re talking about, so we tend to share only our most solid and interesting information. We are all guilty of cherry picking (cherry picking isn’t just for sports anymore); we continuously cut away information until we get the strong hypothesis or conclusion that we were searching for.

For example, I looked up the word “love” in WordHoard and it told me that it appeared 65 times in the play. Great! Now, if it would strengthen my argument, I could make the general assumption that love was used in the Oxford English Dictionary sense of “a feeling or disposition of deep affection or fondness for someone” in all 65 occurrences. That would be cherry picking. However, looking further into the results I see it’s not always used in that context:

Hamlet: As love between them like the palm might flourish, (5.2.40) ✔

Gertrude: For love of God, forbear him. (5.1.276) ✖

Hamlet uses the word love in the proper context of the OED definition, but Gertrude simply uses it as an expression with no significance behind it.
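
The same check can be sketched in a few lines of Python. This is only an illustration and assumes nothing about how WordHoard works internally: instead of trusting a raw count, it lists every hit with its speaker and line reference so each use can be read in context. The only data here are the two lines quoted above, and the name `keyword_in_context` is my own invention.

```python
import re

# Only the two lines quoted above; a real run would load the full play text.
LINES = [
    ("Hamlet", "5.2.40", "As love between them like the palm might flourish,"),
    ("Gertrude", "5.1.276", "For love of God, forbear him."),
]

def keyword_in_context(lines, word):
    """Return the (speaker, reference, text) entries that contain `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [entry for entry in lines if pattern.search(entry[2])]

hits = keyword_in_context(LINES, "love")
print(len(hits), "hits for 'love'")
for speaker, ref, text in hits:
    # A count alone hides the sense of the word; reading each line does not.
    print(f"{speaker} ({ref}): {text}")
```

The count says both lines are equal hits; only reading them shows that one is about affection and the other is just an expression.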

After you’re done cherry picking and data dredging, you’re left with about 5% of all the information you’ve collected, because that is all you’ve deemed worthy of being presented and shared. This is called the positive results bias. All the other information that is left over from your research is discarded, creating the file drawer problem.

The file drawer problem is an issue because without sharing our failures, or inconclusive results, we’re leaving other people to go down the same path. If we worked together as a community and published all our results, the good and the bad, we’d be able to see what works and what doesn’t and be able to provide better feedback and support.


Going Forward

Going forward, new and old digital humanists need to be aware of what their work is doing and how it’s helping or not helping others. It is important to acknowledge the biases that form when we do our research and to make a conscious effort to stop them. If we can stop publishing only our positive results and start sharing our other trials too, which the majority of English 203 did this semester in their blog posts due to all the frustrations and headaches our tools created, we can help and foster one another’s learning.

Data dredging and cherry picking are harder to stop doing because we’re drawn to those results. They’re the ones that bring us closer to our goal and the purpose of our research. Sometimes other alleys and opportunities should be looked into before simply sticking to those first positive results.

Weingart also mentioned confirmation bias, p-values and HARKing, which I did not touch on either because I don’t have enough knowledge on the subject (p-values), or I felt that they didn’t quite fit into our classroom (confirmation bias and HARKing). However, from what I read, I do believe they are still important and vital to sustaining and fostering the growing digital humanities. As an individual who is addicted to her computer and the Internet, I hope they’re here to stay and get worked into more of the University’s courses.
