The last time I wrote, my post focused on the frustrations I had experienced with MONK's limitations, and the difficulty of even approaching my starting question: what could our tool tell us about Hamlet 3.4 that we couldn't get from just reading the text? Needless to say, this post is very different.
At our team meeting today, we prepared to deliver our presentation on MONK and its capabilities, and to explain how it led us to new understandings of Hamlet 3.4. When we divided up the topics, I was assigned the task of explaining the classification methods MONK uses, Naive Bayes and decision tree induction, and how MONK applies them to produce useful knowledge. Since these were concepts I had a grasp of (a slippery grasp at that), I felt comfortable explaining to my teammates the information I had absorbed from the previous night's reading.
Well, as I began talking and explaining my findings by referring to the actual process of using these methods, I realized I hardly understood what I was talking about, or where my vague and unconfident sentences were taking me. It was after that meeting that I sat down and furiously (or rather, with committed fervour) researched, practiced, and practiced again until I understood exactly how these methods could help our analysis. The following is what I found.
Text mining, also called data mining, is, in the shortest possible explanation, a process built on pure mathematical data analysis: it returns statistics and probabilities based on patterns and sequences observed in the data. MONK, which relies on Naive Bayes and decision tree induction, is one tool that applies these text mining methods.
The tutorials for Naive Bayes and decision tree induction provide detailed, technical explanations of what these analytics are and how they work. In my attempt to get a better understanding, I started there. For those of you who read them, you will see that when I say detailed and technical, I mean that it looks like English, but there were moments when I doubted that it really was.
This section (below) is only half English.
This one is most definitely not English.
So, I turned where all students turn for short and quick explanations: Wikipedia. In my brief descriptions to follow, there are terms that I must first address in order for the explanations to be coherent.
- Training set – a set of data used to discover parameters that can predict relationships between two or more sets of data.
- Test set – a set of data used to assess the strength of the predictions learned from the training set.
- Overfitting – a crucial hazard for training sets: it occurs when a statistical model (such as those in MONK) captures the minor fluctuations and random errors in the training data instead of the relevant relationship, usually because there are more parameters than there are observations.
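To make those three terms concrete for myself, I wrote a toy sketch of the vocabulary (this is my own made-up example, not MONK's actual code): the training set is what the model learns from, and the held-out test set checks whether what it learned generalizes.

```python
# Toy illustration of "training set" vs. "test set" (assumed example):
# fit on one labelled set, then assess on data the model never saw.

training_set = [("love story", "fiction"), ("battle record", "history")]
test_set = [("love letter", "fiction")]

# "Fit": remember which class label each training word appeared under.
learned = {}
for text, label in training_set:
    for word in text.split():
        learned[word] = label

# "Assess": score the test set using only what was learned above.
correct = sum(
    learned.get(text.split()[0]) == label for text, label in test_set
)
print(correct / len(test_set))  # 1.0 — "love" was seen under "fiction"
```

An overfit model would be one that memorized the exact training texts instead of word-level patterns like this, and so would score zero on anything new.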
Naive Bayes is a classification method that uses two or more "classes" assigned to training sets. It builds knowledge and "learns" the differences between the classes, then applies what it has learned to classify an unknown text. It is useful for three things:
- Categorizing a text.
- Finding features that stand out in a text.
- Identifying characteristics of one text that are common to a larger body of texts, like a genre.
The MONK tutorial points out that the most interesting results from Naive Bayes are often what we would consider "misclassifications." In this way, Naive Bayes is useful for forming a hypothesis and testing it, or for confirming something you believe you already know.
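The idea of "learning comparisons between classes" finally clicked for me when I wrote it out as a tiny word-counting classifier. This is only a minimal sketch of the general technique (with made-up mini-texts and Laplace smoothing), not MONK's actual implementation:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (text, label). Returns per-class word counts and doc counts."""
    counts, totals = {}, Counter()
    for text, label in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify_nb(text, counts, totals):
    """Pick the class with the highest log-posterior, add-one smoothed."""
    vocab = {w for wc in counts.values() for w in wc}
    best_label, best_score = None, float("-inf")
    for label, wc in counts.items():
        score = math.log(totals[label] / sum(totals.values()))  # prior
        denom = sum(wc.values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((wc[word] + 1) / denom)  # smoothed likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Two tiny "training sets", one per class (invented stand-ins for real plays):
docs = [
    ("crown throne england king", "historical"),
    ("ghost revenge poison madness", "fictional"),
]
counts, totals = train_nb(docs)
print(classify_nb("revenge and madness", counts, totals))  # fictional
```

The unknown text gets the label whose training words it most resembles, which is exactly the "learned comparison" being applied.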
Decision tree induction takes the classifications provided by Naive Bayes and determines which attributes or characteristics produced them. Below is a simplified, understandable image of the basic concept of a decision tree, provided by the MONK tutorial.
This is the process applied to the data in the decision tree: it determines which features are present and which are not, then logically produces a "tree" of information that leads to probabilities.
This is where overfitting becomes crucial. When the model grows too complex, it fits the training data in too much detail, making it essentially useless for analyzing any text outside the training set. Instead of "learning" the general relationship between ideas, it memorizes that particular training set and attempts to apply it elsewhere.
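The memorize-versus-generalize contrast can be shown in a few lines. This is an invented toy, not anything MONK does internally: one "model" memorizes its exact training texts, the other learns a single general rule, and only the rule transfers to new data.

```python
# Toy contrast (assumed example): memorizing the training set vs.
# learning one general rule from it.

train = [("ghost revenge", "fiction"), ("king england", "history")]
test_text = "revenge poison"  # new, unseen text that is in fact fiction

# Overfit "model": an exact lookup table of the training texts.
memorized = dict(train)

# General "rule": one feature learned from the training set.
def rule(text):
    return "fiction" if "revenge" in text.split() else "history"

print(memorized.get(test_text))  # None — this exact text was never seen
print(rule(test_text))           # fiction — the general rule transfers
```

The memorizer is perfect on its own training set and useless everywhere else, which is precisely the failure described above.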
I explain the analytics behind the tool because, once I understood what the tool was searching for and how it searched, it became far easier to understand how to use it. With a body of text, and a tool that compares one body of texts to one or more others, it is extremely difficult to decide what to look for that could be significant. Being given the probabilities and frequencies of words in texts is, despite how simple it may sound, a difficult place to start, because there are just too many words.
Nevertheless, this is what I learned.
In general, using MONK's classification tools and practicing using them correctly did not further my understanding of Hamlet 3.4 as much as I had hoped. It did, however, confirm some of what I believed, surprise me with beliefs that turned out to be wrong, and open a door into the digital humanities by showing me its vast capabilities. For example:
In terms of Hamlet 3.4, I attempted to analyze the scene against all the tragedies in order to find what in this scene was characteristically tragic in Shakespeare's language. Unfortunately, given the way worksets are defined, the closest I could get to this kind of analysis was Hamlet compared to all of Shakespeare's tragedies, and 3.4 compared to the remainder of Act 3. There I faced another problem: what parameters do I assign each scene in order to find out something useful about 3.4?
In the section where it says "click to rate," you are setting a parameter. If you filled in "love," "death," and "betrayal" as the themes of the first three scenes and hit "continue," it would return the theme that scene 4 best fits, according to the probability determined by Naive Bayes. Doing this, unfortunately, returned no substantial results, as the interactions within the individual scenes varied too much from scene to scene.
In attempting to compare the nature of Hamlet to the tragedies, I did the following:
After hitting continue, I set the following parameters:
These parameters returned to me the following classifications using Naive Bayes algorithm:
The intensity of the red next to the title of the play indicates the level of confidence (that is, the lowest probability of error) that its classification is correct. The predicted rating is the classification that Naive Bayes provides, based on the two classes (historical and fictional) that I set for it. From this, Naive Bayes shows me that it is fairly certain, based on the data I provided and the data it analyzed, that there is a certain probability that Hamlet is a fictional play.
When I click Hamlet and then continue, MONK shows me the data behind its confidence level.
The nouns in the far right column are those that gave the Naive Bayes algorithm its confidence. The "Avg. Freq. Training" column is the average number of times the word appears in the "parameter" plays that I labelled earlier, and the "Avg. Freq. Test" column is the average number of times the word appears in the plays that I left to be classified.
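If I understand those columns correctly, each one is just a per-play average of a word's count across one group of plays. A toy sketch, with invented mini-"plays" standing in for the labelled and unlabelled worksets:

```python
from collections import Counter

def avg_freq(plays, word):
    """Average occurrences of `word` per play, across a list of texts."""
    return sum(Counter(play.split())[word] for play in plays) / len(plays)

# Hypothetical mini-corpora (stand-ins for the real worksets):
training_plays = ["king crown king", "king battle"]  # plays I labelled
test_plays = ["king ghost"]                          # plays left to classify

print(avg_freq(training_plays, "king"))  # 1.5 — "Avg. Freq. Training"
print(avg_freq(test_plays, "king"))      # 1.0 — "Avg. Freq. Test"
```

A large gap between the two averages for a word is what makes that word evidence for one class over the other.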
The reason the confidence is not a vibrant red in the predictions, however, is the infrequent words that appear below:
When I click "Decision Tree," the image that pops up displays the process by which the analytics built the tree to determine which word could act as the classifier.
The results displayed above provide the probability of error for the word "unkindness" as the basis of that classification. The decision tree indicates that, in terms of probability, this word had the lowest error rate and the highest predictive performance.
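Choosing the word with the lowest error rate is the heart of decision tree induction, and it can be sketched in a few lines. The labelled "plays" below are invented stand-ins for the real worksets, and the one-node rule is only an illustration of the selection step, not MONK's algorithm:

```python
def error_rate(docs, word):
    """Error of the one-node rule 'word present -> fictional' on labelled docs."""
    wrong = 0
    for text, label in docs:
        predicted = "fictional" if word in text.split() else "historical"
        wrong += predicted != label
    return wrong / len(docs)

# Hypothetical labelled plays (made-up data for illustration):
docs = [
    ("unkindness ghost revenge", "fictional"),
    ("unkindness poison madness", "fictional"),
    ("king crown england", "historical"),
    ("ghost king battle", "historical"),
]

# Induction picks the candidate word whose split errs least; that word
# becomes the root of the tree.
best = min({"unkindness", "ghost", "king"}, key=lambda w: error_rate(docs, w))
print(best)  # unkindness — splits the toy data with zero errors
```

In this toy data "unkindness" separates the two classes perfectly while "ghost" and "king" do not, which is exactly the sense in which a word can have "the lowest error rate, and highest predictive performance."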
Therefore, from this data, I can conclude that Naive Bayes and the decision tree have determined there is a higher probability that Hamlet is a play of fiction rather than history.
In conclusion, despite the various frustrations the group has experienced and the little we picked up about 3.4 specifically, through Naive Bayes and decision tree induction I have learned that classifications are a great place to start. Comparing texts to determine aspects of one based on another CAN show you something you never knew, or prove you wrong, and in doing so give you some idea of what to look for, or what research criteria to change.
In terms of research, as we’re doing in ENGL203, learning and being wrong…I think that’s a great way to start.