STAT 542 Homework 7

NOTE: Please carefully read the ALTERATIONS to the book's questions in the Chapter 19 exercises below!!!

NOTE for Problems 1 and 2: Recall that the following code will prepare the Macbeth text for analysis:

### Begin code here
library(tidyverse)
library(mdsr)
data(Macbeth_raw)
macbeth <- Macbeth_raw %>%
  str_split("\r\n") %>%
  pluck(1)  # plucks the first element from the list
### End code here

Problem 1 in the Chapter 19 Exercises (in Section 19.5 of the book).

NOTE for 1(a): You can actually use the 'str_subset' function, which we focused on in our class, rather than the 'grep' function (which does similar stuff). For your answer to 1(a), print at least the first 10 speaking lines, and ALSO state how many speaking lines there are in total.
HINT: Printing the first few "speaking lines" may help you adjust your final answer about how many total speaking lines there are!

NOTE for 1(b):
- For the purposes of this exercise, define a "hyphenated word" to be any lowercase letter followed by a hyphen (-) followed by another lowercase letter.
- For your answer to 1(b), print all of the LINES that contain hyphenated words.

Problem 2 in the Chapter 19 Exercises (in Section 19.5 of the book).

NOTE for 2(a): Don't worry about whether the word is an adjective. Just find all of the words that contain a lowercase letter followed by the letters "more" OR contain a lowercase letter followed by the letters "less". This will exclude instances of the simple words "more" or "less".
- For your answer to 2(a), print all of the LINES that contain the words described above. You should actually work with the 'macbeth' object for this rather than 'Macbeth_raw' as the book states.

NOTE for 2(b):
- For your answer to 2(b), print all of the LINES that contain either Exit or Exeunt (with the first letter of the word capitalized).

Problem 3 in the Chapter 19 Exercises (in Section 19.5 of the book).

NOTE: For Problem 3, don't worry about guessing the output without running the code. Go ahead and run the following code:

### Begin code here
x <- c(
  "popular", "popularity", "popularize", "popularise", "Popular",
  "Population", "repopulate", "reproduce", "happy family",
  "happier\tfamily", " happy family", "P6dn"
)
x
str_subset(x, pattern = "pop")              #1
str_detect(x, pattern = "^pop")             #2
str_detect(x, pattern = "populari[sz]e")    #3
str_detect(x, pattern = "pop.*e")           #4
str_detect(x, pattern = "p[a-z]*e")         #5
str_detect(x, pattern = "^[Pp][a-z]+.*n")   #6
str_subset(x, pattern = "^[^Pp]")           #7
str_detect(x, pattern = "^[A-Za-p]")        #8
str_detect(x, pattern = "[ ]")              #9
str_subset(x, pattern = "[\t]")             #10
str_detect(x, pattern = "[ \t]")            #11
str_subset(x, pattern = "^[ ]")             #12
### End code here

Then, based on the output for each of the 12 examples, describe in ordinary English what the given regular expression is searching for and what output is returned.

HINT for the 7th example: Note the different role of the caret symbol ^ when it is used at the beginning of the regular expression (we discussed this usage of ^ in class) and when it is used as the first character INSIDE the square brackets (we did not discuss this usage of ^ in class, but based on the resulting output you should be able to figure out what the ^ does when used as the first character INSIDE square brackets). A small illustration of these two usages, separate from the homework code, is given just below.
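(Illustration for the HINT above -- not part of the homework code. This is just a tiny made-up example, assuming the tidyverse/'stringr' functions have already been loaded as in the setup code at the top of this handout; the vector 'fruits' is invented purely for this illustration.)

### Begin code here
fruits <- c("pear", "Peach", "apple", "grape")
str_subset(fruits, pattern = "^p")      # ^ OUTSIDE brackets anchors the match to the start of the
                                        # string, so only "pear" is returned
str_subset(fruits, pattern = "[^p]")    # ^ as the FIRST character INSIDE [ ] negates the set: this
                                        # matches any character that is NOT a lowercase p, so every
                                        # element is returned
str_subset(fruits, pattern = "^[^Pp]")  # combined: strings whose FIRST character is neither P nor p,
                                        # so "apple" and "grape" are returned
### End code here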
Problem 10 in the Chapter 19 Exercises (in Section 19.5 of the book).

NOTE: Don't worry about loading the 'DickinsonPoems' package as it says in the book. Instead, I have the text file of the poem at the URL:
https://people.stat.sc.edu/hitchcock/lonelyhouse.txt

Everyone should read the poem and describe whether you believe it to be positive or negative in sentiment and mood overall. Explain your reasoning in at least two sentences.

Then for EXTRA CREDIT, find the mean sentiment value of all the words in the poem (excluding stopwords) based on the AFINN lexicon, and state whether the calculated mean sentiment value matches your subjective impression of the mood of the poem.

HINT for extra credit: You can use the 'tibble' function to create a data frame with the words of the poem, much like I created the objects 'd' and then 'd_clean' in one of our class examples. And using the 'skim' function on the 'value' column is one simple way to get the mean sentiment value of all the words in the poem (no grouping needed here). A rough sketch of one possible workflow is given at the end of this handout.

NOTE about format: For this homework, please turn in TWO files on Blackboard:
- The first file should be a Word document or pdf with the answers to the questions, in the form of any output requested and any written answers or interpretations for those problems that call for writing.
- The second file should be a PLAIN TEXT file (.txt file) with the code that you used to create the results, etc. for the problems. Any lines in this text file that are NOT code should start with # so that they will be treated as comments and not executed.
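HINT (continued) for the Problem 10 extra credit, referenced above: the code below is only a rough sketch of one possible workflow, not the required solution, and it may differ from the class example it imitates. It assumes the 'tidytext', 'textdata', and 'skimr' packages are installed (the first call to the AFINN lexicon may ask you to confirm a one-time download), and the object and column names ('d', 'd_clean', 'word', etc.) are just choices made for this sketch.

### Begin code here
library(tidyverse)
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(skimr)      # skim()

# Read the poem from the course URL, one line of the poem per element
poem_lines <- readLines("https://people.stat.sc.edu/hitchcock/lonelyhouse.txt")

# Put the lines in a tibble, split them into one word per row, and remove stopwords
d <- tibble(line = poem_lines)
d_clean <- d %>%
  unnest_tokens(word, line) %>%
  anti_join(stop_words, by = "word")

# Attach the AFINN sentiment values and summarize the 'value' column
# (the mean in the skim() output is the mean sentiment value)
d_clean %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  select(value) %>%
  skim()
### End code here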