This vignette introduces the phrasemachine R package. The main function, phrasemachine(), takes a document or list of documents as input and returns a list of phrases extracted from these documents. These phrases can then be fed into the preprocessing pipelines of a number of other text analysis packages in R, including quanteda. A parallel implementation of this package is available for Python users. More information (including easy installation instructions via pip) can be found at the GitHub page for this package.
The paper detailing phrasemachine can be found at the link below:
- Handler, A., Denny, M. J., Wallach, H., & O’Connor, B. (2016). “Bag of What? Simple Noun Phrase Extraction for Text Analysis”. In Proceedings of the Workshop on Natural Language Processing and Computational Social Science at the 2016 Conference on Empirical Methods in Natural Language Processing. [Available Here]
This package relies on a part-of-speech (POS) tagger to extract phrases. The most portable POS tagger available in R comes in the OpenNLP package. However, the POS tagger that OpenNLP provides is not as accurate as the current state-of-the-art taggers in software packages available for other languages (such as CoreNLP). We intend to eventually incorporate other POS taggers into this package, but for now, if you want the highest accuracy, we suggest using the Python implementation of the package. In practice, there may not be a significant difference in the end results, but we wish to make the end user aware of this possibility.
The release version of the package can be installed from CRAN as follows:
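Assuming the release is published on CRAN under the name phrasemachine (the same name passed to library() later in this vignette), the standard install call would be:

```r
# Install the release version from CRAN.
# The package name "phrasemachine" is assumed from the library() call used below.
install.packages("phrasemachine")
```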
If you want to get the latest version from GitHub, you will need to have the devtools R package installed first:
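devtools is itself distributed on CRAN, so it can be installed in the usual way:

```r
# Install devtools from CRAN; it provides the GitHub installation helpers.
install.packages("devtools")
```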
Now we can install from GitHub using the following line:
Once the phrasemachine package is installed, you may access its functionality as you would any other package by calling:
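This is the same call that opens the worked example further down:

```r
# Attach the package so that phrasemachine() is available in the session.
library(phrasemachine)
```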
If all went well, check out vignette("getting_started_with_phrasemachine"), which will pull up this vignette!
In general, you will need to have Java 1.8+ installed on your computer for the OpenNLP package to work. There are a number of operating-system-specific tutorials on the web, and most newer computers meet this requirement by default. However, we expect issues with Java to be the most common problems users encounter when trying to install and use the OpenNLP package, which we use for POS tagging. In particular, if you are trying to install this package on a newer Mac computer (OS X 10.10+), you may encounter an error when trying to load the package. If so, we suggest you follow the instructions in the blog post [here] to configure R and Java correctly.
On older operating systems, you may not have Java 1.8+ installed, in which case you will need to install it before updating your Java settings.
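As a first troubleshooting step, it can help to see which Java executable, if any, is visible from within R. The snippet below is a minimal sketch using only base R. It inspects the java found on the system PATH, which is an assumption: the Java installation that R's Java bindings end up using may differ.

```r
# Locate a `java` executable on the PATH; Sys.which() returns "" if none is found.
java_path <- Sys.which("java")

if (nzchar(java_path)) {
  # `java -version` writes its report to stderr, so capture both streams.
  java_version <- suppressWarnings(
    system2(java_path, "-version", stdout = TRUE, stderr = TRUE)
  )
  print(java_version)
} else {
  message("No java executable was found on the PATH.")
}
```

If the reported version is older than 1.8, or no executable is found, installing or updating Java is the likely next step.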
We begin by loading the package and some example data from the quanteda R package. In this example, we will make use of five U.S. presidential inaugural speeches.
```r
library(phrasemachine)

# load in U.S. presidential inaugural speeches from Quanteda example data.
corp <- quanteda::corpus(quanteda::inaugTexts)

# use first 5 documents for example
documents <- quanteda::texts(corp)[1:5]

# take a look at the document names
print(names(documents))
```
```
## [1] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson"
## [5] "1805-Jefferson"
```
Phrasemachine provides one main function, phrasemachine(), which takes as input a vector of strings (one string per document) or a quanteda corpus object. This function returns phrases extracted from the input documents in one of two forms. The first option, specified by selecting return_phrase_vectors = TRUE, returns a list object. Each entry in the list represents a document and is a character vector with one extracted phrase per element. If return_phrase_vectors = FALSE, then the function returns a character vector. Each entry in this character vector will be an extracted phrase, and the unigrams in these phrases will be underscore separated. Selecting this option allows the user to assign the resulting character vector back into a quanteda corpus object for use in their normal preprocessing pipeline.
The minimum and maximum token length for phrases may be specified via the minimum_ngram_length and maximum_ngram_length arguments, which default to 1 and 8 respectively. The regex argument can be used to supply a custom regular expression for phrase extraction, but defaults to the SimpleNP grammar in Handler et al. (2016). If return_phrase_vectors = TRUE, then the user may additionally specify return_tag_sequences = TRUE (the default value is FALSE) to return the tag sequences associated with each phrase. This can be useful if the user wishes to perform further selection on specific tag patterns.
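To make the roles of the regular expression and the length limits concrete, here is a minimal base-R sketch of pattern-based phrase extraction over a tagged sentence fragment. Everything in it is an assumption made for illustration: the one-letter tags are assigned by hand rather than by a tagger, extract_spans() is a hypothetical helper and not part of the package, and the pattern is an illustrative noun-phrase expression rather than a value read from the package defaults.

```r
# Toy input: one coarse tag per token (A = adjective, N = noun,
# P = preposition, D = determiner). Tags are hand-assigned for this sketch.
tokens <- c("14th", "day", "of", "the", "present", "month")
tags   <- c("A", "N", "P", "D", "A", "N")

# Illustrative noun-phrase pattern (an assumption, not read from the package).
pattern <- "^(A|N)*N(PD*(A|N)*N)*$"

# Hypothetical helper: return every token span whose full tag string matches
# the pattern and whose length lies between min_len and max_len.
extract_spans <- function(tokens, tags, pattern, min_len = 1, max_len = 8) {
  out <- character(0)
  n <- length(tokens)
  for (start in seq_len(n)) {
    for (end in start:n) {
      len <- end - start + 1
      if (len < min_len || len > max_len) next
      tag_string <- paste(tags[start:end], collapse = "")
      if (grepl(pattern, tag_string)) {
        # Join the unigrams with underscores, as in the package output.
        out <- c(out, paste(tokens[start:end], collapse = "_"))
      }
    }
  }
  out
}

toy_phrases <- extract_spans(tokens, tags, pattern, min_len = 2, max_len = 8)
print(toy_phrases)
```

Because every matching span is kept, overlapping phrases such as 14th_day and 14th_day_of_the_present_month are both returned, which agrees with the overlapping phrases visible in the example output below.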
```r
# run phrasemachine
phrases <- phrasemachine(documents,
                         minimum_ngram_length = 2,
                         maximum_ngram_length = 8,
                         return_phrase_vectors = TRUE,
                         return_tag_sequences = TRUE)
```
```
## Currently tagging document 1 of 5
## Currently tagging document 2 of 5
## Currently tagging document 3 of 5
## Currently tagging document 4 of 5
## Currently tagging document 5 of 5
## Extracting phrases from document 1 of 5
## Extracting phrases from document 2 of 5
## Extracting phrases from document 3 of 5
## Extracting phrases from document 4 of 5
## Extracting phrases from document 5 of 5
```
```r
# look at some example phrases
print(phrases[[1]]$phrases[1:10])
```
```
##  [1] "Fellow-Citizens_of_the_Senate" "House_of_Representatives"
##  [3] "vicissitudes_incident" "vicissitudes_incident_to_life"
##  [5] "incident_to_life" "greater_anxieties"
##  [7] "14th_day" "14th_day_of_the_present_month"
##  [9] "day_of_the_present_month" "present_month"
```
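When tag sequences are returned alongside the phrases, they can be used to keep only phrases with a particular shape. The sketch below works on a small mock result rather than real package output. The field name tags and the one-letter tag strings are assumptions made for this example, so check the structure of your own result (for instance with str()) before adapting it.

```r
# Mock result for a single document. The `tags` field name and the
# one-letter tag strings are assumptions made for this sketch.
doc_result <- list(
  phrases = c("greater_anxieties", "incident_to_life", "present_month"),
  tags    = c("AN", "NPN", "AN")
)

# Keep only phrases made of one or more adjectives followed by a single noun.
keep <- grepl("^A+N$", doc_result$tags)
adj_noun_phrases <- doc_result$phrases[keep]
print(adj_noun_phrases)
```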
From here, the user may include the phrases extracted by phrasemachine() in any downstream analyses.
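As one example of a downstream step, per-document phrase vectors can be turned into a document-by-phrase count matrix using base R alone. This is a minimal sketch on made-up data: the phrase_vectors list below is hypothetical and simply stands in for a list with one character vector of phrases per document.

```r
# Hypothetical phrase vectors for two documents (made-up data for this sketch).
phrase_vectors <- list(
  doc1 = c("House_of_Representatives", "present_month", "present_month"),
  doc2 = c("House_of_Representatives", "public_good")
)

# Shared vocabulary of distinct phrases across all documents.
vocab <- sort(unique(unlist(phrase_vectors)))

# Count each vocabulary phrase per document; one column per document.
counts <- vapply(
  phrase_vectors,
  function(p) as.integer(table(factor(p, levels = vocab))),
  integer(length(vocab))
)
rownames(counts) <- vocab

# Transpose so that rows are documents and columns are phrases.
doc_phrase_matrix <- t(counts)
print(doc_phrase_matrix)
```

A matrix of this shape can then be passed to most modeling or clustering routines that expect document-term counts.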