VIGNETTE STILL UNDER CONSTRUCTION

SpeedReader is designed for high performance text processing and analysis in R. This vignette will go over much of the functionality available in the package, and how to get the most out of SpeedReader. You can get the latest development version of SpeedReader on GitHub here. It is currently under construction, and will see periodic updates through the Fall and Winter of 2017/2018.

Package Overview

SpeedReader is designed to complement what I see as the mainstream packages for text analysis in R. While I included functionality to preprocess data into a document-term matrix in SpeedReader, I have shifted to primarily relying on quanteda to do all of my preprocessing, and then taking the resulting sparse document-term matrix as input for many of the methods in this package. Quanteda is simply the fastest, most fully featured, and best maintained package for text preprocessing these days, and I say that having implemented much of the same functionality myself. Similarly, for many topic modeling applications, the topicmodels or LDA packages will work just fine. And for many POS tagging tasks, openNLP or SpacyR are totally great.

Where SpeedReader shines is when you need access to some of the functionality that is only provided in this package, or when you need really high performance or are working with extremely large datasets. At a high level, SpeedReader provides the following functionality:

  • A front end for Stanford’s CoreNLP libraries for POS tagging and finding named entities.
  • Term-category association analyses including PMI and TF-IDF, with various forms of weighting.
  • A front end for topic modeling using MALLET, that also reads the results back into R and presents them in a series of data.frames.
  • A set of methods to compare documents and document versions using sequences of n-grams, and ensembles of Dice coefficients.
  • An implementation of the informed Dirichlet model from Monroe et al. (2008), along with publication quality funnel plots.
  • Functions for forming complex contingency tables.
  • Functions for displaying text in LaTeX tables.
  • Functionality to read in and preprocess text data into a document-term matrix.

In the rest of this vignette, I will discuss this functionality in greater detail, but we will begin with a simple example of preprocessing data in quanteda to make it ready for use with SpeedReader.

From quanteda to SpeedReader

For many of the analyses we may wish to conduct with SpeedReader, we will need to start with a document-term matrix (DTM). This matrix contains one row for each document, and one column for each unique term in the vocabulary. While SpeedReader includes functionality to generate document-term matrices on its own, I highly recommend you use the quanteda R package instead. quanteda is the best tool currently available for preprocessing text in R: I use it myself, and having implemented much of the same functionality in SpeedReader, I can say that quanteda's efficiency and flexibility make it the better choice for preprocessing in almost all situations. The one tricky thing about quanteda is that it outputs its own dfm class of object. SpeedReader is primarily built around the slam::simple_triplet_matrix sparse matrix data type, so we will need to convert the object that quanteda produces to a simple triplet matrix. Fortunately, SpeedReader includes the convert_quanteda_to_slam() function to do this. Below is an example generating a document-term matrix with quanteda and then converting it to a slam::simple_triplet_matrix object.

# load the package:
library(SpeedReader)

# Load in example data:
data("congress_bills")

# Use the quanteda::dfm function to generate a document-term matrix: 
quanteda_dtm <- quanteda::dfm(congress_bills)

# Convert to a slam::simple_triplet_matrix object:
dtm <- convert_quanteda_to_slam(quanteda_dtm)
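
If you want to verify that the conversion worked, a minimal check is to look at the class and dimensions of the resulting object; the dimensions should match those of the quanteda dfm we started with:

# A quick sanity check on the converted object: it should be a
# simple_triplet_matrix with the same dimensions as the quanteda dfm.
class(dtm)
dim(dtm)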

Term-category Associations

A common task in text analysis is determining associations between terms (words, phrases, etc.) and categories (e.g. documents written by members of different political parties, at different times, etc.). The goal is often to find a set of terms that are highly associated with a particular category as compared to other categories. A related task is simply to find terms that are very informative about categories in general, versus those terms that do not distinguish between categories (boilerplate, stopwords, etc.). SpeedReader includes a number of functions to calculate different types of term-category associations, and I will discuss them in turn in this section.

All the methods included in this package for calculating term-category associations take as input a contingency table. A contingency table is a (sparse) matrix where each column represents a unique vocabulary term in the corpus, and each row represents the count of that term in a given category. SpeedReader includes the contingency_table() function to generate these tables. This function takes two objects as input: the first is a document-term matrix, as generated in the example in the previous section. The second is a data.frame of document covariates, where each row of the data.frame corresponds to the same row in the document-term matrix, and each column records a numeric/categorical covariate for each document. Note that all covariates will be treated as categorical when forming the contingency table, so be aware of this if you provide a numeric covariate. Below is a toy example of a contingency table using the congressional bills example data. We start by loading in the raw documents and forming a document-term matrix. Next, we create some example covariate data (note that the covariate values have been fabricated for this example), and then we feed these two objects to the contingency_table() function:

# load the package:
library(SpeedReader)
# Load in example data and form a dtm:
data("congress_bills")
# Only keep tokens that are all letters and of length 4 or greater:
quanteda_dtm <- quanteda::dfm(congress_bills,
                              select = "[a-zA-Z]{4,}",
                              valuetype = "regex")
dtm <- convert_quanteda_to_slam(quanteda_dtm)

# Now create some fake covariate data
doc_covariates <- data.frame(Chamber = c(rep("House",40), 
                                         rep("Senate",39)),
                             Party = c(rep("Democrat",20),
                                       rep("Republican",20),
                                       rep("Democrat",20),
                                       rep("Republican",19)),
                             Date = c(rep("Jan10",30),
                                      rep("Jan11",29),
                                      rep("Jan14",20)),
                             stringsAsFactors = FALSE)

# Create a contingency table with all possible covariate combinations:
cont_table <- contingency_table(metadata = doc_covariates,
                                document_term_matrix = dtm)

We may also want to use only certain covariates to form the contingency table. To do so, we specify the variables_to_use argument, which can be a vector of length one or greater containing either the column indices in the metadata data.frame or the column names of the variables we would like to use to form the contingency table. In the example above, we created a contingency table with 2 x 2 x 3 = 12 categories. Below, I will demonstrate creating contingency tables using smaller subsets of variables:

# First, let's use numeric column indexing:
cont_table <- contingency_table(metadata = doc_covariates,
                                document_term_matrix = dtm,
                                variables_to_use = c(1,2))

# Next, let's use column names for indexing:
cont_table <- contingency_table(metadata = doc_covariates,
                                document_term_matrix = dtm,
                                variables_to_use = "Chamber")

One final important point to note is that the contingency_table() function returns a slam::simple_triplet_matrix object, except that it also contains a “document_indices” attribute, which (in the example above) can be accessed using the attr(cont_table,"document_indices") R command. This attribute records the row indices in the original document-term matrix that are associated with each category, and these are required by the feature_selection() function, which we will discuss in more detail below. The “document_indices” attribute is a list object and can be saved independently for future use. It is important to note that if one subsets a sparse matrix returned by the contingency_table() function, this will remove the “document_indices” attribute, so you will have to manually save it and then re-assign it if you wish to do some custom subsetting before using the feature_selection() function. In general, it is recommended that the user do any subsetting before creating a contingency table, and then plug the created table directly into the term-category association functions.
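
As a minimal sketch of that last point (columns_to_keep is just a hypothetical vector of column indices used for illustration):

# Save the "document_indices" attribute before any custom subsetting:
doc_indices <- attr(cont_table, "document_indices")
# Subsetting drops the attribute (columns_to_keep is hypothetical):
subset_table <- cont_table[, columns_to_keep]
# Manually re-assign the attribute before calling feature_selection():
attr(subset_table, "document_indices") <- doc_indices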

Once we have created our contingency table, we can calculate term-category associations using one of several functions provided by SpeedReader. We will start with the pmi() function, which calculates the pointwise mutual information of terms and categories, along with the most and least “distinctive” and “salient” terms. Distinctive terms are those that appear overwhelmingly in only a few categories, while non-distinctive terms appear relatively uniformly across all categories. Salient terms are those that are both distinctive and appear very frequently in the corpus relative to other terms. The pmi() function takes a number of arguments, all of which are documented, but I will discuss a few here. The first is display_top_x_terms, which defaults to 20 and determines how many top terms will be displayed to the user for each category. The second is term_threshold, which defaults to 5 and determines which terms are kept in the contingency table for the purposes of calculating PMI, based on their frequency in the corpus. PMI term rankings are often very sensitive to infrequently appearing terms, so setting a reasonable threshold for the number of times a term must occur can be helpful.

# calculate PMI and display top terms:
pmi_results <- pmi(contingency_table = cont_table,
                   display_top_x_terms = 20,
                   term_threshold = 5)

The pmi() function returns a list object with 8 entries, in addition to printing out useful information to the screen. Of note, in the printed output, the “Local Count” refers to the count of the term in the current category and the “Global Count” refers to the total count of the term in the contingency table. The “ePMI” is just the exponentiated PMI or “lift” of the term. The 8 entries in the list object are described below:

  • $pmi_table – A (sparse) matrix object containing the PMI values for all terms with non-zero counts in each category. For those terms with zero counts, the PMI should be negative infinity, so you will need to manually adjust these values for comparison between sparse and dense representations.
  • $pmi_ranked_terms – These are all terms that appear in each category ranked by their PMI score, for each category. This is stored as a list object with a vector corresponding to each row in the PMI table.
  • $ranked_pmi – These are the PMI scores corresponding to the ranked terms in the previous list entry, and follow the same structure.
  • $distinctive_terms – This is a vector of all terms in the vocabulary ranked by their distinctiveness score.
  • $non_distinctive_terms – This is a vector of all terms in the vocabulary ranked by the negative of their distinctiveness score.
  • $salient_terms – This is a vector of all terms in the vocabulary ranked by their salience score.
  • $non_salient_terms – This is a vector of all terms in the vocabulary ranked by the negative of their salience score.
  • $contingency_table – This holds the contingency table used to calculate the PMI table. Can be useful for replicating results.
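
As a quick illustration of working with these entries (a minimal sketch using the output from the pmi() call above):

# Look at the ten most distinctive and ten most salient terms:
head(pmi_results$distinctive_terms, 10)
head(pmi_results$salient_terms, 10)

# Look at the top PMI-ranked terms for the first category:
head(pmi_results$pmi_ranked_terms[[1]], 10)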

Assessing Document Editing

In this section, we are going to cover a method for assessing the similarity between pairs of documents, which is implemented in SpeedReader. Note that this method requires documents to be in their original form (a single .txt file or string per document), and does not require a document-term matrix. The idea is that we give the document_similarities() function a character vector of documents (one document per entry), or point it to a folder containing .txt files, and then it will automatically produce a bunch of similarity statistics between all possible document pairs. We can also give the function an additional argument asking it only to compare certain pairs of documents. For now, it is mostly important to know that we will provide this function with input, and that it will produce a data.frame as output (or save the results to disk), with a bunch of metrics for each pairwise comparison. After we see how the code works, I will talk about what these comparison metrics mean.

To start out, you will want to download some example data, which can be found here. This zip archive contains 79 text files covering all versions of the first 20 bills introduced in the U.S. House of Representatives, and all versions of the first 21 bills introduced in the U.S. Senate during the 103rd session of Congress (1993-1994). This totals 79 documents (with multiple versions for many of these bills). You should download the zip archive, save it somewhere you can find it, and then extract it to a folder so you can take a look and use it in this example. Before we do anything else though, we need to load the SpeedReader package:

# Start by loading the package:
library(SpeedReader)

I am going to start out by providing an input directory containing the 79 documents, and making comparisons between all of them:

# First, we will want to set our working directory to the folder where all of 
# bill text is located. For me, this looks like:
directory <- "~/Desktop/Bill_Data"
# but it will likely be different for you. It is a good idea to save this path
# as a separate string, as we will need to provide it to the 
# document_similarities() function later. Now we can go ahead and set out 
# directory:
setwd(directory)

# Once we have done that, we will want to get a list of all files contained in
# this directory. Alternatively you can create this character vector of file
# names manually, or read it in. The point is that we should be left with a
# character vector containing all of the names of the .txt files associated
# with all of the documents we want to compare, and no other file names. You
# will want to double check this vector is correct.
files <- list.files()

Now that we have our list of files, and the directory where they live, we can give this information to the document_similarities() function so that it can calculate similarities for us. We will need to specify the filenames and input_directory fields, and then we can set the ngram size on which we want to compare the documents. For text without stopwords removed, I like to use a number between 5 and 10, but you will have to play around with this and look at the output to find a number that works best for your corpus. I am also setting parallel to FALSE, and selecting prehash = TRUE. In general, it is preferable to set prehash = TRUE, as this will dramatically speed up computation, but it will also use more RAM, so in cases where you are dealing with a very large number of documents, you may want to set it to FALSE if you are running out of RAM (however, this will make the comparisons take much longer). Using the arguments shown below, we will make all pairwise comparisons between all 79 documents (3,081 comparisons). On my Mac laptop, this takes about 30 seconds on 1 core:

results <- document_similarities(filenames = files,
                                 input_directory = directory,
                                 ngram_size = 5,
                                 prehash = T)
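
Before going over the columns in detail, here is a quick way to peek at the identifier columns discussed next (a minimal sketch, assuming the call above completed):

# Look at the identifier columns for the first few document pairs:
head(results[, c("doc_1_ind", "doc_2_ind", "doc_1_file", "doc_2_file")])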

What gets returned to us is a data.frame with 3,081 rows and 41 columns. Looking all the way to the last four columns, we see doc_1_ind and doc_2_ind columns, and doc_1_file and doc_2_file columns. The “ind” columns tell us which two input files were being compared by referencing their positions in the files character vector. The “file” columns give us the actual file names, which can be quite useful for linking back up to other metadata. You will see that there are a number of other columns that reference “v1” and “v2”, and these refer to doc_1 and doc_2 respectively. Furthermore, whenever “addition” is referenced, this has to do with text that is in the “v2” or doc_2 document that was not found in the “v1” (doc_1) document. Similarly, whenever “deletion” is referenced, this has to do with text that is in the “v1” or doc_1 document that was not found in the “v2” (doc_2) document. Alternatively, if we had the documents already read into R as a character vector (one entry per document), then we could have just used the documents argument. There is an example of this method below (using a built-in version of the same data). Note that this is often more unwieldy, unless you were provided with the data as a document vector.

# Load in the Congressional Bills:
data("congress_bills")

# Generate similarity metrics:
results <- document_similarities(documents = congress_bills,
                                 ngram_size = 5,
                                 prehash = T)

Note that if we run this version of the code, we will get back a data.frame with two fewer columns, as the last two columns (containing the file names) are no longer necessary.

Additionally, we may want to only run our code for a subset of document comparisons. In our example data, we may only want to compare versions of the same document. To do so, we will need to make use of the doc_pairs argument.

# Get the filenames:
directory <- "~/Desktop/Bill_Data"
setwd(directory)
files <- list.files()

# Break them apart into their constituent parts and form a data.frame:
metadata <- data.frame(chamber = rep("",length(files)),
                       bill = rep("",length(files)),
                       version = rep("",length(files)),
                       stringsAsFactors = FALSE)
# Generate the metadata:
for (i in 1:length(files)) {
    # split up the file names:
    temp <- stringr::str_split(files[i],"(-|\\.)")[[1]]
    # save the relevant parts:
    metadata[i,] <- temp[2:4]
}

# Now find all document pairs:
doc_pairs <- NULL
# Start with the 20 HR bills:
for (i in 1:20) {
    cur <- which(metadata$chamber == "HR" & metadata$bill == i)
    
    if (length(cur) > 1) {
        temp <- t(combn(cur,2))
        doc_pairs <- rbind(doc_pairs,temp)
    } 
}

# Move on to the 21 S bills:
for (i in 1:21) {
    cur <- which(metadata$chamber == "S" & metadata$bill == i)
    if (length(cur) > 1) {
        temp <- t(combn(cur,2))
        doc_pairs <- rbind(doc_pairs,temp)
    } 
}


# Generate similarity metrics:
results <- document_similarities(filenames = files,
                                 input_directory = directory,
                                 doc_pairs = doc_pairs,
                                 ngram_size = 5,
                                 prehash = T)

As you will see, this generates only 77 comparisons, each corresponding to a comparison between two versions of the same bill. A similar approach can be applied to other corpora, with the appropriate modifications to the code.

In addition to the basic functionality shown above, there are a couple of other options to keep in mind with the document_similarities() function. The first of these is the cores argument. When the cores argument is set to a value greater than 1, the document comparisons will be parallelized. This will produce a near linear speedup in the number of cores used (10 times as many cores means it will run in roughly 1/10th the time), but RAM use will also increase with the number of cores used (using 10 cores can use up to 10 times the RAM). One way to control the amount of RAM needed is to specify the max_block_size argument, which limits each parallel process to working on at most max_block_size comparisons at a time. If the output_directory argument is specified (highly recommended for large jobs), then when each parallel process is done with its current block of comparisons, it will save those results to disk in the specified directory. This can be a great way to save your work as you go, and to use less RAM overall for large numbers of comparisons (especially more than 50 million or so).
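
To make this concrete, below is a sketch of what such a call might look like. The values for cores, max_block_size, and the output directory are purely illustrative, not recommendations:

# A sketch of a parallel run that saves each completed block of comparisons to
# disk as it goes. The number of cores, block size, and output directory below
# are example values only; adjust them for your machine.
results <- document_similarities(filenames = files,
                                 input_directory = directory,
                                 ngram_size = 5,
                                 prehash = TRUE,
                                 cores = 4,
                                 max_block_size = 1000000,
                                 output_directory = "~/Desktop/Similarity_Results")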

Another option for reducing the size of the output data is to impose a unigram similarity threshold on the document similarity statistics that get returned by document_similarities(). By setting the unigram_similarity_threshold argument to a number between 0 and 1, the full similarity calculations will only be performed for document pairs where more than the specified proportion of the unigrams in at least one of the documents have a match in the other. Selecting a threshold of 0.8, for example, will often reduce the number of comparisons returned by over 99 percent. See the code below:

# Load in the Congressional Bills:
data("congress_bills")

# Generate similarity metrics:
results <- document_similarities(documents = congress_bills,
                                 ngram_size = 5,
                                 prehash = T,
                                 unigram_similarity_threshold = 0.8)

A further option is to have SpeedReader automatically chunk up your document comparisons so that you can run pairwise comparisons between a very large number of documents. Note that the options described below are only applicable if you want to compare each document to all other documents; otherwise, see the doc_pairs approach above. Imagine that you have a directory full of documents, as in the first example in this section (the same logic applies to a large character vector full of documents). The code illustrated below will perform all pairwise comparisons among these documents in blocks of 10,000 documents at a time, in parallel on 40 cores, using the document_block_size argument. Comparing about 10,000 documents at a time is about the largest block size you will want to use, as this already results in chunks of 50-100 million comparisons at a time. With documents containing a few thousand words on average, this code should use about 200GB of RAM running on 40 cores. This is a lot of memory, but in my testing on a twin 10-core Xeon E5-2690v2 Ivy Bridge (3.00 GHz) workstation, this setup will do about 10 billion pairwise comparisons a day, which allows one to work with hundreds of thousands of documents. This can be accomplished with the following snippet of code:

# Give the file paths to a folder full of .txt files to be compared, and to a
# folder where intermediate results should be saved. These paths are just
# placeholders; replace them with your own locations:
path_to_files <- "~/Desktop/Lots_of_Bill_Data"
lots_of_files <- list.files(path_to_files)
place_to_store_intermediate_results <- "~/Desktop/Similarity_Results"

# Now generate comparison metrics, which will be stored in output_directory in
# .RData files.
results <- document_similarities(
    filenames = lots_of_files,
    input_directory = path_to_files,
    ngram_size = 5,
    output_directory = place_to_store_intermediate_results,
    cores = 40,
    prehash = TRUE,
    document_block_size = 10000,
    unigram_similarity_threshold = 0.8)

The nice thing about this chunking approach is that it allows one to work with a huge number of comparisons efficiently. With access to enough cores and RAM, this method could theoretically be applied to all pairwise comparisons between several million documents. The results are then saved in many separate .RData files (each containing some fraction of the ~100 million observations that met the 80 percent unigram similarity threshold). This can make loading in the results on your laptop more feasible, and gets around the roughly 2.1 billion (2^31 - 1) row limit on data.frames in R.
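
When it comes time to read those intermediate files back in, something along the following lines can work. This is only a sketch: it assumes each .RData file contains a single data.frame of results, so check what each file actually contains before stacking them this way:

# List the intermediate .RData files saved by document_similarities():
result_files <- list.files(place_to_store_intermediate_results,
                           pattern = "\\.RData$",
                           full.names = TRUE)

# Load each file and stack the results into one data.frame. load() returns
# the names of the objects it created, so we retrieve the first one with get():
all_results <- NULL
for (f in result_files) {
    object_names <- load(f)
    all_results <- rbind(all_results, get(object_names[1]))
}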

The final useful argument to the document_similarities() function is the add_ngram_comparisons argument. This argument allows the user to append additional n-gram comparison statistics to the data.frame returned by the function. The n-gram sizes to be used can be supplied as a vector, and two additional columns will be appended to the results for each n-gram size included. These will have the names ngram_x_prop_a_in_b and ngram_x_prop_b_in_a, where x is the n-gram size; the first records the proportion of n-grams in document 1 that have a match in document 2, and the second records the proportion of n-grams in document 2 that have a match in document 1.

# Load in the Congressional Bills:
data("congress_bills")

# Generate similarity metrics:
results <- document_similarities(documents = congress_bills,
                                 ngram_size = 5,
                                 prehash = T,
                                 add_ngram_comparisons = c(1,2,3,10,20))
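
The appended columns can then be inspected directly. For example, the unigram comparisons requested above should appear under the names implied by the convention described earlier (a minimal sketch):

# Peek at the appended unigram comparison columns:
head(results[, c("ngram_1_prop_a_in_b", "ngram_1_prop_b_in_a")])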

With all this said, we still need to go over what all of the column names mean in the data.frame that gets returned from this function. For illustration purposes, we are going to use two other functions included in the package that allow us to plot where in these two document versions there are matches and mismatches. In this example, the ngram_sequnce_plot() function shows us which overlapping 5-grams in 103-HR-5-EH have a match in 103-HR-5-IH. All of the blocks shaded blue (blocks each represent a 5-gram in this example, and go from left to right and then down by row) had a match in the earlier version (IH), while all of the blocks shaded in orange did not have a match.

# Load the package:
library(SpeedReader)

# Load in the Congressional Bills:
data("congress_bills")

# Find the locations of overlapping n-gram matches and mismatches in the 
# document pair.
matches <- ngram_sequence_matching(congress_bills[29],
                                   congress_bills[30],
                                   ngram_size = 5)
## Complete in: 0.035 seconds...
# Generate a plot of these matches and mismatches:
ngram_sequnce_plot(matches,
                   custom_title = "Example Comparison of 103-HR-5-IH and 103-HR-5-EH.")

This sequence of matching and mismatching n-grams depicted above serves as the basis for all of the comparison metrics output by the document_similarities() function. Below, I describe how these statistics are calculated, and provide some interpretation for the more complex ones. It is important to note that many of these statistics are heavily related. The reason for outputting so many statistics is to provide the user with the raw materials to make further comparisons, because different characteristics may be more or less important in different applications (a short example of combining these statistics appears after the list).

  • addition_granularity – This statistic is meant to capture the degree of granularity of the additions made between document 1 and document 2. It ranges between 0 and 1, with 0 indicating that the text of document 1 was completely replaced with one block in document 2 (a completely non-granular addition), and values close to 1 indicating that the additions were very small and sparse (granular) on average. A value of 1 indicates that all n-grams in document 2 were in document 1. Mathematically, this quantity is equal to 1 - (the average length of sequences of n-grams in document 2 that did not have a match in document 1, divided by the number of overlapping n-grams in document 2).
  • deletion_granularity – This statistic is meant to capture the degree of granularity of the deletions made between document 1 and document 2. It ranges between 0 and 1, with 0 indicating that the text of document 1 was completely replaced with one block in document 2 (a completely non-granular deletion), and values close to 1 indicating that the deletions were very small and sparse (granular) on average. A value of 1 indicates that all n-grams in document 1 were in document 2. Mathematically, this quantity is equal to 1 - (the average length of sequences of n-grams in document 1 that did not have a match in document 2, divided by the number of overlapping n-grams in document 1). This quantity is highly related to the addition_granularity, but need not be exactly correlated except in the case of complete replacements.
  • addition_scope – This statistic is meant to capture how much of document 2 was not in document 1. It is calculated as the proportion of overlapping n-grams in document 2 that do not have a match in document 1. A value of 0 implies that all n-grams in document 2 had a match in document 1, while a value of 1 indicates that none of the overlapping n-grams in document 2 had a match in document 1.
  • deletion_scope – This statistic is meant to capture how much of document 1 was not in document 2. It is calculated as the proportion of overlapping n-grams in document 1 that do not have a match in document 2. A value of 0 implies that all n-grams in document 1 had a match in document 2, while a value of 1 indicates that none of the overlapping n-grams in document 1 had a match in document 2.
  • average_addition_size – This statistic captures the average size of the additions made between document 1 and 2. The larger its value, the more “blocky” (and longer) the additions were.
  • average_deletion_size – This statistic captures the average size of the deletions made between document 1 and 2. The larger its value, the more “blocky” (and longer) the deletions were.
  • scope – This is the mean of addition_scope and deletion_scope, and is meant as a summary of overall changes.
  • average_edit_size – This is the mean across addition and deletion sizes.
  • prop_deletions – This statistic ranges between 0 and 1, and is meant to capture how many unique deletions/edits were made between document 1 and 2. The intuition is that there is a maximum number of possible deletions/edits that could be made between the two document versions and still be picked up as unique deletions. A value of 1 for this statistic means that the maximum number of unique deletions/edits were made between versions, while a value close to zero means that only a few (or in the limiting case zero) block deletions were made. Mathematically, this statistic is equal to ((n-gram size + 1) times the number of unique deletions) / the total number of overlapping n-grams in document 1.
  • prop_additions – This statistic ranges between 0 and 1, and is meant to capture how many unique additions were made between document 1 and 2. The intuition is that there is a maximum number of possible additions that could be made between the two document versions and still be picked up as unique additions. A value of 1 for this statistic means that the maximum number of unique additions were made between versions, while a value close to zero means that only a few (or in the limiting case zero) block additions were made. Mathematically, this statistic is equal to ((n-gram size + 1) times the number of unique additions) / the total number of overlapping n-grams in document 2.
  • prop_changes – This statistic is equal to the mean of prop_deletions and prop_additions.
  • num_match_blocks_v1 – This statistic records the number of unique sequences of overlapping n-grams in document 1 that have a match in document 2.
  • max_match_length_v1 – This statistic records the maximum length contiguous sequence of overlapping n-grams in document 1 that have a match in document 2.
  • min_match_length_v1 – This statistic records the minimum length contiguous sequence of overlapping n-grams in document 1 that have a match in document 2.
  • mean_match_length_v1 – This statistic records the mean length contiguous sequence of overlapping n-grams in document 1 that have a match in document 2.
  • median_match_length_v1 – This statistic records the median length contiguous sequence of overlapping n-grams in document 1 that have a match in document 2.
  • match_length_variance_v1 – This statistic records the variance in the length of contiguous sequences of overlapping n-grams in document 1 that have a match in document 2.
  • num_nonmatch_blocks_v1 – This statistic records the number of contiguous sequences of overlapping n-grams in document 1 that do not have a match in document 2.
  • max_nonmatch_length_v1 – This statistic records the maximum length contiguous sequence of overlapping n-grams in document 1 that do not have a match in document 2.
  • min_nonmatch_length_v1 – This statistic records the minimum length contiguous sequence of overlapping n-grams in document 1 that do not have a match in document 2.
  • mean_nonmatch_length_v1 – This statistic records the mean length contiguous sequence of overlapping n-grams in document 1 that do not have a match in document 2.
  • median_nonmatch_length_v1 – This statistic records the median length contiguous sequence of overlapping n-grams in document 1 that do not have a match in document 2.
  • nonmatch_length_variance_v1 – This statistic records the variance in the length of contiguous sequences of overlapping n-grams in document 1 that do not have a match in document 2.
  • total_ngrams_v1 – This statistic records the total number of overlapping n-grams in document 1.
  • num_match_blocks_v2 – This statistic records the number of unique sequences of overlapping n-grams in document 2 that have a match in document 1.
  • max_match_length_v2 – This statistic records the maximum length contiguous sequence of overlapping n-grams in document 2 that have a match in document 1.
  • min_match_length_v2 – This statistic records the minimum length contiguous sequence of overlapping n-grams in document 2 that have a match in document 1.
  • mean_match_length_v2 – This statistic records the mean length contiguous sequence of overlapping n-grams in document 2 that have a match in document 1.
  • median_match_length_v2 – This statistic records the median length contiguous sequence of overlapping n-grams in document 2 that have a match in document 1.
  • match_length_variance_v2 – This statistic records the variance in the length of contiguous sequences of overlapping n-grams in document 2 that have a match in document 1.
  • num_nonmatch_blocks_v2 – This statistic records the number of contiguous sequences of overlapping n-grams in document 2 that do not have a match in document 1.
  • max_nonmatch_length_v2 – This statistic records the maximum length contiguous sequence of overlapping n-grams in document 2 that do not have a match in document 1.
  • min_nonmatch_length_v2 – This statistic records the minimum length contiguous sequence of overlapping n-grams in document 2 that do not have a match in document 1.
  • mean_nonmatch_length_v2 – This statistic records the mean length contiguous sequence of overlapping n-grams in document 2 that do not have a match in document 1.
  • median_nonmatch_length_v2 – This statistic records the median length contiguous sequence of overlapping n-grams in document 2 that do not have a match in document 1.
  • nonmatch_length_variance_v2 – This statistic records the variance in the length of contiguous sequences of overlapping n-grams in document 2 that do not have a match in document 1.
  • total_ngrams_v2 – This statistic records the total number of overlapping n-grams in document 2.
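
As an example of combining these raw materials into further comparisons, the sketch below recomputes the scope statistic by hand from its two components, and builds a simple custom quantity from the per-document n-gram totals. Both derived columns are illustrative additions, not part of the function's output:

# Recompute 'scope' by hand as a check on the definitions above:
results$scope_check <- (results$addition_scope + results$deletion_scope) / 2

# A simple custom quantity: total overlapping n-grams across both versions,
# which could be used to weight comparisons by document length.
results$total_ngrams <- results$total_ngrams_v1 + results$total_ngrams_v2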