Mortgage Enforcement Actions Data Analysis Methodology Report

by Eli Levine, Chavez Cheong, Ridge Ren, Emily Leung, Sam Wolter

Introduction

Beginning in Summer 2021 as a Duke Data+ project and drawing on the research of Justina Zou in the 2019-20 academic year, the American Predatory Lending (APL) Data subteam aimed to take a data-driven approach to understanding the 2008 Global Financial Crisis (GFC) through analyzing the text in state Mortgage Enforcement Actions (MEAs). The general aims of the team were as follows:

  • Collect the raw files of MEAs and save them for archival purposes.
  • Analyze overarching trends in the data and conduct an exploratory data analysis (EDA).
  • Extract text from MEAs and analyze the text for patterns through keyword marking and natural language processing (NLP).
  • Where advisable, explore correlations between significant legislative actions and enforcement patterns.
  • Present analysis through interactive dashboards and visualizations, as well as through written state summary reports.

This document details the methodology used to extract, process, analyze and present data from each state’s MEAs. To see the code used in the project, please refer to the team’s GitHub Repository.

Extraction

State governments archive their MEAs in different formats. Some states store their MEAs as a list of links on their state enforcement agency’s website, while others have created databases allowing the public to search a person’s name and see any MEAs issued against that individual (this latter mode of accessible online archiving sought to empower borrowers by improving access to information about the integrity of lenders or their agents). This variation meant that we had to create two separate approaches: one that downloaded files directly from lists of links, and another that used scraping code to pull the relevant files out of the search databases.

Direct Downloads

On the relevant agency websites of many states, PDFs of MEAs are posted as direct links, compiled in Excel files. To retrieve these MEAs, we built a short Python script using the Requests library, which downloaded the files rapidly and accurately. We applied this method to obtain PDFs from Arizona, Alaska, Oregon, Arkansas, and Washington. One limitation of this approach emerged in states with large numbers of MEAs: a small percentage of files failed to download. However, since these occasional failures appeared to be randomly scattered and only occurred in states with comparatively numerous data points, we did not think they warranted a revision of our methods.
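The core of this script was only a few lines; the sketch below gives a sense of the approach (the spreadsheet, column, and folder names are illustrative assumptions rather than the exact layout of any state's file).

    # Minimal sketch of the direct-download approach (spreadsheet, column, and
    # folder names are assumptions, not the exact layout of any state's file).
    import os
    import pandas as pd
    import requests

    links = pd.read_excel("mea_links.xlsx")          # Excel file of direct PDF links
    os.makedirs("pdfs", exist_ok=True)

    for i, url in enumerate(links["pdf_url"]):       # "pdf_url" is a hypothetical column name
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            with open(os.path.join("pdfs", f"mea_{i}.pdf"), "wb") as f:
                f.write(response.content)
        except requests.RequestException as e:
            # A small fraction of downloads failed in practice; log and move on.
            print(f"Failed to download {url}: {e}")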

Selenium Scraping

As noted above, some states stored PDFs in online search databases that enable individual consumers or employers to examine the background of potential financial agents. In these cases, we contacted the states to obtain Excel lists of MEA records from 2000-2010 (where available), and used Selenium as a web-scraping tool to automate the downloading of MEAs from the search databases. We used this method to obtain data from Ohio and Virginia.
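A hedged sketch of this automation is shown below; the search URL, element locators, and spreadsheet column name are placeholders rather than the states' actual sites and record formats.

    # Sketch of the Selenium-based database scraping (URL, locators, and
    # spreadsheet column are placeholders, not the states' actual sites).
    import time
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    records = pd.read_excel("mea_records.xlsx")       # list of MEA records from the state
    driver = webdriver.Chrome()

    for name in records["entity_name"]:               # hypothetical column name
        driver.get("https://example.gov/mea-search")  # placeholder search page
        search_box = driver.find_element(By.NAME, "q")
        search_box.clear()
        search_box.send_keys(name)
        search_box.submit()
        time.sleep(2)                                  # allow results to load
        for link in driver.find_elements(By.PARTIAL_LINK_TEXT, "Order"):
            link.click()                               # browser configured to save PDFs to disk
            time.sleep(1)

    driver.quit()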

Freedom of Information Act (FOIA)-Sourced Folders of PDFs

Some states do not make their MEAs publicly available. For these states, our team filed state freedom of information act (FOIA) requests to obtain the relevant files. A key shortcoming of this approach was that in most cases we did not receive any metadata (additional information about each MEA, such as the names of the entities charged, the date, and the type of MEA issued) alongside the documents. As a result, metadata either had to be extracted manually or through an automated extraction process, which created logistical issues because of our limited knowledge of each MEA’s structure. Our team received Georgia’s MEAs through APL team member Jacob Argue’s FOIA request: we obtained a full folder of Georgia’s MEAs but no accompanying metadata.

Text Extraction

Extracting Text from PDFs

To extract text from MEA PDFs, the Data subteam built an application interface around Amazon Web Services (AWS) Textract to automatically extract text from PDFs into a CSV file. Figure 1 provides a visual depiction of the different components of the application interface, in sequence.

Figure 1 Workflow for AWS Text Extraction (Cheong, 2021)

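As a rough illustration of how the Textract step can be wired together, the sketch below uses boto3's asynchronous text-detection calls; the bucket and file names are placeholders, and the actual application interface added batching, pagination, and CSV output on top of this.

    # Sketch of asynchronous text detection with AWS Textract via boto3
    # (bucket/key names are placeholders; the real interface handled batching,
    # result pagination, and writing the extracted text to CSV).
    import time
    import boto3

    textract = boto3.client("textract", region_name="us-east-1")

    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": "apl-mea-pdfs", "Name": "ohio/mea_0001.pdf"}}
    )
    job_id = job["JobId"]

    # Poll until the asynchronous job finishes.
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    # Collect the detected lines of text (first page of results only in this sketch).
    lines = [b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE"]
    print("\n".join(lines[:10]))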

Extracting Text from Webpages

While most states stored their MEAs as PDFs, some states stored their MEAs directly on web pages. In these cases, we were able to extract text directly from these webpages using BeautifulSoup, a Python package for web scraping. We obtained the text from Massachusetts MEAs in this manner.
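A minimal sketch of this kind of extraction is shown below; the URL and the container id are hypothetical stand-ins for the Massachusetts pages.

    # Sketch of webpage text extraction with BeautifulSoup (URL and container
    # id are hypothetical stand-ins for the Massachusetts enforcement pages).
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.mass.gov/enforcement-action")  # placeholder URL
    soup = BeautifulSoup(response.text, "html.parser")

    # Pull the visible text of the main content area and collapse whitespace.
    content = soup.find("div", {"id": "content"})     # hypothetical container id
    text = " ".join(content.get_text(separator=" ").split())
    print(text[:500])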

Analysis

Once we obtained, cleaned, and processed MEAs with the methods described above, we moved on to analyzing the data. This analysis consisted of a few sub-steps: EDA, NLP, and statutory analysis.

Exploratory Data Visualization and Metadata Analysis

To identify and analyze base trends in MEAs across different states, the Data team conducted exploratory data visualization using available metadata. The team made visualizations of the number of MEAs over time, the breakdown of MEAs by recipient entity type, and the breakdown of the type of enforcement action over time.

Figure 2 An example of EDA conducted to inform our perspective. This graph shows the distribution of MEAs labeled as Report and Recommendations in Ohio over time (Levine, 2021).


We made use of three different data visualization platforms over the course of the project: R (ggplot2), Kibana, and Tableau. R was extremely flexible and was therefore the platform of choice for generating static visualizations for each state report. However, the team also worked to create interactive visualizations, so that researchers could focus on data segments of interest and identify trends. Kibana proved especially valuable for dashboard prototyping, since it allows dashboard users to slice the data any way they wish and provides built-in text tokenization[1] and lemmatization[2]. However, while Kibana was effective as a standalone dashboard, it presented significant integration challenges when we explored options for deploying on the APL website. To mitigate this problem, the Data team replicated the dashboard in Tableau, preserving most of its features while also increasing the portability of our visualizations.

We drew only on metadata for most of our preliminary EDA, which was facilitated by the fact that our initial datasets had relatively similar and complete metadata. As we expanded our analysis to include more states, however, the state-by-state differences in available metadata increased. For example, the metadata available for Virginia, Washington, and Arizona differed, as seen in the table below.

Table 1 Metadata Differences between States Analyzed (Fall 2021)

Metadata / State            Virginia    Washington    Arizona
Date                        Present     Present       Present
Entity Type                 Absent      Absent        Present
Entity Name                 Present     Present       Present
Enforcement Action Type     Absent      Present       Present

This variation limited comparative analysis; one potential workaround would use Named Entity Recognition on the textual data to associate entity types with specific documents.
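A minimal sketch of that workaround, using spaCy's pretrained named entity recognizer, might look like the following (the sample sentence and the reliance on the small English model are assumptions for illustration).

    # Sketch of the proposed Named Entity Recognition workaround using spaCy
    # (assumes the small English model is installed:
    #  python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The Department issued a cease and desist order against Acme Mortgage Lending, Inc.")

    # ORG entities are candidates for the recipient entity's name, which could
    # then be mapped to an entity type for documents lacking that metadata.
    for ent in doc.ents:
        if ent.label_ == "ORG":
            print(ent.text, ent.label_)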

Natural Language Processing

With a large portion of available data consisting of text, we turned to NLP as the most sensible method for large-scale textual analysis.  Since analyzing patterns of the misconduct cited in the MEAs was a key goal, the team decided to focus on topic modeling, a process that classifies texts into different topics based on similarities in phrases used. A description of the workflow is shown below in Figure 3.

Figure 3: NLP Workflow (Cheong, Levine, 2021)


The team initially worked with a Python-based framework (Gensim and spaCy). We used spaCy because its excellent built-in lexicon heavily automated the process of identifying individual words in text (tokenization) and standardizing tenses and word forms (lemmatization). Gensim, in turn, provided robust support for spaCy-processed texts and high-quality visualizations that helped the team assess the coherence of topic models generated with Latent Dirichlet Allocation (LDA). Figure 4 provides an example of two related visualizations. On the left is a visual representation of the conceptual “distance” between different topics (how different one topic is from another); on the right are the most common words in a selected topic.

Figure 4 pyLDAvis Visualization (Cheong, Levine, 2021)

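A condensed sketch of this spaCy-plus-Gensim pipeline is shown below; the documents list is a stand-in for the extracted MEA text, and the preprocessing in the actual workflow was more extensive.

    # Condensed sketch of the spaCy + Gensim LDA pipeline (documents is a
    # stand-in for the extracted MEA text; real preprocessing was more extensive).
    import spacy
    from gensim import corpora
    from gensim.models import LdaModel
    import pyLDAvis.gensim_models

    nlp = spacy.load("en_core_web_sm")
    documents = ["...MEA text one...", "...MEA text two..."]   # placeholder documents

    # Tokenize and lemmatize with spaCy, dropping stopwords and punctuation.
    tokenized = [
        [tok.lemma_.lower() for tok in nlp(doc) if tok.is_alpha and not tok.is_stop]
        for doc in documents
    ]

    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(toks) for toks in tokenized]

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10, random_state=1)

    # Interactive inter-topic distance and top-term view, as in Figure 4.
    vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(vis, "lda_topics.html")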

However, spaCy had some shortcomings, chiefly its processing speed and heavy use of memory. After further exploration of alternatives, the team decided to shift to quanteda, a powerful R-based NLP package. Benchmarks from industry users[3] suggest that quanteda is roughly 50% faster than Gensim and uses about 40% less memory. The following two charts compare execution speed and memory consumption between quanteda and Gensim.

Figure 5 Execution and Memory Usage Comparison between Quanteda and Gensim (Watanabe, 2017)


As an added benefit, quanteda also provided several other metrics for assessing topic model coherence when selecting the number of topics.

Figure 6 Example of topic model metrics from quanteda (Levine, 2021)


We faced two key challenges in applying NLP techniques like topic modeling to legal texts like MEAs:

  1. MEAs (and most legal texts) contain a great deal of filler text (e.g., the Washington Department of Financial Institutions is located on “Israel St.” in Olympia, Washington, which caused “Israel” to appear as a common word in our word cloud), which makes it difficult to identify the sections of text that we are interested in. Filler text makes up a significant majority of MEA text: comparing the length of Ohio MEA text before and after filtering out filler text, we found that an average of 95% of the text across the MEAs examined fit that characterization.
  2. MEAs share many near-identical terms, such as the names of attorneys general, which hinders coherent topic model formation, since topic models are formed by identifying similar distributions of terms across different documents.

The team solved these problems with two approaches:

  1. Using key marker phrases to identify sections of text that were relevant (e.g. FINDINGS OF FACT, CONCLUSIONS OF LAW).
  2. Tuning the NLP topic models by using custom stopwords that frequently appear in most documents (e.g. names of attorneys general).

These approaches had the downside of being very time-consuming, making them difficult to scale to all documents across states. However, after meetings with several NLP experts, we concluded that, within the space of NLP, there are currently no better alternatives.
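To illustrate the two approaches, the sketch below keeps only the text between marker phrases and strips a custom stopword list before topic modeling; the marker phrases follow the examples above, while the sample text and stopword list are illustrative.

    # Sketch of the two fixes: keep only the text following the marker phrases,
    # then strip custom stopwords before topic modeling (stopword list is illustrative).
    import re

    CUSTOM_STOPWORDS = {"respondent", "department", "director"}  # e.g. recurring names/boilerplate

    raw_text = "... FINDINGS OF FACT ... CONCLUSIONS OF LAW ... ORDER ..."  # placeholder MEA text

    def extract_relevant_section(text):
        """Keep the span from FINDINGS OF FACT up to the ORDER section (or end of text)."""
        match = re.search(r"FINDINGS OF FACT(.*?)(?:ORDER|$)", text, flags=re.S | re.I)
        return match.group(1) if match else text

    def remove_custom_stopwords(tokens):
        """Drop tokens that appear in nearly every document and add no signal."""
        return [t for t in tokens if t.lower() not in CUSTOM_STOPWORDS]

    cleaned = extract_relevant_section(raw_text)
    tokens = remove_custom_stopwords(cleaned.split())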

Keyword Marking

A possibly simpler alternative that the Data subteam devised involved identifying key phrases in the text and turning the often-formulaic language of legal documents to our advantage. For instance, since every MEA quotes the exact laws the entity violated, a dictionary of relevant regulations lets us search for MEAs that cite a specific law and tag the documents accordingly. A potential limitation, still to be explored, is that certain states may use the law as a means of explaining orders to cease activities without explicitly stating which subsections an entity violated.
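A minimal sketch of this kind of dictionary-based tagging is shown below; the statute citations and labels are illustrative examples rather than a complete mapping of any state's code.

    # Sketch of keyword marking with a dictionary of statutes (the citations and
    # labels are illustrative, not a complete mapping of any state's code).
    import re

    STATUTE_TAGS = {
        r"RCW\s*19\.146": "Mortgage Broker Practices Act (WA)",        # illustrative entry
        r"O\.C\.G\.A\.\s*§?\s*7-1-1000": "Georgia Residential Mortgage Act",  # illustrative entry
    }

    def tag_mea(text):
        """Return the statute tags whose citation pattern appears in the MEA text."""
        return [tag for pattern, tag in STATUTE_TAGS.items() if re.search(pattern, text)]

    print(tag_mea("...violated RCW 19.146.0201 by..."))   # -> ['Mortgage Broker Practices Act (WA)']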

Statutory Analysis

To understand the analyzed MEAs, the team reviewed cited statutes when the MEAs included that level of detail. Because some of the relevant laws have been revised since the period under analysis, the team relied on annotated and historical state codes from Lexis and Westlaw to assess the substantive law as it existed at the time of enforcement. The team then summarized and explained the cited law(s) in plain language for inclusion into each state report.

Individual State Methodology

Our analyses of the different states shared similar methodologies at many steps, since the states often shared similar data storage structures or metadata. The Data team therefore deemed it appropriate to provide a summary of the analysis for each state examined.

The states are listed as follows:

  • Alaska
  • Arizona
  • Arkansas
  • Georgia
  • Illinois
  • Massachusetts
  • Ohio
  • Oregon
  • Virginia
  • Washington

Alaska

Number of MEAs 72
Years Collected 2000-2010
Metadata Available Date, Entity Name, Entity Type, Enforcement Action Type
Methods of Extraction Python Requests, AWS Textract
Visualized With N.A.
NLP Platform N.A.

Arizona

Number of MEAs 86
Years Collected 2006-2010
Metadata Available Date, Entity Name, Entity Type, Enforcement Action Type
Methods of Extraction Python Requests, AWS Textract
Visualized With R – ggplot2
NLP Platform R – quanteda

Arkansas

Number of MEAs 86
Years Collected 2006-2010
Metadata Available Date, Entity Name, Entity Type, Enforcement Action Type
Methods of Extraction Python Requests, AWS Textract
Visualized With N.A.
NLP Platform N.A.

Massachusetts

Number of MEAs 353
Years Collected 2001-2010
Metadata Available Date, Entity Name, Entity Type, Enforcement Action Type
Methods of Extraction BeautifulSoup
Visualized With Kibana, Tableau
NLP Platform Python – Gensim, spaCy

Ohio

Number of MEAs 6905
Years Collected 2000-2010
Metadata Available Date, Entity Name, Entity Type, Enforcement Action Type
Methods of Extraction Selenium, AWS Textract
Visualized With Kibana, Tableau
NLP Platform Python – Gensim, spaCy

Virginia

Number of MEAs 886
Years Collected 2006-2010
Metadata Available Date, Entity Name, Enforcement Action Type
Methods of Extraction Selenium, AWS Textract
Visualized With R – ggplot2
NLP Platform R – tidyverse (Keyword Marking)

Note: In scraping PDFs with Selenium, Virginia’s web page loaded significantly more slowly and contained web elements that prevented clicking, so we worked around these issues with built-in sleep timers and percentage-based scroll markers.
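A minimal sketch of these workarounds, under the same assumptions as the Selenium example earlier in this report, is shown below (the URL, wait time, and scroll fraction are illustrative).

    # Minimal sketch of the Virginia workarounds: pause for the slower page and
    # scroll by a fraction of the page height so the target link becomes clickable.
    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.virginia.gov/mea-search")   # placeholder URL
    time.sleep(5)                                            # built-in sleep timer for slow loads
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight * 0.6);")  # percentage-based scroll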

Washington

Number of MEAs 556
Years Collected 2000-2010
Metadata Available Date, Entity Name, Entity Type, Enforcement Action Type
Methods of Extraction Python Requests, AWS Textract
Visualized With R – ggplot2
NLP Platform R – quanteda

Appendix

Appendix 1: Arizona

Figure 7 Model diagnostics to determine the appropriate number of topic models in Arizona


This sample plot from quanteda helps us determine the appropriate number of topics to use when sorting the documents by the relevant keywords. In all graphs we focused on the elbow points. Held-out likelihood represents the likelihood of the model when tested against a held-out set of documents. The lower bound represents the lower bound on the rate of convergence. Residuals measure the difference between the observed data and the model’s predictions. Semantic coherence measures how frequently the top words of a topic co-occur within documents. These diagnostics helped us decide that 10 was the appropriate number of topics for Arizona.

Appendix 2: Massachusetts

Figure 8: Model Diagnostic Plots for Massachusetts. These are the coherence plot (top) and pyLDAvis visualization (bottom)



For Massachusetts, we chose 16 topics due to the spike in the coherence score at 0.36. Although there were higher coherence scores for 10 topics or fewer, we found through pyLDAvis that with so few topics we could not classify different types of offenses effectively or in a way that enabled analysis. The final choice was a compromise between an effective number of topics and our coherence scores.

Appendix 3: Ohio

Figure 9 Model Diagnostic Plots for Ohio. These are the coherence plot (top) and pyLDAvis visualization (bottom)



For Ohio, we chose 17 topics due to the spike in the coherence score at 0.3475. Although there were higher coherence scores for 6 topics or fewer, we found through pyLDAvis that with so few topics we could not classify different types of offenses effectively or in a way that enabled analysis. The final choice was a compromise between an effective number of topics and our coherence scores.

Appendix 4: Virginia

For our analysis of the state of Virginia, we used keyword marking because the dataset contained a description of each infraction as well as the law related to the MEA. Our analysis of the language therefore relied on R’s tidyverse package rather than quanteda. As a result, we do not have model diagnostics, since our text analysis came directly from Virginia’s State Corporation Commission.

Appendix 5: Washington

Figure 10 Model diagnostics to determine the appropriate number of topic models in Washington


The graph above depicts the diagnostics used to choose the number of topics for Washington MEAs. The held-out likelihood increases as the number of topics increases, as does the lower bound. The residuals decrease and then increase slightly at 20 topics. Semantic coherence decreases from 5 to 10 topics, increases at 15 topics, and then decreases slightly. These diagnostics indicate that the optimal number of topics is around 10.


[1] Tokenization is the process of breaking raw text into smaller components, sometimes individual words, sometimes key phrases, and sometimes full sentences. These components are called tokens and are the key units of analysis in developing NLP models. Tokenization helps in interpreting the meaning of a text by allowing the sequence of words to be analyzed.

[2] Lemmatization refers to using a vocabulary and morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, which is known as the lemma.

[3] https://blog.koheiw.net/?p=468