Making State Data Repositories Friendlier for Researchers

by Eli Levine, Chavez Cheong, Ridge Ren, Emily Leung, Sam Wolter

Introduction: The Struggle with Acquiring State-Level Data

Policy analysis focuses on available data and realized outcomes. Having a large amount of publicly accessible data provided by government institutions at all levels enables researchers to analyze the effectiveness of different policies, and in general, accessing such data is easier now than ever. However, many challenges remain for researchers seeking access to state-level data. This memo reflects on our experience in acquiring and analyzing state mortgage enforcement actions and identifies issues, computational or logistical, that hinder attempts to compare similar data stored in different ways by different states.

The first issue is the wide variety of data storage and retrieval systems in place. Because data storage is not standardized, a document may sit behind a query-based search engine in one state and behind a list of hyperlinks on a web page in another. As a result, researchers often need a wide variety of techniques to acquire data. This work can be cumbersome and time-consuming, and it may require specialized skills, such as web automation, that research teams may not possess.

The second issue is that states handle data formatting and metadata differently. In many cases, researchers need not only access to records of public documents but also access to key metadata (such as the date of the documents or the names of the entities involved) stored outside the documents themselves, in comma-separated values (CSV) files or another format. Access to this metadata permits trend analysis and pattern recognition without the need for niche and often inaccurate technologies, like computer vision, to extract it. State policymakers would also benefit from improved data storage and formatting, as it would allow them to better assess policy efficacy and outcomes.

Issues with Data Acquisition

Differences in Access Policies for Data

In general, it was difficult to determine the policies different states had towards sharing data related to Mortgage Enforcement Actions (MEAs). Some states, like Massachusetts and Ohio, had clear and permissive policies on data access and provided a platform for users to view and download MEAs. Other states, such as Georgia, did not allow direct public access to MEA records and instead required a Freedom of Information Act (FOIA) request, after which they provided a large folder of unlabeled MEAs. This diversity in access policies made it difficult for our research team to obtain the documents and to determine what issues might arise in sharing them on the American Predatory Lending website.

Lack of Metadata

States generally stored their MEAs as PDFs, even when the PDFs were clearly printed from Microsoft Word documents or other text files. The downloaded PDF files had cryptic, non-descriptive names such as “00B37-377A.pdf.” In our case, to conduct research on patterns in MEAs, we needed the following metadata and supporting files:

  • date of document
  • entities involved
  • summary of subject matter
  • location
  • type of document
  • access to raw documents
  • processed text files for documents
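
As a minimal sketch of how the fields above could be published alongside the PDFs, the following Python defines one record per MEA and writes the records to CSV. The field names, the MEARecord class, and the example values are hypothetical and purely illustrative; they are not a schema that any state uses.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date


@dataclass
class MEARecord:
    """One mortgage enforcement action (hypothetical schema, for illustration)."""
    document_date: date
    entities: str        # names of the entities involved
    summary: str         # short summary of the subject matter
    location: str
    document_type: str
    pdf_path: str        # where the raw PDF is stored
    text_path: str       # where the processed text file is stored


def records_to_csv(records):
    """Serialize records to CSV text: one header row, then one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(MEARecord)])
    writer.writeheader()
    for record in records:
        row = asdict(record)
        # Store dates in ISO format so they sort and parse consistently.
        row["document_date"] = record.document_date.isoformat()
        writer.writerow(row)
    return buffer.getvalue()


# Illustrative values only; this is not a real enforcement action.
example = MEARecord(
    document_date=date(2007, 3, 15),
    entities="Example Mortgage Co.",
    summary="Consent order (illustrative only)",
    location="Ohio",
    document_type="consent order",
    pdf_path="raw/00B37-377A.pdf",
    text_path="text/00B37-377A.txt",
)
csv_text = records_to_csv([example])
```

A file like this, published next to the PDFs, would let researchers sort and filter documents without opening each one.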

To extract raw text from these PDFs, our team built an application on Amazon Web Services using Textract, Lambda, S3, SQS, and SNS, a relatively complex and costly setup (see our   for more details). This was required to obtain a higher degree of accuracy when converting PDFs to text than the alternative of using open-source platforms, like OpenCV, which have significantly higher inaccuracy rates and require more specialized knowledge.

Additionally, extracting metadata like location and entities involved often proved to be a slow and manual process that became untenable, especially when the number of documents exceeded 1,000.
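
To illustrate why this kind of extraction is fragile, here is a minimal sketch that pulls dates out of already-processed text with a regular expression. It assumes dates are written in the English "Month D, YYYY" style, which is only one of many styles a document may use; the find_dates helper and the sample sentence are hypothetical and are not the method our team used.

```python
import re
from datetime import datetime

# Assumption: dates look like "March 15, 2007". Other styles are not matched.
_MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
_DATE_PATTERN = re.compile(rf"\b({_MONTHS})\s+(\d{{1,2}}),\s+(\d{{4}})\b")


def find_dates(text):
    """Return every 'Month D, YYYY' date found in text, in order of appearance."""
    found = []
    for month, day, year in _DATE_PATTERN.findall(text):
        try:
            # %B parses full month names; this relies on an English locale.
            parsed = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
        except ValueError:
            continue  # skip impossible dates such as "February 30, 2007"
        found.append(parsed.date())
    return found


sample = "Effective March 15, 2007, and amended April 2, 2007."
dates = find_dates(sample)
```

Even this simple case leaves open which of several dates is the document's date, which is the kind of judgment that kept the process manual.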

Public-Focused Data Interfaces

Even for the sites that allowed us to directly download PDFs, the interfaces were geared towards one-time public use (e.g., a citizen accessing MEAs against a mortgage broker trying to sell them a mortgage), as can be seen in Figures 1 and 2.

Figure 1 MEA Access Portal for Massachusetts

Figure 2 MEA Access Portal for Ohio

Although serving individual citizens is important, it would have been helpful to have interfaces for researchers too, who may be looking to access a large amount of data. Ideally, we would have had an option to mass-download PDFs or data files for research. Because these interfaces offered no such option, our team had to adopt a wide variety of strategies to extract MEAs, including:

  • Using Selenium to automate the keying in of data and clicking of links to download PDFs
  • Using BeautifulSoup to scrape MEA data from sites
  • Using Python requests to download PDFs from links

Customizing these methods to each state was a time-consuming and error-prone process that could compromise data integrity.
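
The link-scraping and download steps listed above can be sketched roughly as follows. To keep the example free of third-party dependencies, it uses Python's standard-library html.parser and urllib in place of BeautifulSoup and requests. The page snippet, the example.gov address, and the helper names are hypothetical, and a real state site would need its own selectors and error handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PDFLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href> that points at a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return the PDF links found in one page of HTML."""
    parser = PDFLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


def download(url, destination):
    """Save one PDF to disk. This makes a network call and is not run here."""
    with urlopen(url, timeout=30) as response, open(destination, "wb") as out:
        out.write(response.read())


# Hypothetical page fragment standing in for a state portal's results list.
page = (
    '<ul><li><a href="/docs/00B37-377A.pdf">Order</a></li>'
    '<li><a href="/about">About</a></li></ul>'
)
links = extract_pdf_links(page, "https://example.gov/enforcement/")
```

Each state's markup differs, so the link test inside handle_starttag is the part that had to be rewritten for every site.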

Recommendations for States

As more states move their archival records online, they should consider adopting the following approaches to make their storage systems research-friendly. The Federal Reserve Bank of St. Louis serves as an excellent example of data management that states should use to model their storage and archival systems.

Application Programming Interface

One of the most convenient ways to give researchers access to large volumes of metadata, along with the raw documents, is an Application Programming Interface (API). APIs are easy to use and access, especially when well documented, and they are also much easier to maintain and update. The Federal Reserve Bank of St. Louis has an excellent API toolkit that allows researchers to access large amounts of data and documents.
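
As a rough sketch of what API access looks like from the researcher's side, the following Python builds a request to the St. Louis Fed's FRED series/observations endpoint and, in a separate function, fetches the results. The series ID and the placeholder API key are illustrative choices, the helper names are hypothetical, and a state MEA API would have its own endpoints and parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_observations_url(series_id, api_key, start=None):
    """Build the request URL for one data series, returned as JSON."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    if start:
        params["observation_start"] = start  # expected format: YYYY-MM-DD
    return f"{FRED_OBSERVATIONS_URL}?{urlencode(params)}"


def fetch_observations(series_id, api_key, start=None):
    """Fetch (date, value) pairs. Makes a network call and needs a real key."""
    url = build_observations_url(series_id, api_key, start)
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return [(obs["date"], obs["value"]) for obs in payload["observations"]]


# "YOUR_API_KEY" is a placeholder; FRED issues keys to registered users.
url = build_observations_url("MORTGAGE30US", "YOUR_API_KEY", start="2005-01-01")
```

One documented request replaces the per-site scraping logic that a portal without an API requires.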

Figure 3 St. Louis Federal Reserve API documentation page

Open Download Link

Alternatively, to meet the dual goals of letting citizens access individual documents and letting researchers mass-download data, states could provide a mass-download option with filters and metadata fields.
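
A bulk download becomes most useful when it comes with a metadata file that researchers can filter themselves. The sketch below assumes a hypothetical CSV manifest with document_date, location, and pdf_path columns and made-up rows; neither the layout nor the values come from any actual state or Federal Reserve file.

```python
import csv
import io
from datetime import date


def filter_manifest(csv_text, state=None, start=None, end=None):
    """Return manifest rows matching an optional state and inclusive date bounds."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        issued = date.fromisoformat(row["document_date"])
        if state and row["location"] != state:
            continue
        if start and issued < start:
            continue
        if end and issued > end:
            continue
        selected.append(row)
    return selected


# Made-up manifest rows, for illustration only.
manifest = (
    "document_date,location,pdf_path\n"
    "2006-05-01,Ohio,raw/a.pdf\n"
    "2008-11-20,Ohio,raw/b.pdf\n"
    "2008-02-10,Massachusetts,raw/c.pdf\n"
)
rows = filter_manifest(manifest, state="Ohio", start=date(2008, 1, 1))
```

With a manifest like this, a researcher can select the relevant documents locally instead of re-querying a public portal for every search.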

Figure 4 Federal Reserve Mass Enforcement Actions Download

Conclusion

How to create and maintain robust, “easy to use” digital databases is a complex question for government entities of all sizes, and it has no easy answers. However, by sharing our experiences working with data from a variety of state governments, stored in myriad ways, we hope to guide researchers contemplating similar projects and inform future solutions for data storage at the state level.