Image Credit: NPS.gov

Querying the NARA database

TL;DR

  • NARA is the U.S. national records archive with >100 terabytes of data and an open API
  • The data are deeply nested and a pain to navigate
  • I show how to extract and tidy the useful bits

What is the NARA API?

NARA is the U.S. National Archives and Records Administration. According to its website, the NARA collection includes:

“approximately 10 billion pages of textual records; 12 million maps, charts, and architectural and engineering drawings; 25 million still photographs and graphics; 24 million aerial photographs; 300,000 reels of motion picture film; 400,000 video and sound recordings; and 133 terabytes of electronic data.”

In other words, if you’re interested in big data, this is your kind of place! Traditionally, most “records” were physical objects stored in archives, but these days NARA is getting better and better about digitization and open access. I only know this because a certain somebody who does a lot of research in the archives said to me the other day “Hey, NARA says they have an API. What can we do with that?” 🤔

This person happens to be researching the internment of Japanese Americans during WWII. The main U.S. government agency involved in this regrettable activity was the War Relocation Authority.

Despite the size of the archive, googling failed to turn up any existing R package for interfacing with the NARA API. Looks like it’s time to roll up our sleeves and get our hands dirty…

Let’s query “war relocation authority” on the NARA API, get those data into R, and see what they have to say!

Set-up

library(tidyverse)
library(jsonlite)
library(listviewer)

Data Import

NARA has an open API, with the base URL https://catalog.archives.gov/api/v1. The GitHub documentation has a very helpful explanation of how to query the API, as well as a sandbox to test things out. Here, we are going to restrict our results to just “items”, which according to NARA’s data model are:

“The smallest intellectually indivisible archival unit (e.g. a letter, memorandum, report, leaflet, or photograph).”

If we didn’t, the search would also turn up loads of other result types, including bits of .gov websites, collections of items, etc. We specify our query terms with q=war%20relocation%20authority and filter by the “item” result type with resultTypes=item.

initial_search <- jsonlite::fromJSON("https://catalog.archives.gov/api/v1?q=war%20relocation%20authority&resultTypes=item", simplifyVector = FALSE)
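
(By the way, we don’t have to type those %20s by hand: base R’s URLencode() can build the query string for us. A small convenience sketch:)

# Build the query URL from its parts; URLencode() percent-encodes the spaces
base_url <- "https://catalog.archives.gov/api/v1"
query    <- URLencode("war relocation authority", reserved = TRUE)
paste0(base_url, "?q=", query, "&resultTypes=item")
## [1] "https://catalog.archives.gov/api/v1?q=war%20relocation%20authority&resultTypes=item"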

How are the data structured? Let’s look at the top 3 levels to get started.

str(initial_search, max.level = 3)
## List of 1
##  $ opaResponse:List of 4
##   ..$ @status: chr "200"
##   ..$ time   : chr "2018-08-13T21:10:54.600Z"
##   ..$ request:List of 5
##   .. ..$ format     : chr "json"
##   .. ..$ pretty     : logi TRUE
##   .. ..$ action     : chr "search"
##   .. ..$ q          : chr "war%20relocation%20authority"
##   .. ..$ resultTypes: chr "item"
##   ..$ results:List of 5
##   .. ..$ queryTime: int 448
##   .. ..$ total    : int 5479
##   .. ..$ offset   : int 0
##   .. ..$ rows     : int 10
##   .. ..$ result   :List of 10

We see that everything is wrapped in a single top-level element, $opaResponse. Not sure what the purpose of that extra layer is. 😕

Drilling down a bit, $opaResponse contains $request, which holds metadata about our query. $opaResponse also contains $results, which sounds more like what we’re interested in!

Notice that there are 5479 total results, but only 10 items in $results$result. This is because the default API query only returns the top 10 results. Now that we know how many results to expect, let’s specify that in a “real” query. In the API, we can indicate the maximum number of results to return (up to 10,000) with rows=. Note that this one will take much longer to download.

# Wait 5 minutes so we don't overload the API with requests
Sys.sleep(300)

# Set a timer
start_time <- Sys.time()

# Call the API
json_raw <- jsonlite::fromJSON("https://catalog.archives.gov/api/v1?q=war%20relocation%20authority&resultTypes=item&rows=5479", simplifyVector = FALSE)

# See how long it took
finish_time <- Sys.time()
finish_time - start_time
## Time difference of 28.07262 mins
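
(Side note: we got lucky that 5479 results fit under the 10,000-row cap. If a search ever exceeds it, the offset field we saw in the response suggests the API can page through results. Something like this untested sketch should work, assuming offset= behaves the way the response implies:)

# Hypothetical paging helper: fetch one page of "rows" results starting at "offset"
get_page <- function(offset, rows = 1000) {
  url <- paste0("https://catalog.archives.gov/api/v1",
                "?q=war%20relocation%20authority&resultTypes=item",
                "&rows=", rows, "&offset=", offset)
  jsonlite::fromJSON(url, simplifyVector = FALSE)$opaResponse$results$result
}

# e.g., fetch the first two pages of 1,000 results and stitch them together
# two_pages <- map(c(0, 1000), get_page) %>% flatten()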

From our glimpse into the structure of the initial results, we know that the data we’re interested in are nested down in $opaResponse$results$result. Let’s extract those as json_data.

json_data <- json_raw$opaResponse$results$result
nresults <- length(json_data)
nresults
## [1] 5479

OK, so how are the “real” results structured? There’s no way we’re checking all 5479 by hand, so let’s grab 10 randomly and have a look.

set.seed(3098) # for reproducibility
jsonedit(json_data[sample(1:nresults, 10)])

It looks like they have a more-or-less shared structure (6–7 items each), so we should be able to wrangle them into something more tidy.
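
(If you’d rather check than eyeball it, a quick tabulation of how many top-level elements each result has makes it easy to verify across all 5479:)

# Tally the number of top-level elements per result
map_int(json_data, length) %>% table()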

Data Tidying

The number of headaches this induces depends on how deeply nested the items are. We’ll start easy.

Level 1: Beginner

Within json_data, each result contains a few fields that aren’t nested lists. We can use these directly to start building our tidy dataframe. Fortunately, none of them contain null values, so we don’t have to deal with that headache. For details on this and other more advanced ways to build tidy dataframes from JSON lists, see Jenny Bryan’s excellent purrr tutorial.

simple_cols <- c("naId", "type", "score")

tidy_data <- json_data %>% 
  map_df(`[`, simple_cols)

tidy_data %>% head
## # A tibble: 6 x 3
##   naId    type         score
##   <chr>   <chr>        <dbl>
## 1 536874  description 0.0585
## 2 536878  description 0.0585
## 3 536259  description 0.0502
## 4 7820143 description 0.0489
## 5 536863  description 0.0433
## 6 536298  description 0.0433

Level 2: Intermediate

Most of the bits of data we are interested in are nested further down in the list though, so this approach won’t work for those. Instead, we will use purrr::pluck(), which as the name suggests “plucks” more deeply nested elements out of the list.

To see how this works, let’s start with a single element of the JSON data that we want to turn into a column in our tidy dataframe. Take, for example, the “scopeAndContentNote” field. This appears to contain photo captions, which could be quite interesting.

By examining the JSON data, we see “scopeAndContentNote” is nested within “item”, which is nested within “description”. We can use these as input into pluck to extract just this element. Let’s try that just for the first record.

scopeAndContentNote_location <- list("description", "item", "scopeAndContentNote")
pluck(json_data, 1, scopeAndContentNote_location)
## [1] "The full caption for this photograph reads: Manzanar Relocation Center, Manzanar, California. Pruning trees at this War Relocatin Authority center for evacuees of Japanese ancestry.\n"

Maybe we can map the pluck function over the whole list and thereby extract “scopeAndContentNote”?

map_chr(json_data, ~ pluck(., scopeAndContentNote_location))
## Error: Result 4 is not a length 1 atomic vector

Nope, this doesn’t work because this field is missing for the fourth item in the JSON data.

We can get around this by setting the .default argument of pluck to NA (thanks to Hadley Wickham for pointing this out!).

map_chr(json_data, ~ pluck(., scopeAndContentNote_location, 
                           .default = NA_character_)) %>% head
## [1] "The full caption for this photograph reads: Manzanar Relocation Center, Manzanar, California. Pruning trees at this War Relocatin Authority center for evacuees of Japanese ancestry.\n"          
## [2] "The full caption for this photograph reads: Manzanar Relocation Center, Manzanar, California. Lunchtime at Manzanar, a War Relocation Authority center for evacuees of Japanese ancestry.\n"      
## [3] "The full caption for this photograph reads: Poston, Arizona. Constructing quarters for evacuees of Japanese ancestry at War Relocation Authority center on Colorado River Indian Reservation.\n"  
## [4] NA                                                                                                                                                                                                 
## [5] "The full caption for this photograph reads: Manzanar Relocation Center, Manzanar, California. Mealtime at Manzanar, a War Relocation Authority center for evacuees of Japanese ancestry.\n"       
## [6] "The full caption for this photograph reads: Poston, Arizona. Barracks under construction at this War Relocation Authority center where evacuees of Japanese ancestry are spending the duration.\n"

There is still one potential problem to take care of before mapping pluck across the dataset: what if the element we try to pluck is itself a list with multiple elements? To get around this, let’s make a function to aggressively flatten lists and return a character vector of length one.

collapse_list <- function (x, collapse_char = "; ") {
  if (is.null(x)) {return (NA_character_)} # just in case
  if (is.list(x)) {
    unlist(x) %>%
      paste(sep = "_", collapse = collapse_char) -> x
  }
  x
}
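
A couple of toy examples to show what it does:

collapse_list(list("first", "second"))
## [1] "first; second"
collapse_list("leave me alone")
## [1] "leave me alone"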

Now we can use pluck in combination with collapse_list to extract this particular element from every result in json_data. Note the use of ~ to specify an anonymous function, with . to indicate the argument we’re mapping over.

scopeAndContentNote <- map_chr(json_data, ~ 
                               pluck(., scopeAndContentNote_location,
                                     .default = NA_character_) %>% 
                                 collapse_list) 
scopeAndContentNote %>% head
## [1] "The full caption for this photograph reads: Manzanar Relocation Center, Manzanar, California. Pruning trees at this War Relocatin Authority center for evacuees of Japanese ancestry.\n"          
## [2] "The full caption for this photograph reads: Manzanar Relocation Center, Manzanar, California. Lunchtime at Manzanar, a War Relocation Authority center for evacuees of Japanese ancestry.\n"      
## [3] "The full caption for this photograph reads: Poston, Arizona. Constructing quarters for evacuees of Japanese ancestry at War Relocation Authority center on Colorado River Indian Reservation.\n"  
## [4] NA                                                                                                                                                                                                 
## [5] "The full caption for this photograph reads: Manzanar Relocation Center, Manzanar, California. Mealtime at Manzanar, a War Relocation Authority center for evacuees of Japanese ancestry.\n"       
## [6] "The full caption for this photograph reads: Poston, Arizona. Barracks under construction at this War Relocation Authority center where evacuees of Japanese ancestry are spending the duration.\n"
scopeAndContentNote %>% length
## [1] 5479

Cool! But how do we scale up to the rest of the elements we want to pluck?

First of all, we need an easy way to find the locations of the elements we are interested in within this gaping black hole of a list. I manually typed in the (rather short) three-level path to “scopeAndContentNote”, but I’d rather not do that for everything.

Here is where RStudio really shines; this would be a much bigger pain without it. If you click on json_data in the “Environment” tab, or just type View(json_data) in the console, you’ll get a view of the nested list similar to that produced by jsonedit(). We can drill down through the nested list and identify the bits of data that matter. Once you find something that you want to keep, click the “copy” icon on the right side. This will copy the double-bracket code needed to grab that element to the console.
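
(If you’d rather stay in the console, calling names() at each level gets you the same information, though the viewer is much faster for deep structures:)

# Console alternative to the viewer: inspect the names at each level
names(json_data[[1]])
names(json_data[[1]][["description"]][["item"]])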


The copied double-bracket code is very handy, but unfortunately pluck needs the directions to the thing to pluck as a function or a list, not a bunch of bracketed strings. So, here is a fairly hacky function to convert a line of bracketed strings to a list.

brackets_to_list <- function (x) {
  # Convert a string like '[["a"]][["b"]]' into list("a", "b") by swapping
  # the inner brackets for a placeholder, stripping the rest, then splitting
  gsub("]][[", "_", x, fixed = TRUE) %>%
    gsub("[[", "", ., fixed = TRUE) %>%
    gsub("]]", "", ., fixed = TRUE) %>%
    gsub('\"', "", ., fixed = TRUE) %>%
    stringr::str_split("_") %>%
    purrr::flatten()
}

title_location <- '[["description"]][["item"]][["title"]]'

brackets_to_list(title_location)
## [[1]]
## [1] "description"
## 
## [[2]]
## [1] "item"
## 
## [[3]]
## [1] "title"
pluck(json_data, 1, brackets_to_list(title_location))
## [1] "Manzanar Relocation Center, Manzanar, California. Pruning trees at this War Relocatin Authority cen . . ."

We can now (finally) start scaling up by copying the “addresses” from RStudio’s list viewer into a list, then mapping brackets_to_list over it. Note that the “addresses” must be wrapped in single quotes, since they contain double quotes (I’m pretty sure there’s a tidyeval way around this but haven’t figured it out yet… comments welcome).

list(
  contributor = '[["description"]][["item"]][["personalContributorArray"]][["personalContributor"]][["contributor"]][["termName"]]',
  scopeAndContentNote = '[["description"]][["item"]][["scopeAndContentNote"]]',
  records_type = '[["description"]][["item"]][["generalRecordsTypeArray"]][["generalRecordsType"]][["termName"]]',
  title = '[["description"]][["item"]][["title"]] ',
  parent_series_title = '[["description"]][["item"]][["parentSeries"]][["title"]]',
  start_date = '[["description"]][["item"]][["parentSeries"]][["inclusiveDates"]][["inclusiveStartDate"]][["year"]]',
  end_date = '[["description"]][["item"]][["parentSeries"]][["inclusiveDates"]][["inclusiveEndDate"]][["year"]]',
  creator = '[["description"]][["item"]][["parentSeries"]][["creatingOrganizationArray"]][["creatingOrganization"]][["creator"]][["termName"]]',
  thumbnail = '[["objects"]][["object"]][["thumbnail"]][["@url"]]',
  record_group = '[["description"]][["item"]][["parentSeries"]][["parentRecordGroup"]][["recordGroupNumber"]]',
  archive = '[["description"]][["item"]][["physicalOccurrenceArray"]][["itemPhysicalOccurrence"]][["locationArray"]][["location"]][["facility"]][["termName"]]'
) %>%
map(brackets_to_list) -> lookup_terms

The next step is to map our anonymous plucking function (which works on one column) onto the full dataset (to go across all the columns). For this, we define a new function to pass to map_df. This nested mapping makes my head hurt, but it works. Note that the version here is restricted to character vectors, which works for this dataset, but may need to be changed for other cases. Again, Jenny Bryan’s post is the go-to.

make_df <- function (data, lookup_list, col_name) {
  sym_col_name <- sym(col_name)
  # Pluck this column's element from every result, collapse any lists,
  # then build a one-column tibble named after col_name
  map_chr(data, ~ pluck(., lookup_list[[col_name]], 
                        .default = NA_character_) %>% collapse_list) %>% 
    tibble %>% select(!!sym_col_name := .)
}
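
To see it in action on a single column:

make_df(json_data, lookup_terms, "contributor") %>% head()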

Use map_dfc to bind the results by column, not row.

other_cols <- map_dfc(names(lookup_terms), 
                      make_df, 
                      data = json_data, 
                      lookup_list = lookup_terms)
other_cols %>% head
## # A tibble: 6 x 11
##   contributor  scopeAndContentNote    records_type title parent_series_ti…
##   <chr>        <chr>                  <chr>        <chr> <chr>            
## 1 Albers, Clem "The full caption for… Photographs… <NA>  Central Photogra…
## 2 Albers, Clem "The full caption for… Photographs… <NA>  Central Photogra…
## 3 Albers, Clem "The full caption for… Photographs… <NA>  Central Photogra…
## 4 <NA>         <NA>                   Textual Rec… <NA>  Construction Com…
## 5 Albers, Clem "The full caption for… Photographs… <NA>  Central Photogra…
## 6 Clark, Fred… "The full caption for… Photographs… <NA>  Central Photogra…
## # ... with 6 more variables: start_date <chr>, end_date <chr>,
## #   creator <chr>, thumbnail <chr>, record_group <chr>, archive <chr>

Level 3: Facepalm

Great, we’re almost there. There’s one last tricky list element to tackle: user-contributed tags. These are tags that anybody can add to the database, either through the UI or the API, and they look quite interesting for data analysis, so I want to be sure to include them. Unfortunately they are… (you guessed it) a deeply nested list! We can find their location with my now tried-and-true copy-from-the-RStudio-viewer method.

tag_location <- brackets_to_list('[["publicContributions"]][["tags"]][["tag"]]')

Let’s extract these into their own list, to at least remove some layers of complexity.

tag_list <- map(json_data, ~ pluck(., tag_location, .default = NA_character_)) 
jsonedit(tag_list[1:10])

Two things are immediately obvious: some items have many tags, and some have none. For example, the first item has 17 tags. But it’s not that easy… the tags themselves (indicated by the name “$”) are contained within nested lists that also contain other metadata.
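
(We can put numbers on that. Recall that pluck’s .default left a lone NA wherever an item had no tags, so counting those is a quick sanity check:)

# e.g., the 17 tags on the first item
length(tag_list[[1]])
## [1] 17

# How many items have no tags at all? (the lone-NA placeholders from .default)
sum(map_lgl(tag_list, ~ all(is.na(.))))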

To extract just the tags, which are located one level down within each item, we use another mind-bending nested map. We then collapse down the resulting character vectors into a single-length vector each by mapping paste().

mult_tags <- map(tag_list, 
                 ~ map_chr(., ~pluck(., "$", .default = NA_character_) %>% collapse_list)) %>%
  map_chr(paste, collapse = ", ")
mult_tags %>% head
## [1] "World Ward II, WWII, War, Homefront, Internment Camp, Japanese American, Tree Climbing, Pruning, Man, Manzanar Relocation Center, Manzanar, California, War Relocation Authority, WRA, 1942, Clem Albers, Internees"                         
## [2] "Manzanar Relocation Center, Manzanar, California, Lunchtime, Meal, WRA, War Relocation Authority, Internment Camp, Japanese American, Civilian, WWII, World Ward II, War, Homefront, Clem Albers, 1942"                                      
## [3] "Japanese Internment Camp, World War II, WWII, War, Homefront, Construction, Colorado River Indian Reservation, 1942, Arizona, Poston, Parker, Barrack Construction, Clem Albers, WRA, War Relocation Authority, Poston War Relocation Center"
## [4] "Jerome County, Idaho, construction directives, Japanese, climate, housing, military police, barracks, hospital, sanitation, administration, utilities, costs, photographs, maps, jwric-ts1"                                                  
## [5] "NA"                                                                                                                                                                                                                                          
## [6] "NA"

I called these “mult_tags” because each contains more than one tag. It turns out items with only one tag lack a level of nesting. We want these too, but don’t need the nested map for them.

single_tags <- map_chr(tag_list, ~pluck(., "$", .default = NA_character_) %>% collapse_list)
single_tags[!is.na(single_tags)] %>% head
## [1] "water tower"       "Japanese"          "construction"     
## [4] "cots"              "ArchivesGameNight" "MRC-tg1"

Now, we can finally combine both sets of tags into the final vector.

tags <- coalesce(single_tags, mult_tags)
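
(coalesce() takes the first non-missing value, element-wise:)

coalesce(c(NA, "a"), c("b", "c"))
## [1] "b" "a"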

That paste(collapse) step left a bunch of empty "" elements in our vector, though. Let’s convert those to proper NA.

tags <- na_if(tags, "")

Now we can combine these with the “simple” columns into our tidy dataframe.

tidy_data <- bind_cols(tidy_data, other_cols, tags = tags)

# Save our data so we can use it for analysis in the next blog post
readr::write_csv(tidy_data, "nara_data.csv")

tidy_data %>% head
## # A tibble: 6 x 15
##   naId   type    score contributor scopeAndContentNote  records_type title
##   <chr>  <chr>   <dbl> <chr>       <chr>                <chr>        <chr>
## 1 536874 descr… 0.0585 Albers, Cl… "The full caption f… Photographs… <NA> 
## 2 536878 descr… 0.0585 Albers, Cl… "The full caption f… Photographs… <NA> 
## 3 536259 descr… 0.0502 Albers, Cl… "The full caption f… Photographs… <NA> 
## 4 78201… descr… 0.0489 <NA>        <NA>                 Textual Rec… <NA> 
## 5 536863 descr… 0.0433 Albers, Cl… "The full caption f… Photographs… <NA> 
## 6 536298 descr… 0.0433 Clark, Fre… "The full caption f… Photographs… <NA> 
## # ... with 8 more variables: parent_series_title <chr>, start_date <chr>,
## #   end_date <chr>, creator <chr>, thumbnail <chr>, record_group <chr>,
## #   archive <chr>, tags <chr>

Phew!😓 Finally all tidy.

Inspecting the Data

Let’s get an overview of some of the data.

library(summarytools)
summarytools::st_options("footnote", NA)

tidy_data %>%
  select(archive, records_type, contributor, record_group, tags) %>%
  summarytools::dfSummary(style = "grid", plain.ascii = FALSE) %>%
  print(method = "render", omit.headings = TRUE)
1. archive [character]: valid 5455 (99.56%), missing 24 (0.44%)
   Top values: National Archives at Coll… 3982 (73.0%); Electronic Records Archiv… 1417 (26.0%); NARA’s Rocky Mountain Reg… 22 (0.4%); National Archives Buildin… 12 (0.2%); Harry S. Truman Library (… 8 (0.1%); Ronald Reagan Library (Si… 4 (0.1%); Franklin D. Roosevelt Lib… 2 (0.0%); NARA’s Pacific Region (Ri… 2 (0.0%); NARA’s Southeast Region (… 2 (0.0%); Dwight D. Eisenhower Libr… 1 (0.0%); [3 others] 3 (0.1%)

2. records_type [character]: valid 5457 (99.6%), missing 22 (0.4%)
   Values: Photographs and other Gra… 3965 (72.7%); Textual Records 1492 (27.3%)

3. contributor [character]: valid 3750 (68.44%), missing 1729 (31.56%)
   Top values: Albers, Clem 381 (10.2%); Clark, Fred, photographer 154 (4.1%); Cook, John 26 (0.7%); Iwasaki, Hikaru, 1923- 291 (7.8%); Lange, Dorothea, 1895-196… 872 (23.2%); Lynn, Charles, War Reloca… 18 (0.5%); Mace, Charles E., photogr… 93 (2.5%); McClelland, Joe, War Relo… 188 (5.0%); Parker, Tom 707 (18.9%); Stewart, Francis, War Rel… 1002 (26.7%); Van Tassel, Gretchen 18 (0.5%)

4. record_group [character]: valid 3978 (72.6%), missing 1501 (27.4%)
   Values: 127: 6 (0.1%); 142: 1 (0.0%); 148: 2 (0.1%); 210: 3965 (99.7%); 407: 1 (0.0%); 46: 1 (0.0%); 549: 1 (0.0%); 77: 1 (0.0%)

5. tags [character]: valid 5479 (100%), missing 0 (0%)
   Top values: NA 5147 (93.9%); MRC-tg1 31 (0.6%); tghr-tg1 22 (0.4%); Departure 8 (0.1%); RoadTripDenver 8 (0.1%); jwric-ts1, cart18-rmdv-ts… 6 (0.1%); jwric-ts1 5 (0.1%); Japanese Internment 4 (0.1%); strawberries 3 (0.1%); american flag, pledge of … 2 (0.0%); [233 others] 243 (4.9%)

Hmmm… interesting! Even this broad overview provides a fascinating peek into the dataset. The “items” are mostly photographs (72.7%), with the remaining 27.3% textual records. Most of the photographs were taken by just a handful of people, Dorothea Lange being one of the more prolific.

Whew, just tidying the JSON data took up the whole blog post. Hopefully I can get to the data analysis in the next one.

Updated 2018-08-14; source code
