Text Manipulation with Stringr

Written on August 10, 2022

Clean, structured data is a great thing for any data scientist. Unfortunately, data rarely arrives that way. In this post, we’ll take a look at cleaning and manipulating text data using the stringr package.

Data

For this post, we’ll use a dataset of player names from the 2021-22 NBA season. The following code pulls these names from basketball-reference (specifically the total stats table) and places them in a tibble.

library(rvest)
library(tidyverse)

# Pulling all player names for 2022 season
player_names <- read_html("https://www.basketball-reference.com/leagues/NBA_2022_totals.html") %>%
  html_node("#totals_stats") %>%
  html_table() %>%
  rename_all(.funs = ~tolower(.)) %>%
  distinct(player) %>%
  filter(player != "Player")

Before we get to anything too complex, we’ll get a taste of our first stringr function: str_to_lower. This will put all of our names in lower case. It’ll be easier to analyze our dataset with everything in the same case, so we can do it now before we get started!

player_names <- player_names %>%
  mutate(player = str_to_lower(player))

Package Background

There are a lot of different packages and base functions that can be used for text manipulation across R. stringr is a package that falls into the tidyverse collection of packages and will be loaded automatically if you call library(tidyverse).

The benefit of using stringr is that it creates a common syntax and grammar for string manipulation. Each of the functions starts with str_ and takes a string vector as its first input.

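It should be noted that when we’re talking about strings, we’re specifically referring to character data. Factor inputs are generally coerced to character first, so the functions return character vectors rather than factors; converting with as.character() beforehand makes that explicit.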
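As a quick illustration of that shared grammar, here is a minimal sketch on a small made-up vector (the names in x are just for illustration, not pulled from the dataset). Each call starts with str_ and takes the string vector first.

```r
library(stringr)

# A small made-up vector, used only for illustration
x <- c("Jaren Jackson Jr.", "Steven Adams")

# Every function starts with str_ and takes the string vector as its first input
lower_x  <- str_to_lower(x)
length_x <- str_length(x)

# Factors can be converted explicitly before being passed in
f <- factor(x)
length_f <- str_length(as.character(f))

lower_x
length_x
```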

Package Use Cases

There are a lot of string manipulation ideas captured by the stringr package, all of which can be broken out into a few general categories:

  • Matching
  • Subsetting
  • Length management
  • Mutation
  • Joining and splitting

We’ll take a look at a couple of useful functions that fall under each of these categories in this section.

The stringr cheat sheet covers these various aspects of the package, and is helpful to look back on and jog your memory.

Matching

We’ll start with the topic of string matching. Matching functions are useful for finding whether a given pattern is present in a string. Each function takes a pattern input that we’re looking to find within a string input. Let’s say that we wanted to find all players who had jr. in their name (names so nice, they had to be used twice). We could use these functions to match that jr. pattern to our player name vector.

Let’s touch on some of these more important matching functions:

  • str_detect looks through our string input and returns TRUE for any string that includes the pattern and FALSE for those that don’t. Here we can see that 22 players in our vector have jr. in their name
player_names %>%
  # The dot is escaped so that it matches a literal period, not any character
  mutate(has_jr = str_detect(string = player, pattern = " jr\\.")) %>%
  # We'll summarize the output to get True / False counts
  group_by(has_jr) %>%
  count()
## # A tibble: 2 × 2
## # Groups:   has_jr [2]
##   has_jr     n
##   <lgl>  <int>
## 1 FALSE    583
## 2 TRUE      22
  • str_which looks through our string input and returns the indices of any elements in our vector that include the pattern
str_which(string = player_names$player, pattern = " jr\\.")
##  [1]  60  73  74  79  94  96 152 217 252 270 293 330 354 404 405 432 451 452 453
## [20] 510 547 568
  • str_locate finds the start and end position points of the pattern in any string input that includes it. Here we use str_detect to filter for players in our vector who have jr. in their name prior to str_locate (that way, we won’t get a bunch of NA results for players without jr. in their name)
player_names %>%
  # Filtering for players with jr. in their name
  filter(str_detect(string = player, pattern = " jr\\.")) %>%
  pull(player) %>%
  str_locate(pattern = " jr\\.")
##       start end
##  [1,]    15  18
##  [2,]    14  17
##  [3,]    15  18
##  [4,]    11  14
##  [5,]    13  16
##  [6,]    15  18
##  [7,]    11  14
##  [8,]    13  16
##  [9,]    13  16
## [10,]    14  17
## [11,]    14  17
## [12,]    11  14
## [13,]    14  17
## [14,]    12  15
## [15,]    12  15
## [16,]    12  15
## [17,]    13  16
## [18,]    15  18
## [19,]    12  15
## [20,]    13  16
## [21,]    11  14
## [22,]    17  20
  • str_count counts the number of times we see our pattern in each string input. Here we can see that 583 players don’t have any instance of the pattern and 22 have a single instance
player_names %>%
  mutate(jr_count = str_count(string = player, pattern = " jr\\.")) %>%
  # We'll summarize the output to see how many players have each count
  group_by(jr_count) %>%
  count()
## # A tibble: 2 × 2
## # Groups:   jr_count [2]
##   jr_count     n
##      <int> <int>
## 1        0   583
## 2        1    22
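One detail worth keeping in mind with all of the matching functions: stringr treats the pattern as a regular expression by default, and in a regular expression an unescaped . matches any character. The sketch below uses two made-up names (the second is invented purely to trigger a false match) to show the difference between an unescaped dot, an escaped dot, and the fixed() modifier.

```r
library(stringr)

# Made-up names for illustration; the second one is not a "jr." at all
x <- c("tim hardaway jr.", "ann jrue")

# As a regular expression, "." matches any character,
# so " jr." also matches the " jru" in the second name
loose <- str_detect(x, pattern = " jr.")

# Escaping the dot matches a literal period only
escaped <- str_detect(x, pattern = " jr\\.")

# fixed() compares the pattern as plain text, with no regex interpretation
literal <- str_detect(x, pattern = fixed(" jr."))

loose    # TRUE  TRUE
escaped  # TRUE FALSE
literal  # TRUE FALSE
```

For the NBA names used here the loose and strict patterns happen to return the same players, but the stricter forms are the safer habit.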

Subsetting

Subsetting functions are either focused on filtering for a subset of our input vector or filtering for a subset of the actual text of our input vector.

Some major subsetting functions include:

  • str_sub takes start and end positional arguments and returns any text within those start and end points. This could, for example, be used to find the first x characters of each input string. It also accepts negative positions, which count backward from the end of the string (-1 is the last character)
# Finding the first three characters of the first 10 names
player_names %>%
  .[1:10, ] %>%
  mutate(first_3 = str_sub(player, start = 1, end = 3))
## # A tibble: 10 × 2
##    player                   first_3
##    <chr>                    <chr>  
##  1 precious achiuwa         pre    
##  2 steven adams             ste    
##  3 bam adebayo              bam    
##  4 santi aldama             san    
##  5 lamarcus aldridge        lam    
##  6 nickeil alexander-walker nic    
##  7 grayson allen            gra    
##  8 jarrett allen            jar    
##  9 jose alvarado            jos    
## 10 justin anderson          jus
# Finding the last three characters of the first 10 names
player_names %>%
  .[1:10, ] %>%
  mutate(last_3 = str_sub(player, start = -3, end = -1))
## # A tibble: 10 × 2
##    player                   last_3
##    <chr>                    <chr> 
##  1 precious achiuwa         uwa   
##  2 steven adams             ams   
##  3 bam adebayo              ayo   
##  4 santi aldama             ama   
##  5 lamarcus aldridge        dge   
##  6 nickeil alexander-walker ker   
##  7 grayson allen            len   
##  8 jarrett allen            len   
##  9 jose alvarado            ado   
## 10 justin anderson          son
  • str_subset filters our string input for a set pattern. This could probably fall into the matching category, but I included it here because it returns the text with the pattern rather than just identifying if there is a match
# Finding any string that includes our jr. pattern
str_subset(string = player_names$player, pattern = " jr\\.")
##  [1] "brandon boston jr."   "charlie brown jr."    "chaundee brown jr."  
##  [4] "troy brown jr."       "vernon carey jr."     "wendell carter jr."  
##  [7] "david duke jr."       "tim hardaway jr."     "danuel house jr."    
## [10] "jaren jackson jr."    "derrick jones jr."    "kira lewis jr."      
## [13] "kenyon martin jr."    "larry nance jr."      "rj nembhard jr."     
## [16] "kelly oubre jr."      "kevin porter jr."     "michael porter jr."  
## [19] "otto porter jr."      "dennis smith jr."     "gary trent jr."      
## [22] "duane washington jr."

Length Management

Length management functions are fairly straightforward. They look at and manipulate how long our strings are.

Some useful ones include:

  • str_length gives the length of each string. This count includes whitespace and punctuation, so it’s not a count of letters alone
# Finding the top 10 longest names
player_names %>%
  mutate(length = str_length(string = player)) %>%
  top_n(10, length) %>%
  arrange(desc(length))
## # A tibble: 10 × 2
##    player                   length
##    <chr>                     <int>
##  1 nickeil alexander-walker     24
##  2 kentavious caldwell-pope     24
##  3 shai gilgeous-alexander      23
##  4 timothé luwawu-cabarrot      23
##  5 thanasis antetokounmpo       22
##  6 jeremiah robinson-earl       22
##  7 quinndary weatherspoon       22
##  8 giannis antetokounmpo        21
##  9 sandro mamukelashvili        21
## 10 juan toscano-anderson        21
  • str_trim can be used to trim any trailing or leading whitespace while str_pad can be used to add characters so that strings are a constant width. We’ll look at these two combined, adding whitespace with str_pad and removing it with str_trim
player_names %>%
  .[1:10, ] %>%
  # Adding trailing whitespace until a set width of 20
  mutate(added_ws = str_pad(string = player, side = "right",
                            pad = " ", width = 20)) %>%
  # Removing added whitespace (trimming the padded column, not the original)
  mutate(removed_ws = str_trim(string = added_ws, side = "right")) %>%
  group_by(player) %>%
  # Counting string length for all players
  summarize(across(.cols = c(added_ws, removed_ws), .fns = ~ str_length(.)))
## # A tibble: 10 × 3
##    player                   added_ws removed_ws
##    <chr>                       <int>      <int>
##  1 bam adebayo                    20         11
##  2 grayson allen                  20         13
##  3 jarrett allen                  20         13
##  4 jose alvarado                  20         13
##  5 justin anderson                20         15
##  6 lamarcus aldridge              20         17
##  7 nickeil alexander-walker       24         24
##  8 precious achiuwa               20         16
##  9 santi aldama                   20         12
## 10 steven adams                   20         12
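The same round trip can be checked on a couple of names directly. This is a small sketch using two names from the table above; it also shows why nickeil alexander-walker stays at 24 characters: str_pad only ever adds characters, so a string that is already longer than width is left alone.

```r
library(stringr)

x <- c("bam adebayo", "nickeil alexander-walker")

# Pad on the right to a width of 20; longer strings are left untouched
padded <- str_pad(x, width = 20, side = "right", pad = " ")

# Trim the trailing whitespace that was just added
trimmed <- str_trim(padded, side = "right")

str_length(padded)     # 20 24
identical(trimmed, x)  # TRUE
```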

Mutation

This section focuses on changing the underlying characteristics of a string.

Some important functions are:

  • str_to_lower which we saw at the start of this post puts all characters in lower case while str_to_upper puts them in upper case
# Putting first 10 players' names into upper case
player_names %>%
  .[1:10, ] %>%
  mutate(player_upper = str_to_upper(string = player))
## # A tibble: 10 × 2
##    player                   player_upper            
##    <chr>                    <chr>                   
##  1 precious achiuwa         PRECIOUS ACHIUWA        
##  2 steven adams             STEVEN ADAMS            
##  3 bam adebayo              BAM ADEBAYO             
##  4 santi aldama             SANTI ALDAMA            
##  5 lamarcus aldridge        LAMARCUS ALDRIDGE       
##  6 nickeil alexander-walker NICKEIL ALEXANDER-WALKER
##  7 grayson allen            GRAYSON ALLEN           
##  8 jarrett allen            JARRETT ALLEN           
##  9 jose alvarado            JOSE ALVARADO           
## 10 justin anderson          JUSTIN ANDERSON
  • str_replace takes a specified pattern and replaces the first instance of it in the string input with a replacement string. str_replace_all does the same, but replaces all instances of the specified pattern rather than just the first instance
# Replacing the first instance of jr. with ii
player_names %>%
  # Filtering for players with jr. in name
  filter(str_detect(string = player, pattern = " jr\\.")) %>%
  mutate(no_more_jr = str_replace(string = player, pattern = " jr\\.",
                                  replacement = " ii"))
## # A tibble: 22 × 2
##    player             no_more_jr       
##    <chr>              <chr>            
##  1 brandon boston jr. brandon boston ii
##  2 charlie brown jr.  charlie brown ii 
##  3 chaundee brown jr. chaundee brown ii
##  4 troy brown jr.     troy brown ii    
##  5 vernon carey jr.   vernon carey ii  
##  6 wendell carter jr. wendell carter ii
##  7 david duke jr.     david duke ii    
##  8 tim hardaway jr.   tim hardaway ii  
##  9 danuel house jr.   danuel house ii  
## 10 jaren jackson jr.  jaren jackson ii 
## # … with 12 more rows
## # ℹ Use `print(n = ...)` to see more rows
# Removing all whitespace in names field
player_names %>%
  .[1:10, ] %>%
  mutate(no_ws = str_replace_all(string = player, pattern = " ",
                                 replacement = ""))
## # A tibble: 10 × 2
##    player                   no_ws                  
##    <chr>                    <chr>                  
##  1 precious achiuwa         preciousachiuwa        
##  2 steven adams             stevenadams            
##  3 bam adebayo              bamadebayo             
##  4 santi aldama             santialdama            
##  5 lamarcus aldridge        lamarcusaldridge       
##  6 nickeil alexander-walker nickeilalexander-walker
##  7 grayson allen            graysonallen           
##  8 jarrett allen            jarrettallen           
##  9 jose alvarado            josealvarado           
## 10 justin anderson          justinanderson
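Since each name contains at most one jr., the example above doesn’t really show how str_replace and str_replace_all differ. A minimal sketch on a single name with a repeated letter makes the contrast visible (capitalizing the letter e is just an arbitrary choice for illustration).

```r
library(stringr)

x <- "nickeil alexander-walker"

# Only the first "e" is replaced
first_only <- str_replace(x, pattern = "e", replacement = "E")

# Every "e" is replaced
every_one <- str_replace_all(x, pattern = "e", replacement = "E")

first_only  # "nickEil alexander-walker"
every_one   # "nickEil alExandEr-walkEr"
```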

Joining and splitting

The last set of functions covers joining strings together or splitting them apart.

Some common functions include:

  • str_c can be used to combine two or more string vectors element by element. The sep argument specifies the separator placed between the vectors when they are combined. There is also a collapse argument that, if set to a non-null value, will combine all vector elements into one string with its input used as a separator
# Adding a test string to the first 10 players' names
str_c(player_names$player[1:10], "test", sep = " ")
##  [1] "precious achiuwa test"         "steven adams test"            
##  [3] "bam adebayo test"              "santi aldama test"            
##  [5] "lamarcus aldridge test"        "nickeil alexander-walker test"
##  [7] "grayson allen test"            "jarrett allen test"           
##  [9] "jose alvarado test"            "justin anderson test"
# Combining the first 10 players into one string
str_c(player_names$player[1:10], collapse = ", ")
## [1] "precious achiuwa, steven adams, bam adebayo, santi aldama, lamarcus aldridge, nickeil alexander-walker, grayson allen, jarrett allen, jose alvarado, justin anderson"
  • str_split_fixed can be used to separate a string into a set number of parts based on a specific pattern. The result is a matrix where each column represents the specified splits. We can use str_split to return a list output with no set amount of breaks
# We'll split on the first white space to separate names into first and last names
str_split_fixed(string = player_names$player[1:10], pattern = " ",
                n = 2)
##       [,1]       [,2]              
##  [1,] "precious" "achiuwa"         
##  [2,] "steven"   "adams"           
##  [3,] "bam"      "adebayo"         
##  [4,] "santi"    "aldama"          
##  [5,] "lamarcus" "aldridge"        
##  [6,] "nickeil"  "alexander-walker"
##  [7,] "grayson"  "allen"           
##  [8,] "jarrett"  "allen"           
##  [9,] "jose"     "alvarado"        
## [10,] "justin"   "anderson"
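To make the difference between the two splitting functions concrete, here is a small sketch on two names, one with a suffix and one without. str_split returns a list whose elements can have different lengths, while str_split_fixed always returns a matrix with n columns and pads missing pieces with empty strings.

```r
library(stringr)

x <- c("steven adams", "jaren jackson jr.")

# A list with one character vector per name; lengths can differ
parts <- str_split(x, pattern = " ")
lengths(parts)  # 2 3

# A matrix with exactly 3 columns; missing pieces become ""
fixed_parts <- str_split_fixed(x, pattern = " ", n = 3)
fixed_parts[, 3]  # "" "jr."
```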

Applying Our Knowledge

So far, this post has been pretty boring… I’ve just shown some of the functions I find most useful from the stringr package. Let’s spice things up a bit with an actual problem (or at least make it as spicy as string manipulation will allow).

We’ll start by looking at anagrams! What is an anagram, you ask? It’s a word or phrase that we can make by rearranging the letters of another word or phrase. For example, silent is an anagram of listen.
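One simple way to test whether two words are anagrams is to sort the letters of each and compare the results. The sketch below does that with a small helper, sorted_chars, which is a hypothetical name introduced here for illustration only.

```r
library(stringr)

# Hypothetical helper: the characters of a word, sorted and pasted back together
sorted_chars <- function(word) {
  str_c(sort(str_split(word, pattern = "")[[1]]), collapse = "")
}

sorted_chars("silent")  # "eilnst"
sorted_chars("listen")  # "eilnst"

# Two words are anagrams when their sorted characters match
sorted_chars("silent") == sorted_chars("listen")  # TRUE
```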

Now you might be wondering, what does this have to do with data science? Not much, but it’s actually a common interview question. In fact, I’ve experienced this question before, and though I can wax poetic about how AWFUL I think this is, it did inspire me to at least focus a bit more on string manipulation and share this blog post.

So, let’s start with the problem: I want to find if the word NBA is an anagram for any 3 character combination of our players’ names… I can already sense the eyes rolling into the back of your head, but don’t leave yet! You could be asked this on your next interview (even if you aren’t applying for a job in big anagram)!!!

We want to know if at any point in the player’s name, the word NBA appears, but remember: NBA does not have to appear in that set order.

We can start by combining our player names into one continuous set of characters. One wrench that I will throw in is that we want to exclude anything outside of a first or last name. As we can see, there are a handful of players who have generational suffixes after their names. We will want to ignore these.

str_split_fixed(player_names$player, " ", n = 3) %>%
  # Naming the columns explicitly, since the split matrix has no column names
  as_tibble(.name_repair = ~ c("V1", "V2", "V3")) %>%
  filter(V3 != "") %>%
  group_by(V3) %>%
  count(sort = TRUE)
## # A tibble: 5 × 2
## # Groups:   V3 [5]
##   V3        n
##   <chr> <int>
## 1 jr.      22
## 2 iii       5
## 3 ii        2
## 4 iv        2
## 5 sr.       1

Let’s manipulate our player names so we can more easily analyze them. The next code chunk does the following:

  • Splits player names into three sections
    • First name, last name, and any following suffixes
  • Combines first and last names into one field with no spaces
  • Removes any hyphens and apostrophes from the first and last name field

The output is stored in an object called anagram_df.

# Splitting names into 3 sections around blank spaces
anagram_df <- str_split_fixed(player_names$player, " ", n = 3) %>%
  # Naming the columns explicitly, since the split matrix has no column names
  as_tibble(.name_repair = ~ c("V1", "V2", "V3")) %>%
  # Combining the first two splits while excluding the third split
  mutate(first_and_last = str_c(V1, V2, sep = "")) %>%
  # Removing any hyphens or apostrophes from the new field
  mutate(first_and_last = str_replace_all(string = first_and_last,
                                          pattern = "-|'",
                                          replacement = "")) %>%
  bind_cols(player = player_names$player) %>%
  select(player, first_and_last)

Now we have all of our players’ first and last names combined into a string with no additional punctuation or white space.

From here, we’ll create a function that will take our player names and search through them to see if some input can act as an anagram for any section of their name.

The function, anagram_search, takes an input string and vector of player names and does the following:

  • Finds the length of the input string and player name
  • Breaks down player name by length of the input string
  • At each break, sorts the characters of the break and compares them to the sorted characters of the input string
    • Order of characters doesn’t matter when it comes to anagrams
  • If the sorted characters of any break are equal to the sorted characters of the input string, a result of TRUE is returned
anagram_search <- function(input_string = "nba", player_name){
  # Length and sorted characters of the input only need to be computed once
  input_length <- str_length(input_string)
  input_sorted <- str_c(sort(str_split(input_string, pattern = "")[[1]]), collapse = "")

  # Going to loop through each input name and perform the search
  result_vector <- logical(length(player_name))
  for (i in seq_along(player_name)) {
    # Number of full-length subsets in this name (zero if the name is too short)
    n_subsets <- max(str_length(player_name[i]) - input_length + 1, 0)

    # Looping through each subset of the player name and seeing if any are an anagram for the input
    subset_result <- logical(n_subsets)
    for (j in seq_len(n_subsets)) {
      name_subset <- str_sub(player_name[i], start = j, end = j + (input_length - 1))

      # Sorting the subset's characters so that their order no longer matters
      name_sorted <- str_c(sort(str_split(name_subset, pattern = "")[[1]]), collapse = "")

      # Seeing if sorted subset and sorted input are equal
      subset_result[j] <- name_sorted == input_sorted
    }

    result_vector[i] <- any(subset_result)
  }

  result_vector
}

With our function defined, let’s run it and see which players it returns as TRUE.

anagram_df %>%
  mutate(is_anagram = anagram_search(player_name = first_and_last)) %>%
  filter(is_anagram)
## # A tibble: 10 × 3
##    player            first_and_last   is_anagram
##    <chr>             <chr>            <lgl>     
##  1 marvin bagley iii marvinbagley     TRUE      
##  2 desmond bane      desmondbane      TRUE      
##  3 dalano banton     dalanobanton     TRUE      
##  4 harrison barnes   harrisonbarnes   TRUE      
##  5 jordan bell       jordanbell       TRUE      
##  6 bogdan bogdanović bogdanbogdanović TRUE      
##  7 bojan bogdanović  bojanbogdanović  TRUE      
##  8 drew eubanks      dreweubanks      TRUE      
##  9 boban marjanović  bobanmarjanović  TRUE      
## 10 yuta watanabe     yutawatanabe     TRUE

In total, only 10 players in the 2021-22 NBA season had some section of their name that is an anagram of NBA.
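To see why a name like desmondbane gets flagged, the sliding-window check can be traced for a single name. The sketch below is an alternative take rather than the function used above: it leans on str_sub being vectorised over start and end to pull out every three-character window at once, and sort_chars is a hypothetical helper name introduced only for this illustration.

```r
library(stringr)

name   <- "desmondbane"
target <- "nba"
k      <- str_length(target)

# Every start position that leaves room for a full window of k characters
starts <- seq_len(str_length(name) - k + 1)

# str_sub is vectorised over start and end, so this returns all windows at once
windows <- str_sub(name, start = starts, end = starts + k - 1)

# Hypothetical helper: sort the characters of a single string
sort_chars <- function(s) str_c(sort(str_split(s, pattern = "")[[1]]), collapse = "")

# Sort each window and keep the ones that match the sorted target
windows_sorted <- vapply(windows, sort_chars, character(1), USE.NAMES = FALSE)
matches <- windows[windows_sorted == sort_chars(target)]

matches  # "ban"
```

Here the only matching window is ban, which rearranges to nba.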

Although this example might have been a bit silly, it does put some of these stringr functions to good use. Now go off, and make all strings bow before your newfound manipulation capabilities!