Text Manipulation with Stringr
Having clean, structured data is a great thing for any data scientist. Unfortunately, that is almost never the case. In this post, we’ll take a look at cleaning and manipulating text data using the stringr package.
Data
For this post, we’ll use a dataset of player names from the 2021-22 NBA season. The following code pulls these names from basketball-reference (specifically the total stats table) and places them in a tibble.
library(rvest)
library(tidyverse)
# Pulling all player names for 2022 season
player_names <- read_html("https://www.basketball-reference.com/leagues/NBA_2022_totals.html") %>%
  html_node("#totals_stats") %>%
  html_table() %>%
  rename_all(.funs = ~tolower(.)) %>%
  distinct(player) %>%
  filter(player != "Player")
Before we get to anything too complex, we’ll get a taste of our first stringr function: str_to_lower. This will put all of our names in lower case. It’ll be easier to analyze our dataset with everything in the same case, so we can do it now before we get started!
player_names <- player_names %>%
  mutate(player = str_to_lower(player))
Package Background
There are a lot of different packages and base functions that can be used for text manipulation in R. stringr is part of the tidyverse collection of packages and is loaded automatically when you call library(tidyverse).
The benefit of using stringr is that it creates a common syntax and grammar for string manipulation. Each of the functions starts with str_ and takes a string vector as its first input.
It should be noted that when we’re talking about strings, we’re specifically referring to character data. These functions will not work directly on factor data.
Package Use Cases
There are a lot of string manipulation ideas captured by the stringr package, all of which can be broken out into a few general categories:
- Matching
- Subsetting
- Length management
- Mutation
- Joining and splitting
We’ll take a look at a couple of useful functions that fall under each of these categories in this section.
This cheat sheet covers these various aspects of the stringr package, and is helpful to look back on to jog your memory.
Matching
We’ll start with the topic of string matching. Matching functions are useful for finding whether a given pattern is present in a string. Each function takes a pattern input that we’re looking to find within a string input. Let’s say that we wanted to find all players who had jr. in their name (names so nice, they had to be used twice). We could use these functions to match that jr. pattern to our player name vector.
Let’s touch on some of the more important matching functions:
str_detect looks through our string input and returns TRUE for any string that includes the pattern and FALSE for those that don’t. Here we can see that 22 players in our vector have jr. in their name:
player_names %>%
  mutate(has_jr = str_detect(string = player, pattern = " jr.")) %>%
  # We'll summarize the output to get True / False counts
  group_by(has_jr) %>%
  count()
## # A tibble: 2 × 2
## # Groups: has_jr [2]
## has_jr n
## <lgl> <int>
## 1 FALSE 583
## 2 TRUE 22
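One caveat about the pattern used here: stringr treats patterns as regular expressions by default, and in a regular expression an unescaped . matches any character. That appears harmless for this dataset, but a stricter match would escape the dot or wrap the pattern in fixed(). A minimal sketch, using made-up names (toy_names is hypothetical and not part of the scraped data):

```r
library(stringr)

# Hypothetical names for illustration only
toy_names <- c("gary trent jr.", "al jrue", "steven adams")

# Unescaped dot: " jr" followed by ANY character matches, so "al jrue" is caught too
loose <- str_detect(toy_names, pattern = " jr.")

# Escaped dot: only a literal period after " jr" matches
escaped <- str_detect(toy_names, pattern = " jr\\.")

# fixed() compares the pattern as literal text, with no regex interpretation
literal <- str_detect(toy_names, pattern = fixed(" jr."))

loose    # TRUE  TRUE FALSE
escaped  # TRUE FALSE FALSE
literal  # TRUE FALSE FALSE
```

Either stricter form could be swapped into the other examples in this post without changing the idea being shown.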
str_which looks through our string input and returns the indices of any elements in our vector that include the pattern:
str_which(string = player_names$player, pattern = " jr.")
## [1] 60 73 74 79 94 96 152 217 252 270 293 330 354 404 405 432 451 452 453
## [20] 510 547 568
str_locate finds the start and end positions of the pattern in any string input that includes it. Here we use str_detect to filter for players in our vector who have jr. in their name prior to str_locate (that way, we won’t get a bunch of NA results for players without jr. in their name):
player_names %>%
  # Filtering for players with jr. in their name
  filter(str_detect(string = player, pattern = " jr.")) %>%
  pull(player) %>%
  str_locate(pattern = " jr.")
## start end
## [1,] 15 18
## [2,] 14 17
## [3,] 15 18
## [4,] 11 14
## [5,] 13 16
## [6,] 15 18
## [7,] 11 14
## [8,] 13 16
## [9,] 13 16
## [10,] 14 17
## [11,] 14 17
## [12,] 11 14
## [13,] 14 17
## [14,] 12 15
## [15,] 12 15
## [16,] 12 15
## [17,] 13 16
## [18,] 15 18
## [19,] 12 15
## [20,] 13 16
## [21,] 11 14
## [22,] 17 20
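To see why the filtering step matters, here is a small sketch of what str_locate returns for a name with and without the pattern. The two names are typed in by hand rather than pulled from player_names, and fixed() is used so the period is matched literally:

```r
library(stringr)

# One name with the suffix, one without
locations <- str_locate(c("troy brown jr.", "steven adams"),
                        pattern = fixed(" jr."))

# The first row holds the start (the space) and end (the period) of the match;
# the second row is NA because there is nothing to locate
locations
```

The result is a matrix with start and end columns, one row per input string.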
str_count counts the number of times we see our pattern in each string input. Here we can see that 583 players don’t have any instance of the pattern and 22 have a single instance:
player_names %>%
  mutate(jr_count = str_count(string = player, pattern = " jr.")) %>%
  # We'll summarize the output to get the number of players at each count
  group_by(jr_count) %>%
  count()
## # A tibble: 2 × 2
## # Groups: jr_count [2]
## jr_count n
## <int> <int>
## 1 0 583
## 2 1 22
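Since no name contains the suffix more than once, the counts above never exceed 1. A quick sketch with a single-letter pattern shows str_count returning larger values (the names are typed in by hand for illustration):

```r
library(stringr)

# Counting how many times the letter "n" appears in each name
n_counts <- str_count(c("giannis antetokounmpo", "bam adebayo"), pattern = "n")

n_counts  # 4 0
```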
Subsetting
Subsetting functions are either focused on filtering for a subset of our input vector or filtering for a subset of the actual text of our input vector.
Some major subsetting functions include:
str_sub takes start and end positional arguments and returns any text within those start and end points. This could, for example, be used to find the first x characters of each input string. It also accepts negative positions, which count backward from the end of the string:
# Finding the first three characters of the first 10 names
player_names %>%
  .[1:10, ] %>%
  mutate(first_3 = str_sub(player, start = 1, end = 3))
## # A tibble: 10 × 2
## player first_3
## <chr> <chr>
## 1 precious achiuwa pre
## 2 steven adams ste
## 3 bam adebayo bam
## 4 santi aldama san
## 5 lamarcus aldridge lam
## 6 nickeil alexander-walker nic
## 7 grayson allen gra
## 8 jarrett allen jar
## 9 jose alvarado jos
## 10 justin anderson jus
# Finding the last three characters of the first 10 names
player_names %>%
  .[1:10, ] %>%
  mutate(last_3 = str_sub(player, start = -3, end = -1))
## # A tibble: 10 × 2
## player last_3
## <chr> <chr>
## 1 precious achiuwa uwa
## 2 steven adams ams
## 3 bam adebayo ayo
## 4 santi aldama ama
## 5 lamarcus aldridge dge
## 6 nickeil alexander-walker ker
## 7 grayson allen len
## 8 jarrett allen len
## 9 jose alvarado ado
## 10 justin anderson son
str_subset filters our string input for a set pattern. This could probably fall into the matching category, but I included it here because it returns the text with the pattern rather than just identifying if there is a match:
# Finding any string that includes our jr. pattern
str_subset(string = player_names$player, pattern = " jr.")
## [1] "brandon boston jr." "charlie brown jr." "chaundee brown jr."
## [4] "troy brown jr." "vernon carey jr." "wendell carter jr."
## [7] "david duke jr." "tim hardaway jr." "danuel house jr."
## [10] "jaren jackson jr." "derrick jones jr." "kira lewis jr."
## [13] "kenyon martin jr." "larry nance jr." "rj nembhard jr."
## [16] "kelly oubre jr." "kevin porter jr." "michael porter jr."
## [19] "otto porter jr." "dennis smith jr." "gary trent jr."
## [22] "duane washington jr."
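One way to think about str_subset is as a shortcut for indexing a vector with str_detect. The sketch below uses a hypothetical three-name vector and also shows the negate argument, which I am assuming is available in your installed version (it was added in stringr 1.4.0):

```r
library(stringr)

# Hypothetical names for illustration only
toy_names <- c("gary trent jr.", "steven adams", "kelly oubre jr.")

# Keeping the strings that contain the pattern
with_jr <- str_subset(toy_names, pattern = fixed(" jr."))

# The same result built by hand from str_detect
same_thing <- toy_names[str_detect(toy_names, pattern = fixed(" jr."))]

# negate = TRUE keeps the strings that do NOT contain the pattern
without_jr <- str_subset(toy_names, pattern = fixed(" jr."), negate = TRUE)
```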
Length Management
Length management functions are fairly straightforward. They look at and manipulate how long our strings are.
Some useful ones include:
str_length gives the length of each string. This count includes whitespace and punctuation, so it’s not a count of letters alone:
# Finding the top 10 longest names
player_names %>%
  mutate(length = str_length(string = player)) %>%
  top_n(10, length) %>%
  arrange(desc(length))
## # A tibble: 10 × 2
## player length
## <chr> <int>
## 1 nickeil alexander-walker 24
## 2 kentavious caldwell-pope 24
## 3 shai gilgeous-alexander 23
## 4 timothé luwawu-cabarrot 23
## 5 thanasis antetokounmpo 22
## 6 jeremiah robinson-earl 22
## 7 quinndary weatherspoon 22
## 8 giannis antetokounmpo 21
## 9 sandro mamukelashvili 21
## 10 juan toscano-anderson 21
str_trim can be used to trim any trailing or leading whitespace while str_pad can be used to add characters so that strings are a constant width. We’ll look at these two combined, adding whitespace with str_pad and removing it with str_trim. Note that str_pad only adds characters, so a name already longer than the requested width keeps its length:
player_names %>%
  .[1:10, ] %>%
  # Adding trailing whitespace until a set width of 20
  mutate(added_ws = str_pad(string = player, side = "right",
                            pad = " ", width = 20)) %>%
  # Removing added whitespace from the padded field
  mutate(removed_ws = str_trim(string = added_ws, side = "right")) %>%
  group_by(player) %>%
  # Counting string length for all players
  summarize(across(.cols = c(added_ws, removed_ws), .fns = ~ str_length(.)))
## # A tibble: 10 × 3
## player added_ws removed_ws
## <chr> <int> <int>
## 1 bam adebayo 20 11
## 2 grayson allen 20 13
## 3 jarrett allen 20 13
## 4 jose alvarado 20 13
## 5 justin anderson 20 15
## 6 lamarcus aldridge 20 17
## 7 nickeil alexander-walker 24 24
## 8 precious achiuwa 20 16
## 9 santi aldama 20 12
## 10 steven adams 20 12
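str_trim only touches the ends of a string. If the data also has repeated spaces in the middle, stringr’s str_squish may be a better fit. A minimal sketch on a hand-made messy string (not from the scraped data):

```r
library(stringr)

# A made-up name with extra whitespace at both ends and in the middle
messy <- "  kevin   durant  "

# Removes leading and trailing whitespace, leaves the inner run of spaces alone
trimmed <- str_trim(messy, side = "both")

# Removes leading and trailing whitespace and collapses inner runs to one space
squished <- str_squish(messy)

trimmed   # "kevin   durant"
squished  # "kevin durant"
```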
Mutation
This section focuses on changing the underlying characteristics of a string.
Some important functions are:
str_to_lower, which we saw at the start of this post, puts all characters in lower case while str_to_upper puts them in upper case:
# Putting first 10 players' names into upper case
player_names %>%
  .[1:10, ] %>%
  mutate(player_upper = str_to_upper(string = player))
## # A tibble: 10 × 2
## player player_upper
## <chr> <chr>
## 1 precious achiuwa PRECIOUS ACHIUWA
## 2 steven adams STEVEN ADAMS
## 3 bam adebayo BAM ADEBAYO
## 4 santi aldama SANTI ALDAMA
## 5 lamarcus aldridge LAMARCUS ALDRIDGE
## 6 nickeil alexander-walker NICKEIL ALEXANDER-WALKER
## 7 grayson allen GRAYSON ALLEN
## 8 jarrett allen JARRETT ALLEN
## 9 jose alvarado JOSE ALVARADO
## 10 justin anderson JUSTIN ANDERSON
str_replace takes a specified pattern and replaces the first instance of it in the string input with a replacement string. str_replace_all does the same, but replaces all instances of the specified pattern rather than just the first instance:
# Replacing the first instance of jr. with ii
player_names %>%
  # Filtering for players with jr. in name
  filter(str_detect(string = player, pattern = " jr.")) %>%
  mutate(no_more_jr = str_replace(string = player, pattern = " jr.",
                                  replacement = " ii"))
## # A tibble: 22 × 2
## player no_more_jr
## <chr> <chr>
## 1 brandon boston jr. brandon boston ii
## 2 charlie brown jr. charlie brown ii
## 3 chaundee brown jr. chaundee brown ii
## 4 troy brown jr. troy brown ii
## 5 vernon carey jr. vernon carey ii
## 6 wendell carter jr. wendell carter ii
## 7 david duke jr. david duke ii
## 8 tim hardaway jr. tim hardaway ii
## 9 danuel house jr. danuel house ii
## 10 jaren jackson jr. jaren jackson ii
## # … with 12 more rows
## # ℹ Use `print(n = ...)` to see more rows
# Removing all whitespace in names field
player_names %>%
  .[1:10, ] %>%
  mutate(no_ws = str_replace_all(string = player, pattern = " ",
                                 replacement = ""))
## # A tibble: 10 × 2
## player no_ws
## <chr> <chr>
## 1 precious achiuwa preciousachiuwa
## 2 steven adams stevenadams
## 3 bam adebayo bamadebayo
## 4 santi aldama santialdama
## 5 lamarcus aldridge lamarcusaldridge
## 6 nickeil alexander-walker nickeilalexander-walker
## 7 grayson allen graysonallen
## 8 jarrett allen jarrettallen
## 9 jose alvarado josealvarado
## 10 justin anderson justinanderson
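The difference between the two functions is easiest to see on a string where the pattern occurs more than once. A small sketch on one name, with an arbitrary replacement chosen just to make the change visible:

```r
library(stringr)

name <- "nickeil alexander-walker"

# Only the first "a" is replaced
first_only <- str_replace(name, pattern = "a", replacement = "A")

# Every "a" is replaced
every_one <- str_replace_all(name, pattern = "a", replacement = "A")

first_only  # "nickeil Alexander-walker"
every_one   # "nickeil AlexAnder-wAlker"
```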
Joining and splitting
The last set of functions covers joining strings together or splitting them apart.
Some common functions include:
str_c can be used to combine two string vectors together. The sep argument specifies how the two vectors are separated when combined. There is also a collapse argument that, if set to a non-null value, will combine all vector elements into one string with its input used as a separator:
# Adding a test string to the first 10 players' names
str_c(player_names$player[1:10], "test", sep = " ")
## [1] "precious achiuwa test" "steven adams test"
## [3] "bam adebayo test" "santi aldama test"
## [5] "lamarcus aldridge test" "nickeil alexander-walker test"
## [7] "grayson allen test" "jarrett allen test"
## [9] "jose alvarado test" "justin anderson test"
# Combining the first 10 players into one string
str_c(player_names$player[1:10], collapse = ", ")
## [1] "precious achiuwa, steven adams, bam adebayo, santi aldama, lamarcus aldridge, nickeil alexander-walker, grayson allen, jarrett allen, jose alvarado, justin anderson"
str_split_fixed can be used to separate a string into a set number of parts based on a specific pattern. The result is a matrix with one column per part. We can use str_split instead to return a list output with no set number of breaks:
# We'll split on the first white space to separate names into first and last names
str_split_fixed(string = player_names$player[1:10], pattern = " ",
                n = 2)
## [,1] [,2]
## [1,] "precious" "achiuwa"
## [2,] "steven" "adams"
## [3,] "bam" "adebayo"
## [4,] "santi" "aldama"
## [5,] "lamarcus" "aldridge"
## [6,] "nickeil" "alexander-walker"
## [7,] "grayson" "allen"
## [8,] "jarrett" "allen"
## [9,] "jose" "alvarado"
## [10,] "justin" "anderson"
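To make the contrast with str_split concrete, here is a short sketch on two hand-typed names. With n = 2, str_split_fixed keeps everything after the first space in the second column, while str_split returns a list whose elements can differ in length:

```r
library(stringr)

# Hypothetical names for illustration only
toy_names <- c("gary trent jr.", "steven adams")

# Matrix output: always two columns, the remainder stays in column 2
fixed_two <- str_split_fixed(toy_names, pattern = " ", n = 2)

# List output: one character vector per name, split at every space
pieces <- str_split(toy_names, pattern = " ")

fixed_two[1, 2]  # "trent jr."
pieces[[1]]      # "gary"  "trent" "jr."
pieces[[2]]      # "steven" "adams"
```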
Applying Our Knowledge
So far, this post has been pretty boring… I’ve just shown some of the functions I find most useful from the stringr package. Let’s spice things up a bit with an actual problem (or at least make it as spicy as string manipulation will allow).
We’ll start by looking at anagrams! What is an anagram, you ask? It’s a word or phrase that we can make by rearranging the letters of another word or phrase. For example, silent is an anagram of listen.
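A common way to test whether two words are anagrams is to sort the letters of each and compare the results. A minimal sketch, where sort_letters is a hypothetical helper introduced only for illustration:

```r
library(stringr)

# Hypothetical helper: splits a word into characters, sorts them,
# and glues them back together into one string
sort_letters <- function(word) {
  str_c(sort(str_split(word, pattern = "")[[1]]), collapse = "")
}

# Anagrams produce the same sorted string; other words do not
silent_listen <- sort_letters("silent") == sort_letters("listen")
silent_silence <- sort_letters("silent") == sort_letters("silence")

silent_listen   # TRUE
silent_silence  # FALSE
```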
Now you might be wondering, what does this have to do with data science? Not much, but it’s actually a common interview question. In fact, I’ve experienced this question before, and though I can wax poetic about how AWFUL I think this is, it did inspire me to at least focus a bit more on string manipulation and share this blog post.
So, let’s start with the problem: I want to find if the word NBA is an anagram for any 3 character combination of our players’ names… I can already sense the eyes rolling into the back of your head, but don’t leave yet! You could be asked this on your next interview (even if you aren’t applying for a job in big anagram)!!!
We want to know if at any point in the player’s name, the word NBA appears, but remember: NBA does not have to appear in that set order.
We can start by combining our player names into one continuous set of characters. One wrench that I will throw in is that we want to exclude anything outside of a first or last name. As the split below shows, there are a handful of players who have a generational suffix after their name. We will want to ignore these.
str_split_fixed(player_names$player, " ", n = 3) %>%
  as_tibble() %>%
  filter(V3 != "") %>%
  group_by(V3) %>%
  count(sort = TRUE)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 5 × 2
## # Groups: V3 [5]
## V3 n
## <chr> <int>
## 1 jr. 22
## 2 iii 5
## 3 ii 2
## 4 iv 2
## 5 sr. 1
Let’s manipulate our player names so we can more easily analyze them. The code chunk below:
- Splits player names into three sections
  - First name, last name, and any following suffixes
- Combines first and last names into one field with no spaces
- Removes any punctuation from first and last name field
The output is stored in an object called anagram_df.
# Splitting names into 3 sections around blank spaces
anagram_df <- str_split_fixed(player_names$player, " ", n = 3) %>%
  as_tibble() %>%
  # Combining the first two splits while excluding the third split
  mutate(first_and_last = str_c(V1, V2, sep = "")) %>%
  # Removing any hyphens or apostrophes from the new field
  mutate(first_and_last = str_replace_all(string = first_and_last,
                                          pattern = "-|'",
                                          replacement = "")) %>%
  bind_cols(player = player_names$player) %>%
  select(player, first_and_last)
Now we have all of our players’ first and last names combined into a string with no additional punctuation or white space.
From here, we’ll create a function that will take our player names and search through them to see if some input can act as an anagram for any section of their name.
The function, anagram_search, takes an input string and a vector of player names and does the following:
- Finds the length of the input string and player name
- Breaks down player name by length of the input string
- At each break, sorts the characters of the break and compares them to the sorted characters of the input string
  - Order of characters doesn’t matter when it comes to anagrams
- If the sorted characters for any break are equal to the sorted characters of the input string, a result of TRUE is returned
anagram_search <- function(input_string = "nba", player_name) {
  # Length and sorted characters of the input only need to be found once
  input_length <- str_length(input_string)
  input_sorted <- str_c(sort(str_split(input_string, pattern = "")[[1]]), collapse = "")
  # Going to loop through each input name and perform function
  result_vector <- logical(length(player_name))
  for (i in seq_along(player_name)) {
    # Finding length of player name
    name_length <- str_length(player_name[i])
    # Number of full-length subsets in the name (zero if the name is too short)
    n_subsets <- max(name_length - input_length + 1, 0)
    # Looping through each subset of the player name and seeing if any are an anagram for the input
    subset_result <- logical(n_subsets)
    for (j in seq_len(n_subsets)) {
      name_subset <- str_sub(player_name[i], start = j, end = j + (input_length - 1))
      # Sorting the characters of the subset
      name_sorted <- sort(str_split(name_subset, pattern = "")[[1]])
      # Seeing if sorted subset and sorted input are equal
      subset_result[j] <- str_c(name_sorted, collapse = "") == input_sorted
    }
    result_vector[i] <- any(subset_result)
  }
  result_vector
}
With our function defined, let’s run it and see which players it returns as TRUE.
anagram_df %>%
  mutate(is_anagram = anagram_search(player_name = first_and_last)) %>%
  filter(is_anagram)
## # A tibble: 10 × 3
## player first_and_last is_anagram
## <chr> <chr> <lgl>
## 1 marvin bagley iii marvinbagley TRUE
## 2 desmond bane desmondbane TRUE
## 3 dalano banton dalanobanton TRUE
## 4 harrison barnes harrisonbarnes TRUE
## 5 jordan bell jordanbell TRUE
## 6 bogdan bogdanović bogdanbogdanović TRUE
## 7 bojan bogdanović bojanbogdanović TRUE
## 8 drew eubanks dreweubanks TRUE
## 9 boban marjanović bobanmarjanović TRUE
## 10 yuta watanabe yutawatanabe TRUE
In total, only 10 players in the 2021-22 NBA season had some section of their name that is an anagram for NBA.
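As an aside, the same sliding check can be sketched without the inner loop, since str_sub is vectorized over its start and end arguments. The helpers sort_letters and has_anagram below are hypothetical names, not part of stringr, and this is one possible variant rather than a replacement for anagram_search:

```r
library(stringr)

# Hypothetical helper: sorts the characters of each string in a vector
sort_letters <- function(x) {
  vapply(str_split(x, pattern = ""),
         function(chars) str_c(sort(chars), collapse = ""),
         character(1))
}

# Hypothetical helper: TRUE if any stretch of `name` is an anagram of `target`
has_anagram <- function(name, target = "nba") {
  k <- str_length(target)
  n_windows <- str_length(name) - k + 1
  # Names that are missing or shorter than the target cannot contain a match
  if (is.na(n_windows) || n_windows < 1) return(FALSE)
  starts <- seq_len(n_windows)
  # One call to str_sub pulls out every stretch of length k
  windows <- str_sub(name, start = starts, end = starts + k - 1)
  any(sort_letters(windows) == sort_letters(target))
}

results <- vapply(c("desmondbane", "stevenadams"), has_anagram,
                  logical(1), USE.NAMES = FALSE)

results  # TRUE FALSE
```

Here "desmondbane" matches through the stretch "ban", while "stevenadams" has no b at all.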
Although this example might have been a bit silly, it does put some of these stringr functions to good use. Now go off, and make all strings bow before your newfound manipulation capabilities!