Working with Character Strings

Manipulating character strings is a common task, and much of it is done with regular expressions. Base R has its own regular expression functions, but the functions in the stringr package are more consistent in their naming and argument order, so we will use them. (stringr is built on top of the stringi package rather than on the base R functions, but the ideas carry over.)

The following packages will be used in this section of notes.

library(tidyverse)
# install.packages("stringr")
library(stringr)

Basic String Tasks

This section will discuss three basic string functions that help with simple string manipulations. These functions include: str_length, str_c, and str_sub.

str_length

The str_length function can be used to calculate the length of the string. For example:

string <- c('Iowa City', 'Cedar Rapids', 'Des Moines', 'IA')
str_length(string)
## [1]  9 12 10  2

str_c

The str_c function allows you to combine strings together in different ways. One way to think about this is to think about pasting strings together. For example:

str_c('Iowa City', 'Cedar Rapids', 'Des Moines', 'IA')
## [1] "Iowa CityCedar RapidsDes MoinesIA"

Perhaps more useful:

str_c(c('Iowa City', 'Cedar Rapids', 'Des Moines'), 'IA')
## [1] "Iowa CityIA"    "Cedar RapidsIA" "Des MoinesIA"

More useful yet:

str_c(c('Iowa City', 'Cedar Rapids', 'Des Moines'), 'IA', sep = ', ')
## [1] "Iowa City, IA"    "Cedar Rapids, IA" "Des Moines, IA"

You can also collapse a vector of strings into a single string using the collapse argument.

str_c(c('Iowa City', 'Cedar Rapids', 'Des Moines'), collapse = ', ')
## [1] "Iowa City, Cedar Rapids, Des Moines"
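The sep and collapse arguments can also be used together. The following is a small sketch using the same city names; the object names cities and combined are only for illustration.

```r
library(stringr)

cities <- c('Iowa City', 'Cedar Rapids', 'Des Moines')

# sep joins each city with 'IA' element by element;
# collapse then joins the three results into one string.
combined <- str_c(cities, 'IA', sep = ', ', collapse = '; ')
combined
## [1] "Iowa City, IA; Cedar Rapids, IA; Des Moines, IA"
```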

str_sub

The str_sub function is useful for subsetting strings by location. For example:

str_sub(c('Iowa City', 'Cedar Rapids', 'Des Moines'), 1, 4)
## [1] "Iowa" "Ceda" "Des "

You can use negative numbers to start from the end:

str_sub(c('Iowa City', 'Cedar Rapids', 'Des Moines'), -6, -1)
## [1] "a City" "Rapids" "Moines"
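The str_sub function also has an assignment form, which can be used to modify part of a string in place. As a sketch, here it is used with str_to_upper to capitalize the first letter of each element; the lowercase input vector is made up for this example.

```r
library(stringr)

cities <- c('iowa city', 'cedar rapids')

# Replace the first character of each string with its uppercase version.
str_sub(cities, 1, 1) <- str_to_upper(str_sub(cities, 1, 1))
cities
## [1] "Iowa city"    "Cedar rapids"
```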

Regular Expressions

Regular expressions are complicated and take a while to master. This introduction only scratches the surface to get you started. To see the basics of regular expressions, we are going to use the str_view function to view text matches.

The most basic regular expression is simply to match literal text. For example:

x <- c('Iowa City', 'Cedar Rapids', 'Des Moines')
str_view(x, 'City')

Note that generally, regular expressions are case sensitive.

str_view(x, 'city')

If you want the expression to ignore case, use the ignore_case argument in tandem with regex.

str_view(x, regex('city', ignore_case = TRUE))

Two other useful regular expression tools are anchoring and repeating patterns. Anchoring controls whether the match may occur anywhere in the string (the default), only at the beginning of the string, or only at the end of the string. To match at the start of the string:

x <- c('Iowa City', 'Des Moines, Iowa')
str_view(x, '^Iowa')

Or to match at the end of a string:

str_view(x, 'Iowa$')
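The two anchors can be combined to require that the pattern match the whole string. A small sketch, using str_detect (covered later in these notes) so the result prints as TRUE/FALSE; the vector x2 is made up for this example.

```r
library(stringr)

x2 <- c('Iowa City', 'Des Moines, Iowa', 'Iowa')

# '^Iowa$' only matches when the string is exactly 'Iowa'.
whole <- str_detect(x2, '^Iowa$')
whole
## [1] FALSE FALSE  TRUE
```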

There are three operators that are useful for matching repetitious strings.

  • ? 0 or 1 match
  • + 1 or more
  • * 0 or more

Examples of these are given below:

sounds <- c('baaaa', 'ssss', 'moo', 'buzz', 'purr')
str_view(sounds, 'a?')
str_view(sounds, 'a+')
str_view(sounds, 'rrr*')
str_view(sounds, 'rrr+')
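To see the difference between the three operators as text rather than highlighted output, one option is str_extract (covered later in these notes), which returns the first match in each string or NA if there is none. This sketch uses a slightly different pattern, a b followed by a repeated a, so the three operators give three different results.

```r
library(stringr)

sounds <- c('baaaa', 'ssss', 'moo', 'buzz', 'purr')

# 'b' followed by zero or one 'a'
zero_or_one <- str_extract(sounds, 'ba?')
zero_or_one
## [1] "ba" NA   NA   "b"  NA

# 'b' followed by one or more 'a'
one_or_more <- str_extract(sounds, 'ba+')
one_or_more
## [1] "baaaa" NA      NA      NA      NA

# 'b' followed by zero or more 'a'
zero_or_more <- str_extract(sounds, 'ba*')
zero_or_more
## [1] "baaaa" NA      NA      "b"     NA
```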

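There are additional repetition operators that use braces, {}, to specify the number of matches directly. Note that there are no spaces inside the braces.

  • {n} match exactly n
  • {n,} match n or more
  • {,m} match at most m (writing {0,m} is equivalent and more widely supported)
  • {n,m} match between n and m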
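A brief sketch of the brace operators, again using str_extract to return the matched text. The patterns here repeat the letter a so that they do not give away the exercise below.

```r
library(stringr)

sounds <- c('baaaa', 'ssss', 'moo', 'buzz', 'purr')

# Exactly two a's
exactly_two <- str_extract(sounds, 'a{2}')
exactly_two
## [1] "aa" NA   NA   NA   NA

# Two or more a's
two_or_more <- str_extract(sounds, 'a{2,}')
two_or_more
## [1] "aaaa" NA     NA     NA     NA

# Between one and three a's (as many as possible, up to three)
one_to_three <- str_extract(sounds, 'a{1,3}')
one_to_three
## [1] "aaa" NA    NA    NA    NA
```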

Exercises

  1. Using the str_view function and the sounds object created above, rewrite this regular expression using braces: str_view(sounds, 'rrr*').
  2. Explore the str_trim function. What does this do? Test this function on the following string: string <- "\n\nString with trailing and leading white space\n\n"

Regular Expression Functions

So far we have just visualized the regular expression match. This is useful for testing, but more commonly we would like to create a new variable based on information processed from text strings. The tools we will explore are: str_detect, str_count, str_extract, str_replace, and str_split.

Suppose we have the following vector of strings:

x <- c('Iowa City, Iowa', 'Cedar Rapids, IA', 'Des Moines, Iowa', 'Waterloo, IA', 'Rochester, Minnesota')
x
## [1] "Iowa City, Iowa"      "Cedar Rapids, IA"     "Des Moines, Iowa"    
## [4] "Waterloo, IA"         "Rochester, Minnesota"

Suppose we were interested in knowing which cities in this vector are in Iowa. The str_detect function is useful for this.

str_detect(x, 'Iowa$')
## [1]  TRUE FALSE  TRUE FALSE FALSE

This didn’t return all the correct matches due to formatting differences. There are two options to fix this. First, we could search for two strings:

str_detect(x, 'Iowa$|IA$')
## [1]  TRUE  TRUE  TRUE  TRUE FALSE

We could then calculate the proportion of cities in the vector that are in Iowa directly:

mean(str_detect(x, 'Iowa$|IA$'))
## [1] 0.8
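If the goal is to keep the matching strings rather than the TRUE/FALSE values, stringr also provides str_subset, and sum can be used on the logical vector to count matches. A short sketch with the same x:

```r
library(stringr)

x <- c('Iowa City, Iowa', 'Cedar Rapids, IA', 'Des Moines, Iowa',
       'Waterloo, IA', 'Rochester, Minnesota')

# Keep only the elements that match the pattern.
iowa_cities <- str_subset(x, 'Iowa$|IA$')
iowa_cities
## [1] "Iowa City, Iowa"  "Cedar Rapids, IA" "Des Moines, Iowa" "Waterloo, IA"

# TRUE counts as 1, so sum gives the number of matching elements.
n_iowa <- sum(str_detect(x, 'Iowa$|IA$'))
n_iowa
## [1] 4
```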

A useful function related to str_detect is str_count, which, instead of returning TRUE/FALSE, tells you how many matches are in each string.

str_count(x, 'Iowa$|IA$')
## [1] 1 1 1 1 0

Be careful with this function: it counts every match in a string, so without the end-of-string anchor, 'Iowa City, Iowa' is counted twice.

str_count(x, 'Iowa|IA')
## [1] 2 1 1 1 0

Replace Text

Above we handled the formatting differences by searching for two text strings. This works for a few different strings, but for more complex searches it can be useful to standardize the text so that it is the same across elements. This is the job of str_replace.

str_replace(x, 'Iowa$', 'IA')
## [1] "Iowa City, IA"        "Cedar Rapids, IA"     "Des Moines, IA"      
## [4] "Waterloo, IA"         "Rochester, Minnesota"

In addition to the strings themselves, this function takes two arguments: first the pattern to be matched and second the text the match should be changed to. If there are no matches, the text is not changed. You need to be careful with this function too:

str_replace(x, 'Iowa', 'IA')
## [1] "IA City, Iowa"        "Cedar Rapids, IA"     "Des Moines, IA"      
## [4] "Waterloo, IA"         "Rochester, Minnesota"

By default, the function will only replace the first match. If you’d like to replace all matches you need to use the str_replace_all function.

str_replace_all(x, 'Iowa', 'IA')
## [1] "IA City, IA"          "Cedar Rapids, IA"     "Des Moines, IA"      
## [4] "Waterloo, IA"         "Rochester, Minnesota"

Replacing every match is not helpful here, because it also changes the city name, but it is useful when every occurrence of a pattern should be standardized or removed.
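Returning to the formatting problem from the start of this section, standardizing first makes the later search simpler. A minimal sketch, where x_std and in_iowa are names chosen just for this example:

```r
library(stringr)

x <- c('Iowa City, Iowa', 'Cedar Rapids, IA', 'Des Moines, Iowa',
       'Waterloo, IA', 'Rochester, Minnesota')

# Standardize the state to 'IA', anchoring at the end so the
# city name 'Iowa City' is left alone.
x_std <- str_replace(x, 'Iowa$', 'IA')

# A single pattern now finds all of the Iowa cities.
in_iowa <- str_detect(x_std, 'IA$')
in_iowa
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
```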

Extract Text

If you wish to extract text instead of replacing it, str_extract is useful. For example, if we wished to extract the text Minnesota:

str_extract(x, 'Minnesota')
## [1] NA          NA          NA          NA          "Minnesota"

You can build more complicated expressions using the str_extract function. For example, suppose we wished to extract only the city name.

str_extract(x, '^.*,')
## [1] "Iowa City,"    "Cedar Rapids," "Des Moines,"   "Waterloo,"    
## [5] "Rochester,"

This included the comma as well, which may not be desired. We will show another way to achieve the same operation with the str_split function. One quick note about the above operation: the pattern uses a period (.), which matches any character except a newline character. To match a literal period, you would need to escape it, which is written as \\. in an R string.
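Two short sketches related to the note above. The first is one possible way to extract the city without the comma, using a character class that matches anything except a comma. The second shows the difference between an unescaped and an escaped period; the strings 'a.b' and 'acb' are made up for this example.

```r
library(stringr)

x <- c('Iowa City, Iowa', 'Cedar Rapids, IA', 'Des Moines, Iowa',
       'Waterloo, IA', 'Rochester, Minnesota')

# '[^,]+' matches one or more characters that are not a comma,
# so the match stops just before the first comma.
city <- str_extract(x, '^[^,]+')
city
## [1] "Iowa City"    "Cedar Rapids" "Des Moines"   "Waterloo"     "Rochester"

# An unescaped period matches any character; an escaped one
# matches only a literal period.
any_char <- str_detect(c('a.b', 'acb'), 'a.b')
any_char
## [1] TRUE TRUE
literal_dot <- str_detect(c('a.b', 'acb'), 'a\\.b')
literal_dot
## [1]  TRUE FALSE
```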

Split on Delimiter

If you’d like to split a string based on a common delimiter, using the str_split function is useful. For example, if we wished to split the city from the state:

str_split(x, ', ')
## [[1]]
## [1] "Iowa City" "Iowa"     
## 
## [[2]]
## [1] "Cedar Rapids" "IA"          
## 
## [[3]]
## [1] "Des Moines" "Iowa"      
## 
## [[4]]
## [1] "Waterloo" "IA"      
## 
## [[5]]
## [1] "Rochester" "Minnesota"

The str_split function will remove the delimiter that it used to split on. The function also allows you to simplify the structure:

str_split(x, ', ', simplify = TRUE)
##      [,1]           [,2]       
## [1,] "Iowa City"    "Iowa"     
## [2,] "Cedar Rapids" "IA"       
## [3,] "Des Moines"   "Iowa"     
## [4,] "Waterloo"     "IA"       
## [5,] "Rochester"    "Minnesota"

Now a matrix is returned.
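Because the result is a matrix, its columns can be pulled out by position and stored as variables. A small sketch, where the column names city and state and the object names parts and places are my own choices:

```r
library(stringr)

x <- c('Iowa City, Iowa', 'Cedar Rapids, IA', 'Des Moines, Iowa',
       'Waterloo, IA', 'Rochester, Minnesota')

parts <- str_split(x, ', ', simplify = TRUE)

# Column 1 holds the text before the delimiter, column 2 the text after.
places <- data.frame(city = parts[, 1], state = parts[, 2])
places
```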

Real World Example

To give a sense of some real world applications of regular expressions, I’m going to use the “ufo.csv” data we used once previously.

ufo <- read_csv('https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv')
## Rows: 8031 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Date / Time, City, State, Shape, Duration, Summary, Posted
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ufo
## # A tibble: 8,031 × 7
##    `Date / Time`  City                     State Shape   Duration Summary Posted
##    <chr>          <chr>                    <chr> <chr>   <chr>    <chr>   <chr> 
##  1 12/12/14 17:30 North Wales              PA    Triang… 5 minut… "I hea… <NA>  
##  2 12/12/14 12:40 Cartersville             GA    Unknown 3.6 min… "Looki… 12/12…
##  3 12/12/14 06:30 Isle of Man (UK/England) <NA>  Light   2 secon… "Over … 12/12…
##  4 12/12/14 01:00 Miamisburg               OH    Changi… <NA>     "Brigh… 12/12…
##  5 12/12/14 00:00 Spotsylvania             VA    Unknown 1 minute "White… 12/12…
##  6 12/11/14 23:25 Kenner                   LA    Chevron ~1 minu… "Stran… 12/12…
##  7 12/11/14 23:15 Eugene                   OR    Disk    2 minut… "Dual … 12/12…
##  8 12/11/14 20:04 Phoenix                  AZ    Chevron 3 minut… "4 Ora… 12/12…
##  9 12/11/14 20:00 Franklin                 NC    Disk    5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview                 WA    Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows

A few things may be of interest here. First, we may wish to add columns that split the Duration variable into a numeric value and a unit of time (seconds, minutes, hours).

ufo_duration <- str_split(ufo$Duration, ' ', simplify = TRUE)
cbind(ufo, ufo_duration) %>%
  head(n = 20)
##       Date / Time                     City State     Shape      Duration
## 1  12/12/14 17:30              North Wales    PA  Triangle     5 minutes
## 2  12/12/14 12:40             Cartersville    GA   Unknown   3.6 minutes
## 3  12/12/14 06:30 Isle of Man (UK/England)  <NA>     Light     2 seconds
## 4  12/12/14 01:00               Miamisburg    OH  Changing          <NA>
## 5  12/12/14 00:00             Spotsylvania    VA   Unknown      1 minute
## 6  12/11/14 23:25                   Kenner    LA   Chevron     ~1 minute
## 7  12/11/14 23:15                   Eugene    OR      Disk     2 minutes
## 8  12/11/14 20:04                  Phoenix    AZ   Chevron     3 minutes
## 9  12/11/14 20:00                 Franklin    NC      Disk     5 minutes
## 10 12/11/14 18:30                 Longview    WA  Cylinder    10 seconds
## 11 12/11/14 17:30                 Markesan    WI     Light    10 minutes
## 12 12/11/14 16:40               Birmingham    AL  Fireball    20 minutes
## 13 12/11/14 06:00             West Milford    NJ  Fireball    10 seconds
## 14 12/11/14 00:00             Williamsburg    VA       Egg    10 minutes
## 15 12/10/14 20:30                 Chandler    AZ    Sphere        1 hour
## 16 12/10/14 20:00                 Maricopa    AZ Formation 20-25 minutes
## 17 12/10/14 19:30          Litchfield Park    AZ Formation    20 minutes
## 18 12/10/14 19:15                  Flagler    CO     Light      1 minute
## 19 12/10/14 19:00                   Garner    NC     Light    12 minutes
## 20 12/10/14 17:30                  Ruidoso    NM  Fireball    20 minutes
##                                                                                                                                    Summary
## 1  I heard an extremely loud noise outside, and went onto my balcony to investigate. I saw an very very large green light headed my direct
## 2          Looking up towards the west I noticed an object that flashed from white to green to red.  ((NUFORC Note:  Possible star??  PD))
## 3                                                                       Over the Isle of Man, very fast moving light, diving then zooming.
## 4            Bright color changing and, shape shifting object seen over Miamisburg, OH.  ((NUFORC Note:  Possible "twinkling" star??  PD))
## 5                                                                White then orange orb gained a "tail of light" when chased off by a heli.
## 6                                                                            Strange, chevron-shaped, ufo moving east to west over Kenner.
## 7                                                                                          Dual orange orbs in Eugene, Oregon. 12/11/2014.
## 8                                                                                       4 Orange Lights Spotted South Of The Phoenix Area.
## 9             There were 5 or 6 lights in a row blinking, whites and reds.  It was just sitting there over top the ridge of the mountains.
## 10                                                                              Two cylinder shaped objects that flew parallel in the sky.
## 11                                                      Dark sky, large lights, nothing like an airplane, turning on and off in a pattern.
## 12                                                                                  UFOs moving fast like fireballs or individual rockets.
## 13                                                                                                               Strange light across sky.
## 14                                                                                       Bright light object with three clusters of light.
## 15                                  1-7 bright orange spheres seen for over an hour in Chandler, Arizona, near the Gila River Reservation.
## 16                                                                                                     Bright orange lights over Maricopa.
## 17                                                                                 Multiple lights in the sky in Litchfield Park, Arizona.
## 18                                                                                                                Eastern Colorado lights.
## 19                                                   Lights in distance quickly moving in every direction then shooting up at great speed.
## 20     1 lg. bright orange orb that split into 3 orbs.  Fighter jets chased them & they disappeared.  Mil. jets, helis, and a b2 followed.
##      Posted     1       2 3 4 5 6 7
## 1      <NA>     5 minutes          
## 2  12/12/14   3.6 minutes          
## 3  12/12/14     2 seconds          
## 4  12/12/14  <NA>                  
## 5  12/12/14     1  minute          
## 6  12/12/14    ~1  minute          
## 7  12/12/14     2 minutes          
## 8  12/12/14     3 minutes          
## 9  12/12/14     5 minutes          
## 10 12/12/14    10 seconds          
## 11 12/12/14    10 minutes          
## 12 12/12/14    20 minutes          
## 13 12/12/14    10 seconds          
## 14 12/12/14    10 minutes          
## 15 12/12/14     1    hour          
## 16 12/12/14 20-25 minutes          
## 17 12/12/14    20 minutes          
## 18 12/12/14     1  minute          
## 19 12/12/14    12 minutes          
## 20 12/12/14    20 minutes
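Splitting on a space produced seven columns because some durations contain more than two words. An alternative, sketched here under the assumption that the value comes first and the unit comes last, is to extract the two pieces directly with str_extract. The vector dur is a handful of values copied from the output above, not the full column.

```r
library(stringr)

dur <- c('5 minutes', '3.6 minutes', '~1 minute', '20-25 minutes', NA, '1 hour')

# First run of digits and periods; for a range like '20-25'
# this keeps only the first number.
value <- as.numeric(str_extract(dur, '[0-9.]+'))
value
## [1]  5.0  3.6  1.0 20.0   NA  1.0

# Lowercase letters at the end of the string.
unit <- str_extract(dur, '[a-z]+$')
unit
## [1] "minutes" "minutes" "minute"  "minutes" NA        "hour"
```

Values that do not follow this layout would need to be checked by hand.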

It could also be useful to count the number of times colors were mentioned in the summary text.

ufo %>%
  mutate(
    num_colors = str_count(Summary, 'white|green|red|blue|orange|purple|yellow')
  )
## # A tibble: 8,031 × 8
##    `Date / Time`  City            State Shape Duration Summary Posted num_colors
##    <chr>          <chr>           <chr> <chr> <chr>    <chr>   <chr>       <int>
##  1 12/12/14 17:30 North Wales     PA    Tria… 5 minut… "I hea… <NA>            1
##  2 12/12/14 12:40 Cartersville    GA    Unkn… 3.6 min… "Looki… 12/12…          3
##  3 12/12/14 06:30 Isle of Man (U… <NA>  Light 2 secon… "Over … 12/12…          0
##  4 12/12/14 01:00 Miamisburg      OH    Chan… <NA>     "Brigh… 12/12…          0
##  5 12/12/14 00:00 Spotsylvania    VA    Unkn… 1 minute "White… 12/12…          1
##  6 12/11/14 23:25 Kenner          LA    Chev… ~1 minu… "Stran… 12/12…          0
##  7 12/11/14 23:15 Eugene          OR    Disk  2 minut… "Dual … 12/12…          1
##  8 12/11/14 20:04 Phoenix         AZ    Chev… 3 minut… "4 Ora… 12/12…          0
##  9 12/11/14 20:00 Franklin        NC    Disk  5 minut… "There… 12/12…          2
## 10 12/11/14 18:30 Longview        WA    Cyli… 10 seco… "Two c… 12/12…          0
## # … with 8,021 more rows
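Recall that regular expressions are case sensitive by default, so the counts above miss colors that are capitalized, such as "White" in row 5 or "Orange" in row 8. A sketch of the difference, using three summaries copied from the data and the regex function with ignore_case shown earlier:

```r
library(stringr)

summaries <- c(
  'White then orange orb gained a "tail of light" when chased off by a heli.',
  '4 Orange Lights Spotted South Of The Phoenix Area.',
  'There were 5 or 6 lights in a row blinking, whites and reds.'
)
colors <- 'white|green|red|blue|orange|purple|yellow'

# Case sensitive (the default): capitalized colors are not counted.
n_sensitive <- str_count(summaries, colors)
n_sensitive
## [1] 1 0 2

# Ignoring case picks up 'White' and 'Orange' as well.
n_insensitive <- str_count(summaries, regex(colors, ignore_case = TRUE))
n_insensitive
## [1] 2 1 2
```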

An Easier Way to Manipulate Dates

The lubridate package in R makes working with date vectors much simpler.

# install.packages("lubridate")
library(lubridate)

First, we need to convert the Date / Time column in the ufo data to an actual date/time column. Note above that this column is currently a character vector. Fortunately, lubridate has functions for the common ways that dates and times are stored. The main hurdle in choosing the right function is identifying the pattern used in our data. Below I print the first 6 elements of the date/time vector. Notice that the format is month/day/year followed by hour:minute. We can use this information to parse the column into a date/time vector using lubridate’s built-in conversion tools.

head(ufo$`Date / Time`)
## [1] "12/12/14 17:30" "12/12/14 12:40" "12/12/14 06:30" "12/12/14 01:00"
## [5] "12/12/14 00:00" "12/11/14 23:25"

The primary way to determine which conversion tool to use is to understand lubridate’s shorthand notation. Below is a list showing these elements.

For date components, the shorthand notation is:

  • y = year
  • m = month
  • d = day

For time components, the shorthand notation is:

  • h = hours
  • m = minutes
  • s = seconds

Note that “m” stands for both minute and month; which one is meant depends on whether it appears in the date or the time part of the function name, and the lubridate package handles this for us. Based on this list and the pattern shown above, we can convert the column with the function mdy_hm(). This can be read in English as month, day, year followed by hour and minute.

ufo <- ufo %>%
  mutate(converted_date = mdy_hm(`Date / Time`))
## Warning: 56 failed to parse.
ufo
## # A tibble: 8,031 × 8
##    `Date / Time`  City   State Shape Duration Summary Posted converted_date     
##    <chr>          <chr>  <chr> <chr> <chr>    <chr>   <chr>  <dttm>             
##  1 12/12/14 17:30 North… PA    Tria… 5 minut… "I hea… <NA>   2014-12-12 17:30:00
##  2 12/12/14 12:40 Carte… GA    Unkn… 3.6 min… "Looki… 12/12… 2014-12-12 12:40:00
##  3 12/12/14 06:30 Isle … <NA>  Light 2 secon… "Over … 12/12… 2014-12-12 06:30:00
##  4 12/12/14 01:00 Miami… OH    Chan… <NA>     "Brigh… 12/12… 2014-12-12 01:00:00
##  5 12/12/14 00:00 Spots… VA    Unkn… 1 minute "White… 12/12… 2014-12-12 00:00:00
##  6 12/11/14 23:25 Kenner LA    Chev… ~1 minu… "Stran… 12/12… 2014-12-11 23:25:00
##  7 12/11/14 23:15 Eugene OR    Disk  2 minut… "Dual … 12/12… 2014-12-11 23:15:00
##  8 12/11/14 20:04 Phoen… AZ    Chev… 3 minut… "4 Ora… 12/12… 2014-12-11 20:04:00
##  9 12/11/14 20:00 Frank… NC    Disk  5 minut… "There… 12/12… 2014-12-11 20:00:00
## 10 12/11/14 18:30 Longv… WA    Cyli… 10 seco… "Two c… 12/12… 2014-12-11 18:30:00
## # … with 8,021 more rows

The resulting output has the converted_date column added to the original table. Note that we did get a warning, which says that some date/times could not be converted. These are likely due to missing data or values that do not follow the month/day/year hour:minute pattern. We would want to inspect these in more detail to understand why those 56 date/times failed to parse.
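One possible way to inspect failed conversions is to filter for rows where the parsed value is missing but the original text is not. The sketch below uses a small made-up data frame (toy) rather than the ufo data, and the quiet argument of mdy_hm to suppress the warning.

```r
library(dplyr)
library(lubridate)

# Made-up example: one valid value, one invalid value, one missing value.
toy <- data.frame(raw = c('12/12/14 17:30', 'unknown', NA),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(parsed = mdy_hm(raw, quiet = TRUE))

# Rows that had text but could not be parsed.
failed <- toy %>%
  filter(is.na(parsed), !is.na(raw))
failed
```

The same filter applied to the ufo data, with `Date / Time` in place of raw and converted_date in place of parsed, would show which values need attention.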

Once the values are in a date/time format, we can use additional functions from lubridate to pull out specific elements of the date or time. For example, we could use year(), month(), and day() to extract the year, month, or day from each element. There are similar functions for time: hour(), minute(), and second(). The wday() function returns the day of the week. Several of these are shown in use below.

ufo %>%
  mutate(
    year = year(converted_date),
    month = month(converted_date),
    day = day(converted_date),
    hour = hour(converted_date),
    minute = minute(converted_date),
    month_label_abbr = month(converted_date, label = TRUE),
    wday_abbr = wday(converted_date, label = TRUE),
    month_label = month(converted_date, label = TRUE, abbr = FALSE),
    wday = wday(converted_date, label = TRUE, abbr = FALSE)
  )
## # A tibble: 8,031 × 17
##    `Date / Time`  City   State Shape Duration Summary Posted converted_date     
##    <chr>          <chr>  <chr> <chr> <chr>    <chr>   <chr>  <dttm>             
##  1 12/12/14 17:30 North… PA    Tria… 5 minut… "I hea… <NA>   2014-12-12 17:30:00
##  2 12/12/14 12:40 Carte… GA    Unkn… 3.6 min… "Looki… 12/12… 2014-12-12 12:40:00
##  3 12/12/14 06:30 Isle … <NA>  Light 2 secon… "Over … 12/12… 2014-12-12 06:30:00
##  4 12/12/14 01:00 Miami… OH    Chan… <NA>     "Brigh… 12/12… 2014-12-12 01:00:00
##  5 12/12/14 00:00 Spots… VA    Unkn… 1 minute "White… 12/12… 2014-12-12 00:00:00
##  6 12/11/14 23:25 Kenner LA    Chev… ~1 minu… "Stran… 12/12… 2014-12-11 23:25:00
##  7 12/11/14 23:15 Eugene OR    Disk  2 minut… "Dual … 12/12… 2014-12-11 23:15:00
##  8 12/11/14 20:04 Phoen… AZ    Chev… 3 minut… "4 Ora… 12/12… 2014-12-11 20:04:00
##  9 12/11/14 20:00 Frank… NC    Disk  5 minut… "There… 12/12… 2014-12-11 20:00:00
## 10 12/11/14 18:30 Longv… WA    Cyli… 10 seco… "Two c… 12/12… 2014-12-11 18:30:00
## # … with 8,021 more rows, and 9 more variables: year <dbl>, month <dbl>,
## #   day <int>, hour <int>, minute <int>, month_label_abbr <ord>,
## #   wday_abbr <ord>, month_label <ord>, wday <ord>
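Once a component such as the hour is extracted, it can be used like any other variable, for example to count sightings by hour of the day. A minimal sketch on four date/times copied from the data; the data frame toy and the object by_hour are made up for this example.

```r
library(dplyr)
library(lubridate)

toy <- data.frame(raw = c('12/12/14 17:30', '12/12/14 12:40',
                          '12/11/14 23:25', '12/11/14 23:15'),
                  stringsAsFactors = FALSE)

by_hour <- toy %>%
  mutate(
    converted_date = mdy_hm(raw),   # parse the text to a date/time
    hour = hour(converted_date)     # pull out the hour (0 to 23)
  ) %>%
  count(hour)                       # number of rows for each hour
by_hour
##   hour n
## 1   12 1
## 2   17 1
## 3   23 2
```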