R for IR: An Introduction

Presented at INAIR 2018 | http://bit.ly/2DHSTEi

Brendan J. Dugan, M.A. | bjdugan@iu.edu

March 22, 2018

Introduction

  • Quick poll
  • Who I am
  • Learning Outcomes:
    • "[L]earn foundational concepts, syntax and conventions, and broad utility of…R, necessary for further learning and application in a professional setting."
    • "[L]earn strategies for learning R, discover resources for further learning and reference, and hone troubleshooting skills…"
    • "[D]evelop a greater understanding of data manipulation, statistical testing, and basic data visualization using R."

Why R?

Against

  • No simple user interface, just (mostly) syntax
  • Wild West: conventions & approaches vary (e.g., tidyverse vs. base)
  • SPSS dominates social sciences
  • IR Professionals have various backgrounds

Why R?

For

  • Cost (compare Tableau, SPSS): time & effort
  • Community support
  • Increasing popularity (link)
  • Data manipulation
  • Statistics
  • Visualization
  • Dynamic reporting

Very Basics

Console

  • Destination for output (prefixed [1]) & syntax (entered at the > prompt)
  • Command history (History panel)
  • CTRL+L to clear
  • getRversion()
  • help() or ?function
  • q()
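A minimal console sketch of the functions above; `help()` returns a help-topic object, so the assignment simply avoids opening the pager:

```r
getRversion()        # prints the installed R version, e.g. 3.4.4
h <- help("mean")    # help() returns a help topic; ?mean is shorthand
```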

Very Basics: Setup

RStudio Setup & Panes

  • Source, Environment/History, Files/Plots/Help/Viewer
  • Tools > Global Options > General: opt out of saving .RData (session).
  • Code > Editing: all but the Soft-wrap source option; CTRL+ENTER runs code
  • Code > Display: margin 80 chars
  • Appearance: RStudio theme, code theme

Very Basics: Directories

Directories

  • Folder structure on your computer
getwd() # where am I?
## [1] "C:/Users/bjdugan/Google Drive/RforIR"
dir() # what's here?
## [1] "coffee.jpg"          "data_structures.png" "R-3.4.4-win.exe"    
## [4] "RforIR.Rproj"        "RforIR_working.html" "RforIR_working.Rmd" 
## [7] "RStudio-1.1.442.exe" "state_NENC.rds"

Very Basics: Directories

Directories

  • setwd() sets an absolute path unique to your machine
  • Projects (an RStudio feature) address the setwd() issue via relative paths ("./")
ls() # lists objects in (global) environment
## character(0)

Very Basics: Environments

Environments

  • R treats most things as objects held in memory*
  • Environments dictate how and where R finds objects (scoping)
environment() 
## <environment: R_GlobalEnv>
environment(mean) # where does mean() fxn exist?
## <environment: namespace:base>

Very Basics: Environments

Environments

  • Variables & data typically exist in Global Environment, functions in packages (base, utils, stats, psych, datasets)
  • Packages are sets of related functions (package::function())
  • install.packages("package"); library("package")
  • installed.packages()
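A minimal sketch of the package workflow; stats ships with every R installation, so no install.packages() call is needed here:

```r
library(stats)           # attach a package for the session
sd(c(2, 4, 6))           # now callable directly: 2
stats::sd(c(2, 4, 6))    # or fully qualified, without attaching
```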

R >> Calculator: Math & Logic Operators

  • +, -, *, /
  • a^b
  • 5 %% 2 (modulus)
  • 5 %/% 2.3 (integer division)
  • PEMDAS: ((a-b)/c)^d
  • 1 > 0
  • 99 <= 100
  • "dog" != "cat"
  • "dog" == "dog"
T == TRUE
## [1] TRUE

R >> Calculator: Objects

  • "Everything that exists [in R] is an object. Everything that happens is a function call." - John Chambers
  • Assignment operator <- (CTRL+'-')
x <- 5 # equivalent to x = 5, but clearer
5 -> x # works also
x - 1
## [1] 4
(x <- x - 1) # paren enclosure prints output
## [1] 4

Data Characteristics: Classes

  • Classes define basic characteristics of data
  • Logical < Integer < Numeric < Complex < Character (LINCC)
class(x)
## [1] "numeric"
is.logical(x)
## [1] FALSE
is.integer(x) # x <- 5L
## [1] FALSE
is.character(x)
## [1] FALSE
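The LINCC coercion order can be seen directly by mixing types in c() (a quick sketch):

```r
c(TRUE, 2L)              # logical promoted to integer: 1 2
c(1L, 2.5)               # integer promoted to numeric: 1.0 2.5
c(1, "a")                # numeric promoted to character: "1" "a"
class(c(TRUE, 1, "a"))   # "character" - the most general type wins
```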

Data Characteristics: Continued

  • attributes(), dim(), names() all provide further information on objects where applicable
str(x) # class/type and values; quick data summary
##  num 4
length(x)
## [1] 1
# attributes(state.x77)
# str(state.x77)
# dim(state.x77)

Data Characteristics: Character Strings

z <- "Dogs"
substr(z, 1, 2) # position or index reference
## [1] "Do"
tolower(z)
## [1] "dogs"
nchar(z) # counts characters; length(z) would be 1 (one element)
## [1] 4
grep("A", LETTERS, value = FALSE) # regular expressions
## [1] 1

Data Structures

  • atomic vectors & lists* (1 x j)
  • matrices & data frames* (i x j)

Data Structures: Related Functions

Data Structures: Vectors

  • Vectors are unidimensional collections of data sharing the same type (coercion)
  • Vector indices allow extraction of the Nth element (to extract characters within a string, use substr())
y <- 10:15
y[5]
## [1] 14
y[-1] 
## [1] 11 12 13 14 15
z[3]
## [1] NA
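Two further extraction methods, logical and named indexing, are worth sketching here; both reappear later when subsetting data frames:

```r
y <- 10:15
y[y > 12]               # logical indexing keeps elements meeting a condition: 13 14 15
names(y) <- letters[1:6]
y["c"]                  # named extraction: the "c" element, 12
```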

Data Structures: Lists

  • Lists are generic vectors that can contain mixed types, other lists (recursive)
y <- list(1, 2, "a", "b")
str(y)
## List of 4
##  $ : num 1
##  $ : num 2
##  $ : chr "a"
##  $ : chr "b"
(y <- list(first = 1:5, second = c("a", "b", "c")))
## $first
## [1] 1 2 3 4 5
## 
## $second
## [1] "a" "b" "c"

Data Structures: Factors

  • Factors are vectors with a levels attribute (metadata) for categorical data
  • Nominal and ordinal data are best captured as factors, not simply character vectors
z <- factor(c("Dogs", "Dogs", "Cats", "Mice"))
levels(z) # levels are automatically detected but can be specified
## [1] "Cats" "Dogs" "Mice"
table(z) 
## z
## Cats Dogs Mice 
##    1    2    1
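For ordinal data, levels and their order can be set explicitly; the likert vector below is invented for illustration:

```r
likert <- factor(c("Agree", "Disagree", "Agree", "Neutral"),
                 levels = c("Disagree", "Neutral", "Agree"),
                 ordered = TRUE)
likert[1] > likert[2]   # ordered factors support comparisons: TRUE
table(likert)           # counts respect the declared level order
```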

Break

Reading & Exploring Data

  • Understanding 'building blocks' of 2D data structures (datasets)
  • Object held in memory <- data on HDD, SQL pull, web scrape, etc.
  • Some data come with R (mtcars), but our data are often not simple matrices
  • read.csv() & read.table() (write.csv() & write.table() also exist)
  • foreign::read.spss(path, use.value.labels = TRUE, to.data.frame = TRUE)
  • readxl::read_excel() or xlsx::read.xlsx()
  • saveRDS() & readRDS() are much faster than read.spss(), but R-specific
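A round-trip sketch using the built-in mtcars data and a temporary file path (the tempfile() path is for illustration only; substitute your own):

```r
f <- tempfile(fileext = ".csv")          # throwaway path for this sketch
write.csv(mtcars, f, row.names = TRUE)   # row names hold the car models
cars2 <- read.csv(f, row.names = 1)      # first column back into row names
identical(dim(cars2), dim(mtcars))       # TRUE - nothing lost in transit
```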

Reading & Exploring Data

  • Per ?data.frame, data frames are "tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software."
  • Notation for referring to rows, columns, cells, or ranges:
  • df[1, 1] using numeric indices for rows, columns
  • df[case, variable] using named indices for rows, columns
  • No more AA279!
  • Often only interested in subgroups or handful of variables
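The notation above, sketched with the built-in mtcars data frame:

```r
mtcars[1, 1]                    # by position: 21 (Mazda RX4's mpg)
mtcars["Mazda RX4", "mpg"]      # same cell, by row and column name
mtcars[1:3, c("mpg", "cyl")]    # row and column ranges
```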

Reading & Exploring Data

  • US State Facts & Figures data, Statistical Abstract of the United States: 1977, datasets::state.x77
stateDF <- as.data.frame(state.x77) # convert matrix to DF in Global.
str(stateDF) # also View(stateDF) or simply object name: stateDF
## 'data.frame':    50 obs. of  8 variables:
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...

Reading & Exploring Data

  • nrow() & ncol() provide structural information, like dim()
length(stateDF)
## [1] 8
colnames(stateDF)[1:3] # vector extraction works with many basic functions
## [1] "Population" "Income"     "Illiteracy"
rownames(stateDF)[1:3] 
## [1] "Alabama" "Alaska"  "Arizona"

Reading & Exploring Data

Descriptive Statistics & Subgroups

summary(stateDF$Illiteracy) # $ operator for single variable or object in a list
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   0.625   0.950   1.170   1.575   2.800
summary(stateDF[ , c("Population", "HS Grad", "Illiteracy")]) 
##    Population       HS Grad        Illiteracy   
##  Min.   :  365   Min.   :37.80   Min.   :0.500  
##  1st Qu.: 1080   1st Qu.:48.05   1st Qu.:0.625  
##  Median : 2838   Median :53.25   Median :0.950  
##  Mean   : 4246   Mean   :53.11   Mean   :1.170  
##  3rd Qu.: 4968   3rd Qu.:59.15   3rd Qu.:1.575  
##  Max.   :21198   Max.   :67.30   Max.   :2.800

Reading & Exploring Data

Descriptive Statistics & Subgroups

  • If rownames are declared, we can refer to cases directly
  • If state (a unique case ID) is a variable, this gets trickier:
  • stateDF[stateDF$state == "Indiana", c("Population", "HS Grad", "Illiteracy")]
stateDF["Indiana", c("Population", "HS Grad", "Illiteracy")] 
##         Population HS Grad Illiteracy
## Indiana       5313    52.9        0.7

Reading & Exploring Data

Descriptive Statistics & Subgroups

  • Several methods for subsetting data
stateDF_HShi <- subset(stateDF, `HS Grad` >= 53.25) # will eval. NA to FALSE
nrow(stateDF_HShi)
## [1] 25
summary(stateDF_HShi[, c("Population", "HS Grad", "Illiteracy")])
##    Population       HS Grad        Illiteracy   
##  Min.   :  365   Min.   :53.30   Min.   :0.500  
##  1st Qu.:  746   1st Qu.:57.10   1st Qu.:0.600  
##  Median : 1203   Median :59.20   Median :0.600  
##  Mean   : 2624   Mean   :59.52   Mean   :0.876  
##  3rd Qu.: 2861   3rd Qu.:62.60   3rd Qu.:1.100  
##  Max.   :21198   Max.   :67.30   Max.   :2.200

Reading & Exploring Data

Descriptive Statistics & Subgroups

stateDF_HShi <- stateDF[stateDF$`HS Grad` >= 53.25, ] # preferred; backticks `` needed with $ for names containing spaces
nrow(stateDF_HShi)
## [1] 25
summary(stateDF_HShi[, c("Population", "HS Grad", "Illiteracy")])
##    Population       HS Grad        Illiteracy   
##  Min.   :  365   Min.   :53.30   Min.   :0.500  
##  1st Qu.:  746   1st Qu.:57.10   1st Qu.:0.600  
##  Median : 1203   Median :59.20   Median :0.600  
##  Mean   : 2624   Mean   :59.52   Mean   :0.876  
##  3rd Qu.: 2861   3rd Qu.:62.60   3rd Qu.:1.100  
##  Max.   :21198   Max.   :67.30   Max.   :2.200

Reading & Exploring Data

Descriptive Statistics & Subgroups

stateDF_HShi_40k <- stateDF[stateDF$`HS Grad` >= median(stateDF$`HS Grad`) &
                               stateDF$Income >= 4000, ] 
nrow(stateDF_HShi_40k)
## [1] 22
summary(stateDF_HShi_40k[, c("Population", "HS Grad", "Illiteracy")])
##    Population         HS Grad        Illiteracy    
##  Min.   :  365.0   Min.   :53.30   Min.   :0.5000  
##  1st Qu.:  762.5   1st Qu.:57.73   1st Qu.:0.6000  
##  Median : 1878.0   Median :59.40   Median :0.6000  
##  Mean   : 2860.7   Mean   :60.05   Mean   :0.8364  
##  3rd Qu.: 3040.2   3rd Qu.:62.83   3rd Qu.:1.0500  
##  Max.   :21198.0   Max.   :67.30   Max.   :1.9000

Reading & Exploring Data

Adding Variables

  • New variables can be added directly if the vector is as long as the DF
  • A simple way to add calculated variables is with the apply(data, margin, fun) family
  • apply() functions are generally preferred to for() loops (vectorized)
  • Not all calculated variables or fields need to be made with apply()
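A minimal apply() sketch over the built-in state.x77 matrix (margin 1 = rows, 2 = columns):

```r
round(apply(state.x77, 2, mean), 2)   # column means: one per variable
apply(state.x77, 1, max)[1:3]         # row maxima, first three states
```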

Reading & Exploring Data

Adding calculated variables

stateDF$PopulationR <- stateDF$Population * 1000 # Revert to actual estimate, from 1k's
stateDF[1:5, c("Population", "PopulationR", "HS Grad", "Illiteracy")]
##            Population PopulationR HS Grad Illiteracy
## Alabama          3615     3615000    41.3        2.1
## Alaska            365      365000    66.7        1.5
## Arizona          2212     2212000    58.1        1.8
## Arkansas         2110     2110000    39.9        1.9
## California      21198    21198000    62.6        1.1

Reading & Exploring Data

Adding variables using existing data

  • column-binding (cbind()) or row-binding (rbind()) joins lists or data frames of the same length
stateDF <- cbind(state.region, stateDF) # state.region is another pre-loaded R dataset
str(stateDF$state.region)
##  Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
head(stateDF)
##            state.region Population Income Illiteracy Life Exp Murder
## Alabama           South       3615   3624        2.1    69.05   15.1
## Alaska             West        365   6315        1.5    69.31   11.3
## Arizona            West       2212   4530        1.8    70.55    7.8
## Arkansas          South       2110   3378        1.9    70.66   10.1
## California         West      21198   5114        1.1    71.71   10.3
## Colorado           West       2541   4884        0.7    72.06    6.8
##            HS Grad Frost   Area PopulationR
## Alabama       41.3    20  50708     3615000
## Alaska        66.7   152 566432      365000
## Arizona       58.1    15 113417     2212000
## Arkansas      39.9    65  51945     2110000
## California    62.6    20 156361    21198000
## Colorado      63.9   166 103766     2541000

Reading & Exploring Data

Missing Data

  • Missing data frequently occur in social & educational surveys, records
  • Many methods exist for handling missingness (multiple imputation, complete cases)
  • Some R functions automatically exclude missing data, while others fail
mean(stateDF$Area) # land area in square miles
## [1] 70735.88
mean(c(stateDF$Area, NA), na.rm = FALSE) # set na.rm = TRUE to drop NAs
## [1] NA

Reading & Exploring Data

Missing Data

  • is.na() tests for NA values; complete.cases() identifies cases with no missing data
stateDF["Puerto Rico", ] <- NA # appending PR, with incomplete data
tail(stateDF)
##                state.region Population Income Illiteracy Life Exp Murder
## Virginia              South       4981   4701        1.4    70.08    9.5
## Washington             West       3559   4864        0.6    71.72    4.3
## West Virginia         South       1799   3617        1.4    69.48    6.7
## Wisconsin     North Central       4589   4468        0.7    72.48    3.0
## Wyoming                West        376   4566        0.6    70.29    6.9
## Puerto Rico            <NA>         NA     NA         NA       NA     NA
##               HS Grad Frost  Area PopulationR
## Virginia         47.8    85 39780     4981000
## Washington       63.5    32 66570     3559000
## West Virginia    41.6   100 24070     1799000
## Wisconsin        54.5   149 54464     4589000
## Wyoming          62.9   173 97203      376000
## Puerto Rico        NA    NA    NA          NA

Reading & Exploring Data

Missing Data

  • For more specific selection, df[!is.na(var), ] (only rows without NA on var will be kept)
stateDF <- stateDF[complete.cases(stateDF), ]
nrow(stateDF)
## [1] 50

Reading & Exploring Data

Missing Data

  • NAs are true missing values, different from NaN (Not a Number) and NULL
is.null(NA)
## [1] FALSE
is.nan(NA)
## [1] FALSE
0 / 0
## [1] NaN

Reading & Exploring Data

Missing Data

  • NA & NULL can be assigned as a value, but behave differently in data frames
stateDF$PopulationR <- NA
stateDF[1:4, "PopulationR"] 
## [1] NA NA NA NA
stateDF$PopulationR <- NULL 
colnames(stateDF) # our calculated field is now gone. 
## [1] "state.region" "Population"   "Income"       "Illiteracy"  
## [5] "Life Exp"     "Murder"       "HS Grad"      "Frost"       
## [9] "Area"

Reading & Exploring Data

Exporting data & Cleaning Environment

  • While most filter selections are done in data prep, we sometimes find a subset of interest in analysis
# Retain only states in Northeast and North Central
stateDF <- stateDF[stateDF$state.region == "Northeast" |
                     stateDF$state.region == "North Central", ]
stateDF$state.region <- droplevels(stateDF$state.region) # drop empty levels 
levels(stateDF$state.region)
## [1] "Northeast"     "North Central"
# save for later re-use
saveRDS(stateDF, "./state_NENC.rds")

Reading & Exploring Data

Exporting Data & Cleaning Environment

  • rm() lets us clean environment, but should be used carefully
  • Avoid rm(list=ls()), use broom icon if you must
ls() 
## [1] "stateDF"          "stateDF_HShi"     "stateDF_HShi_40k"
## [4] "x"                "y"                "z"
rm(stateDF, stateDF_HShi, stateDF_HShi_40k)
stateDF <- readRDS("./state_NENC.rds")
ls() # confirm
## [1] "stateDF" "x"       "y"       "z"

Control Flow

  • Control flow elements dictate task execution & iteration, i.e., automation
for (i in 1:5) {
  if (i %% 2 == 0) { 
    print(paste(i, "is an even number."))
  } else {print(paste(i, "is an odd number."))}
}
## [1] "1 is an odd number."
## [1] "2 is an even number."
## [1] "3 is an odd number."
## [1] "4 is an even number."
## [1] "5 is an odd number."
  • Also includes while(), repeat, break, next
  • Loop over cases in a data frame, files in a directory, etc., performing some function
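A small sketch of while with next, complementing the for/if example above:

```r
i <- 0; total <- 0
while (i < 10) {
  i <- i + 1
  if (i %% 2 == 0) next   # skip even numbers
  total <- total + i       # runs only for odd i
}
total                      # 1 + 3 + 5 + 7 + 9 = 25
```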

Statistical Testing & Evaluation

Basic Statistical Functions

  • mean(), var(), sd(), min(), max(), median(), sqrt(), sum()
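A quick sketch of these functions using the built-in state.x77 income column:

```r
income <- state.x77[, "Income"]       # per-capita income, 50 states
c(mean = mean(income), sd = sd(income), median = median(income))
sqrt(var(income)) == sd(income)       # sd is the square root of variance: TRUE
```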
myTable <- table(stateDF$state.region, stateDF$Population >= 5000) # cross-tabulation
str(myTable)
##  'table' int [1:2, 1:2] 5 8 4 4
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "Northeast" "North Central"
##   ..$ : chr [1:2] "FALSE" "TRUE"
myTable
##                
##                 FALSE TRUE
##   Northeast         5    4
##   North Central     8    4

Statistical Testing & Evaluation

Basic Statistical Functions

as.row.percent <- function(data) { # create custom functions with function()
  # this function manipulates a table, adding a row-total column and
  # re-computing the input table as row percentages. 

  # add row-total column
  x <- data.frame(cbind(data, 1:nrow(data))) 
  x[ , ncol(x)] <- 0 # impute 0 to nth column
  x[ , ncol(x)] <- apply(x, 1, sum) # sum across rows (1) of x
  colnames(x) <- c(colnames(data), "Row Total") 
  # convert to percentages
  for (i in 1:nrow(x)) { # for each row i in number of rows
    x[i, ] <- round(x[i, ] / x[i, ncol(x)] * 100, 2) 
  }
  print(x) # print (and invisibly return) the result
}

Statistical Testing & Evaluation

Basic Statistical Functions

myTable
##                
##                 FALSE TRUE
##   Northeast         5    4
##   North Central     8    4
as.row.percent(myTable)
##               FALSE  TRUE Row Total
## Northeast     55.56 44.44       100
## North Central 66.67 33.33       100

Statistical Testing & Evaluation

Basic Statistical Tests

  • A formula can be specified by y ~ x or y ~ x1 + x2
t.test(stateDF$Illiteracy ~ stateDF$state.region)
## 
##  Welch Two Sample t-test
## 
## data:  stateDF$Illiteracy by stateDF$state.region
## t = 2.9592, df = 11.094, p-value = 0.01289
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.07709645 0.52290355
## sample estimates:
##     mean in group Northeast mean in group North Central 
##                         1.0                         0.7

Statistical Testing & Evaluation

Basic Statistical Tests

  • Test output can be assigned, accessed later - great for reporting
tTest_results <- t.test(stateDF$Illiteracy ~ stateDF$state.region)
tTest_results$method
## [1] "Welch Two Sample t-test"
tTest_results$statistic
##        t 
## 2.959182
t.test(stateDF$Illiteracy ~ stateDF$state.region)$p.value
## [1] 0.01288507

Statistical Testing & Evaluation

  • Linear modeling
model.lifeExp <- lm(stateDF$`Life Exp` ~ stateDF$state.region + stateDF$Income +
                      stateDF$Illiteracy + stateDF$`HS Grad`)
model.lifeExp # or summary(model.lifeExp)
## 
## Call:
## lm(formula = stateDF$`Life Exp` ~ stateDF$state.region + stateDF$Income + 
##     stateDF$Illiteracy + stateDF$`HS Grad`)
## 
## Coefficients:
##                       (Intercept)  stateDF$state.regionNorth Central  
##                         65.373396                          -0.141801  
##                    stateDF$Income                 stateDF$Illiteracy  
##                          0.001001                          -1.901108  
##                 stateDF$`HS Grad`  
##                          0.059609

Statistical Testing & Evaluation

## 
## Call:
## lm(formula = stateDF$`Life Exp` ~ stateDF$state.region + stateDF$Income + 
##     stateDF$Illiteracy + stateDF$`HS Grad`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62866 -0.61130  0.09229  0.46129  1.66916 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       65.3733963  4.1880107  15.610  4.2e-11
## stateDF$state.regionNorth Central -0.1418014  0.6407274  -0.221    0.828
## stateDF$Income                     0.0010011  0.0007151   1.400    0.181
## stateDF$Illiteracy                -1.9011084  1.7518014  -1.085    0.294
## stateDF$`HS Grad`                  0.0596087  0.0732969   0.813    0.428
##                                      
## (Intercept)                       ***
## stateDF$state.regionNorth Central    
## stateDF$Income                       
## stateDF$Illiteracy                   
## stateDF$`HS Grad`                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.847 on 16 degrees of freedom
## Multiple R-squared:  0.3458, Adjusted R-squared:  0.1822 
## F-statistic: 2.114 on 4 and 16 DF,  p-value: 0.1264

Data Visualization

#install.packages("ggplot2") # installs from CRAN; running again will re-install
library(ggplot2) # call to environment ggplot2 package, where ggplot() resides
ggplot(data = model.lifeExp) +
  geom_qq(aes(sample = model.lifeExp$residuals)) + # Quantile-Quantile plot
  geom_abline() # add a reference line

Data Visualization

# library(ggplot2)
# # Define custom colors for output; color hex codes mapped to factor levels
# NSSEcolors2 <- c("Northeast" = "#EFAA22",
#                  "North Central" = "#002D6D")
# ggplot(stateDF) + 
#   geom_point(mapping = aes(x = `HS Grad`, y = `Life Exp`,
#                                colour = state.region)) +
#   scale_color_manual(values = NSSEcolors2) + # add NSSE colors
#   labs(title = "Life Expectancy and High School Graduation Rates, by Region") + # add title
#   theme_minimal()

Data Visualization

Syntax Hygiene & Other Good Habits

I.e., what I wish I knew 2 years ago

  • Datasets ought to be tidy: 1 case per row, 1 variable per column
  • Object, variable names should be interpretable to someone else (or yourself next month!) with minimal effort
  • df$v1 is easy to type, and easy to confuse
  • Maintain spacing, newlines, and tabs for readability
  • # comment concisely & frequently throughout, descriptively at beginning
  • #### code sections #### help organize code & are collapsible
  • Revisit old projects and identify areas for improvements
  • Expand on the familiar: do it in Excel, try in R.
  • Regular practice, exposure, and persistence are as important as c0d1ng sk1ll5

Resources