class: center, middle, inverse, title-slide

.title[
# Describing Data Graphically with R
]
.author[
### S. Mason Garrison
]

---
layout: true

<div class="my-footer">
<span>
<a href="https://psychmethods.github.io/coursenotes/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# Describing Data

---

## Hans Rosling
---

# Summarize

.pull-left[
- Transform a pile of numbers into a summary
- Descriptive Statistics
- Distribution of a variable is a table/graph showing the categories/values of outcomes and their frequency/percentage of occurrence
- Exploratory Data Analysis (Tukey, 1977)
]

.pull-right[

]

---

# Exploratory Data Analysis

.pull-left[
- Tukey (1977)
- EDA
  - Graphical Data Analysis
  - Numbers as summaries
  - Emphasized Robust Statistics
]

--

.pull-right[

]

---

# Descriptive Statistics

- Examples
  - Tables
  - Graphs
  - Summary Statistics

---

# Tables

- Woodbridge (1845)

<img src="data:image/png;base64,#../img/woodbridge1845.png" width="100%" style="display: block; margin: auto;" />

---

# Graphs

- Minard (1869)

<img src="data:image/png;base64,#../img/minard.png" width="70%" style="display: block; margin: auto;" />

---

# Examples

- Summary Statistics
  - Measures of Central Tendency
  - Measures of Spread

<img src="data:image/png;base64,#../img/centralbears.jpg" width="30%" style="display: block; margin: auto;" />

---

# Categorical Variable Displays (Nominal, Ordinal)

- Frequency Distribution Graphs
  - Bar Chart (appropriate for nominal and ordinal data)
  - Pie Chart (best for nominal data with few categories)

--

- Quantitative Variables
  - Histograms
  - Stem plots

--

- Time Plots
  - (can be used for any level, often interval or ratio)

---

# Frequency distribution graph

.pull-left[
- Bar Chart
  - Graphs of variables with categories of outcomes on the x-axis and the frequency or percent of each category on the y-axis.
  - Appropriate for nominal and ordinal data
]

.pull-right[

<img src="data:image/png;base64,#../img/oresme.jpg" width="35%" style="display: block; margin: auto;" />

.footnote[Nicole Oresme (Bishop of Lisieux) circa 1350]
]

---

# Bar Graph/Chart

.pull-left.small[

``` r
# Bar chart
library(HistData)
Minard_troops_demo <- Minard.troops
Minard_troops_demo$group <- paste("Group", Minard_troops_demo$group)
# Count the observations recorded for each troop group
counts <- table(Minard_troops_demo$group)
```

]

--

.pull-right.small[

``` r
barplot(counts,
        main = "Bar Chart of Troops",
        xlab = "Group",
        ylab = "Observations",
        col = "lightblue")
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" />

]

--

- Note: This example uses nominal data (troop group) and counts how many data points we have for each group.
- Bar charts are suitable for nominal and ordinal data as they show frequency for discrete categories.

---

# Stacked Bar Chart (Total Troops)

.pull-left.small[

``` r
df <- data.frame(
  group = c("Group 1", "Group 2", "Group 3"),
  max = c(340000, 60000, 22000),
  min = c(4000, 28000, 6000))
head(df)
```

```
##     group    max   min
## 1 Group 1 340000  4000
## 2 Group 2  60000 28000
## 3 Group 3  22000  6000
```

]

.pull-right.small[

``` r
library(ggplot2)
bp <- ggplot(df, aes(x = "",
                     y = max / 1000, # to rescale
                     fill = group)) +
  geom_bar(width = 1, stat = "identity")
bp
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />

]

- Stacked bar charts can show the relationship between two categorical variables.

---

# Pie Chart

.pull-left[
- Graphs of variables with categories of outcomes as frequency or percent of each category in the pie.
- Best for nominal data with few categories
]

.pull-right[

<img src="data:image/png;base64,#../img/pieplay.jpg" width="90%" height="40%" style="display: block; margin: auto;" />

.footnote["A pie chart showing each state in the United States, part of Playfair's translation of A Statistical Account of the United States of America by D. F. Donnant."
]]

---

# Pie chart

.pull-left-narrow[

``` r
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
```

]

.pull-right-wide[

``` r
pie(slices,
    labels = lbls,
    main = "Pie Chart of Countries")
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />

]

---

# Example 2

.small.pull-left[

``` r
mytable <- table(iris$Species)
lbls <- paste(names(mytable), "\n", mytable, sep = "")
pie(mytable,
    labels = lbls,
    main = "Pie Chart of Species\n (with sample sizes)")
```

<img src="data:image/png;base64,#descriptive_files/figure-html/example2-1.png" width="90%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-15-1.png" width="65%" style="display: block; margin: auto;" />

]

- Pie charts are best for nominal data with few categories, showing part-to-whole relationships.
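---

# Example 3: Percent Labels

The pie slides describe slices as "frequency or percent of each category". A minimal sketch of the percent version, assuming the same built-in `iris` data as Example 2; `species_counts`, `species_pct`, and `pct_labels` are illustrative names, not part of the original examples.

```r
# Counts for a nominal variable (species in the built-in iris data)
species_counts <- table(iris$Species)

# Convert counts to each category's share of the whole, in percent
species_pct <- round(100 * prop.table(species_counts), 1)

# Build one label per slice: category name plus its percentage
pct_labels <- paste0(names(species_pct), "\n", species_pct, "%")

pie(species_counts,
    labels = pct_labels,
    main = "Pie Chart of Species\n (with percentages)")
```

- `prop.table()` divides each count by the total, which is what makes the part-to-whole reading of a pie chart explicit.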
---

# Convert Bar Chart into Pie Chart

.small.pull-left[

``` r
# Wrap the stacked bar around a circle to turn it into a pie
pie <- bp + coord_polar("y", start = 0)
pie
```

<img src="data:image/png;base64,#descriptive_files/figure-html/polar-1.png" width="90%" style="display: block; margin: auto;" />

``` r
pie + scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9"))
```

<img src="data:image/png;base64,#descriptive_files/figure-html/polar-2.png" width="90%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-16-1.png" width="65%" style="display: block; margin: auto;" /><img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-16-2.png" width="65%" style="display: block; margin: auto;" />

]

[Additional Resources](http://www.sthda.com/english/wiki/ggplot2-pie-chart-quick-start-guide-r-software-and-data-visualization)

---

# Quantitative Variables

- Interval or Ratio Scales
  - Histograms
  - Stem plots
  - Time plots

---

# Histogram

- A histogram is a graphical representation of the distribution of numerical data.
- Approximates a probability distribution
- First described by Pearson in 1895.

---

# Histogram

.pull-left[

``` r
library(MASS) # provides the cats data set
variable <- cats$Bwt # body weight in kilograms
hist(variable)
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-17-1.png" width="90%" style="display: block; margin: auto;" />

]

.pull-right[

``` r
# Convert to Imperial (kilograms to pounds)
variable <- variable * 2.2
hist(variable)
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-18-1.png" width="90%" style="display: block; margin: auto;" />

]

- Histograms are appropriate for interval and ratio data, showing the distribution of continuous variables.

---

# Boxplot (Ordinal predictor, Ratio outcome)

.pull-left[
- Boxplots can show the distribution of a ratio variable across categories of an ordinal variable.
]

.pull-right.small[

``` r
boxplot(mpg ~ cyl, data = mtcars,
        main = "MPG by Number of Cylinders",
        xlab = "Number of Cylinders (Ordinal)",
        ylab = "Miles Per Gallon (Ratio)")
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-19-1.png" width="90%" style="display: block; margin: auto;" />

]

---

# Stemplot

.pull-left-narrow[
- Sometimes called a stem and leaf diagram
- Splits each value into a stem and a leaf.
- The stem holds the leading digit(s) of the number and the leaf holds the final digit
- Stem and leaf plots are useful for displaying the distribution of interval or ratio data, especially for smaller datasets.
]

--

.pull-right-wide.small[
- Eruption duration (in minutes) is a ratio variable
- Each stem holds the ones and tenths digits (a stem of 16 means 1.6 minutes)
- Each leaf is the hundredths digit (16 | 7 is 1.67 minutes)
- Each row covers two stems (16 and 17, 18 and 19, ...), so the leaves start over partway through a row

``` r
# Stem and Leaf plot
stem(faithful$eruptions, scale = 1)
```

```
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   16 | 070355555588
##   18 | 000022233333335577777777888822335777888
##   20 | 00002223378800035778
##   22 | 0002335578023578
##   24 | 00228
##   26 | 23
##   28 | 080
##   30 | 7
##   32 | 2337
##   34 | 250077
##   36 | 0000823577
##   38 | 2333335582225577
##   40 | 0000003357788888002233555577778
##   42 | 03335555778800233333555577778
##   44 | 02222335557780000000023333357778888
##   46 | 0000233357700000023578
##   48 | 00000022335800333
##   50 | 0370
```

]

---

# Time Plots

<img src="data:image/png;base64,#../img/minard.png" width="30%" style="display: block; margin: auto;" />

--

- Edward Tufte has said that Minard's plot:

> "may well be the best statistical graphic ever drawn"

- It packs a ton of information into one dense figure.
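---

# Time Plots in R

Minard's figure is far richer than a typical time plot. As a minimal sketch of the ordinary case, assuming the built-in `AirPassengers` series as stand-in data (it is not used elsewhere in these slides), time goes on the x-axis and the measured variable on the y-axis.

```r
# AirPassengers ships with base R (datasets package):
# monthly totals of international airline passengers, 1949-1960, in thousands
passengers <- AirPassengers

# Calling plot() on a time series object draws a line with time on the x-axis
plot(passengers,
     main = "Monthly Airline Passengers, 1949-1960",
     xlab = "Year",
     ylab = "Passengers (thousands)")
```

- Passenger counts are a ratio variable, the most common case for time plots.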
---

# Time Plots

<img src="data:image/png;base64,#../img/minard.png" width="90%" style="display: block; margin: auto;" />

---

- The plot contains seven variables, each mapped to a different aesthetic:

| Information                           | Aesthetic       | Level of Measurement         |
|---------------------------------------|-----------------|------------------------------|
| Size of Napoleon's Grande Armée       | Width of path   | Ratio                        |
| Longitude of the army's position      | x-axis          | Ratio                        |
| Latitude of the army's position       | y-axis          | Ratio                        |
| Direction of the army's movement      | Color of path   | Nominal                      |
| Date of points along retreat path     | Text below plot | ??????? (What do you think?) |
| Temperature during the army's retreat | Line below plot | ??????? (What do you think?) |
| Geographic features                   | Map background  | ??????? (What do you think?) |

---

# Recreation in R

- This plot has been recreated in R by:
  - [Andrew Heiss](https://www.andrewheiss.com/blog/2017/08/10/exploring-minards-1812-plot-with-ggplot2/)
  - [Michael Friendly](http://www.datavis.ca/gallery/re-minard.php)
  - [Hadley Wickham](https://www.tandfonline.com/doi/suppl/10.1198/jcgs.2009.07098?scroll=top)

---

# Side by Side

.pull-left[

<img src="data:image/png;base64,#../img/minard.png" width="90%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-24-1.png" width="90%" style="display: block; margin: auto;" />

]

---

.tiny[

``` r
library(tidyverse)
library(lubridate)
library(ggplot2)
library(ggmap)
library(ggrepel)
library(gridExtra)
library(psych)

# Download directly
download <- FALSE # set to TRUE to download
if (download) {
  cities <- read.table("https://raw.githubusercontent.com/andrewheiss/fancy-minard/master/input/minard/cities.txt",
                       header = TRUE, stringsAsFactors = FALSE)
  troops <- read.table("https://raw.githubusercontent.com/andrewheiss/fancy-minard/master/input/minard/troops.txt",
                       header = TRUE, stringsAsFactors = FALSE)
  temps <-
    read.table("https://raw.githubusercontent.com/andrewheiss/fancy-minard/master/input/minard/temps.txt",
               header = TRUE, stringsAsFactors = FALSE)
} else {
  # Load from local
  cities <- read.table("dat/cities.txt", header = TRUE, stringsAsFactors = FALSE)
  troops <- read.table("dat/troops.txt", header = TRUE, stringsAsFactors = FALSE)
  temps <- read.table("dat/temps.txt", header = TRUE, stringsAsFactors = FALSE)
}
describe(cities)
```

```
##       vars  n  mean   sd median trimmed  mad  min  max range
## long     1 20 30.79 4.08  30.30   30.77 4.74 24.0 37.6  13.6
## lat      2 20 54.87 0.55  54.95   54.89 0.74 53.9 55.8   1.9
## city*    3 20 10.50 5.92  10.50   10.50 7.41  1.0 20.0  19.0
##        skew kurtosis   se
## long   0.17    -1.31 0.91
## lat   -0.20    -1.18 0.12
## city*  0.00    -1.38 1.32
```

``` r
describe(troops)
```

```
##            vars  n     mean        sd  median  trimmed      mad
## long          1 51    28.97      4.39    28.3    28.56     5.49
## lat           2 51    54.93      0.51    54.9    54.91     0.74
## survivors     3 51 91217.65 101718.61 40000.0 72880.49 50408.40
## direction*    4 51     1.51      0.50     2.0     1.51     0.00
## group         5 51     1.43      0.70     1.0     1.29     0.00
##               min      max    range  skew kurtosis       se
## long         24.0     37.7     13.7  0.64    -0.88     0.61
## lat          54.1     55.8      1.7  0.13    -1.29     0.07
## survivors  4000.0 340000.0 336000.0  1.31     0.48 14243.45
## direction*    1.0      2.0      1.0 -0.04    -2.04     0.07
## group         1.0      3.0      2.0  1.27     0.15     0.10
```

``` r
describe(temps)
```

```
##        vars n   mean    sd median trimmed   mad   min  max range
## long      1 9  30.63  4.30   29.2   30.63  4.15  25.3 37.6  12.3
## temp      2 9 -15.67 11.10  -20.0  -15.67 13.34 -30.0  0.0  30.0
## month*    3 9   1.89  0.78    2.0    1.89  1.48   1.0  3.0   2.0
## day       4 9  14.56  9.46   14.0   14.56 11.86   1.0 28.0  27.0
## date*     5 9   5.00  2.74    5.0    5.00  2.97   1.0  9.0   8.0
##         skew kurtosis   se
## long    0.34    -1.56 1.43
## temp    0.26    -1.66 3.70
## month*  0.15    -1.54 0.26
## day     0.06    -1.72 3.15
## date*   0.00    -1.60 0.91
```

``` r
temps$date <- as.Date(strptime(temps$date, "%d%b%Y"))
temps.nice <- temps %>%
  mutate(nice.label = paste0(temp, "°, ", month, ". ", day))

march.1812.plot.simple <- ggplot() +
  geom_path(data = troops,
            aes(x = long, y = lat, group = group,
                color = direction, size = survivors),
            lineend = "round") +
  geom_point(data = cities, aes(x = long, y = lat),
             color = "#DC5B44") +
  geom_text_repel(data = cities, aes(x = long, y = lat, label = city),
                  color = "#DC5B44") +
  scale_size(range = c(0.5, 10)) +
  scale_colour_manual(values = c("#DFC17E", "#252523")) +
  guides(color = "none", size = "none") + # FALSE is deprecated here in recent ggplot2
  theme_nothing()
march.1812.plot.simple
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-25-1.png" width="90%" style="display: block; margin: auto;" />

``` r
# Change the x-axis limits to match the simple map
# (panel_params replaces panel_ranges, which ggplot2 3.0.0 removed)
temps.1812.plot <- ggplot(data = temps.nice, aes(x = long, y = temp)) +
  geom_line() +
  geom_label(aes(label = nice.label), size = 2.5) +
  labs(x = NULL, y = "° Celsius") +
  scale_x_continuous(limits = ggplot_build(march.1812.plot.simple)$layout$panel_params[[1]]$x.range) +
  scale_y_continuous(position = "right") +
  coord_cartesian(ylim = c(-35, 5)) + # Add some space above/below
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks = element_blank(),
        panel.border = element_blank())
temps.1812.plot
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-25-2.png" width="90%" style="display: block; margin: auto;" />

``` r
# Combine the two plots
both.1812.plot.simple <- gtable_rbind(ggplotGrob(march.1812.plot.simple),
                                      ggplotGrob(temps.1812.plot))
both.1812.plot.simple
```

```
## TableGrob (32 x 13) "layout": 44 grobs
## z cells name
## 1 0 ( 1-16, 1-13) background
## 2 5 ( 8- 8, 6- 6) spacer
## 3 7 ( 9- 9, 6- 6) axis-l
## 4 3 (10-10, 6- 6) spacer
## 5 6 ( 8- 8, 7- 7) axis-t
## 6 1 ( 9- 9, 7- 7) panel
## 7 9 (10-10, 7- 7) axis-b
## 8 4 ( 8- 8, 8- 8) spacer
## 9 8 ( 9- 9, 8- 8) axis-r
## 10 2 (10-10, 8- 8) spacer
## 11 10 ( 7- 7, 7- 7) xlab-t
## 12
11 (11-11, 7- 7) xlab-b ## 13 12 ( 9- 9, 5- 5) ylab-l ## 14 13 ( 9- 9, 9- 9) ylab-r ## 15 14 ( 9- 9,11-11) guide-box-right ## 16 15 ( 9- 9, 3- 3) guide-box-left ## 17 16 (13-13, 7- 7) guide-box-bottom ## 18 17 ( 5- 5, 7- 7) guide-box-top ## 19 18 ( 9- 9, 7- 7) guide-box-inside ## 20 19 ( 4- 4, 7- 7) subtitle ## 21 20 ( 3- 3, 7- 7) title ## 22 21 (14-14, 7- 7) caption ## 23 0 (17-32, 1-13) background ## 24 5 (24-24, 6- 6) spacer ## 25 7 (25-25, 6- 6) axis-l ## 26 3 (26-26, 6- 6) spacer ## 27 6 (24-24, 7- 7) axis-t ## 28 1 (25-25, 7- 7) panel ## 29 9 (26-26, 7- 7) axis-b ## 30 4 (24-24, 8- 8) spacer ## 31 8 (25-25, 8- 8) axis-r ## 32 2 (26-26, 8- 8) spacer ## 33 10 (23-23, 7- 7) xlab-t ## 34 11 (27-27, 7- 7) xlab-b ## 35 12 (25-25, 5- 5) ylab-l ## 36 13 (25-25, 9- 9) ylab-r ## 37 14 (25-25,11-11) guide-box-right ## 38 15 (25-25, 3- 3) guide-box-left ## 39 16 (29-29, 7- 7) guide-box-bottom ## 40 17 (21-21, 7- 7) guide-box-top ## 41 18 (25-25, 7- 7) guide-box-inside ## 42 19 (20-20, 7- 7) subtitle ## 43 20 (19-19, 7- 7) title ## 44 21 (30-30, 7- 7) caption ## grob ## 1 zeroGrob[plot.background..zeroGrob.694] ## 2 zeroGrob[NULL] ## 3 absoluteGrob[GRID.absoluteGrob.688] ## 4 zeroGrob[NULL] ## 5 zeroGrob[NULL] ## 6 gTree[panel-1.gTree.686] ## 7 absoluteGrob[GRID.absoluteGrob.687] ## 8 zeroGrob[NULL] ## 9 zeroGrob[NULL] ## 10 zeroGrob[NULL] ## 11 zeroGrob[NULL] ## 12 zeroGrob[axis.title.x.bottom..zeroGrob.689] ## 13 zeroGrob[axis.title.y.left..zeroGrob.690] ## 14 zeroGrob[NULL] ## 15 zeroGrob[NULL] ## 16 zeroGrob[NULL] ## 17 zeroGrob[NULL] ## 18 zeroGrob[NULL] ## 19 zeroGrob[NULL] ## 20 zeroGrob[plot.subtitle..zeroGrob.692] ## 21 zeroGrob[plot.title..zeroGrob.691] ## 22 zeroGrob[plot.caption..zeroGrob.693] ## 23 rect[plot.background..rect.748] ## 24 zeroGrob[NULL] ## 25 zeroGrob[NULL] ## 26 zeroGrob[NULL] ## 27 zeroGrob[NULL] ## 28 gTree[panel-1.gTree.736] ## 29 absoluteGrob[GRID.absoluteGrob.737] ## 30 zeroGrob[NULL] ## 31 absoluteGrob[GRID.absoluteGrob.740] ## 32 
zeroGrob[NULL]
## 33 zeroGrob[NULL]
## 34 zeroGrob[NULL]
## 35 zeroGrob[NULL]
## 36 titleGrob[axis.title.y.right..titleGrob.743]
## 37 zeroGrob[NULL]
## 38 zeroGrob[NULL]
## 39 zeroGrob[NULL]
## 40 zeroGrob[NULL]
## 41 zeroGrob[NULL]
## 42 zeroGrob[plot.subtitle..zeroGrob.745]
## 43 zeroGrob[plot.title..zeroGrob.744]
## 44 zeroGrob[plot.caption..zeroGrob.746]
```

``` r
# Adjust panels
panels <- both.1812.plot.simple$layout$t[grep("panel", both.1812.plot.simple$layout$name)]

# Because this plot doesn't use coord_equal, since it's not a map,
# we can use whatever relative numbers we want, like a 3:1 ratio
both.1812.plot.simple$heights[panels] <- unit(c(3, 1), "null")

grid::grid.newpage()
grid::grid.draw(both.1812.plot.simple)
```

<img src="data:image/png;base64,#descriptive_files/figure-html/unnamed-chunk-25-3.png" width="90%" style="display: block; margin: auto;" />

]

---

# More Accessible Resources

- [R Graph Gallery](https://r-graph-gallery.com/)
- [Learning Statistics with R](https://learningstatisticswithr.com/)
- .hand-pink[[Data Science for Psychologists](https://datascience4psych.github.io/DataScience4Psych/)]

---
class: center, middle

# Wrapping Up...