R4DS_solutions

Lei Yan

2018/12/23

5.6.7 Exercises

library(tidyverse)
library(nycflights13)

Create not_cancelled for use in the problems below.

not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))

1.

I don’t like this question, so I choose to ignore it.

2.

Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows
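
One way to confirm that these rewrites match count() is to compare the results directly. This is a minimal sketch, assuming dplyr and nycflights13 are installed; the names by_count, by_summary, wt_count and wt_summary are introduced here only for illustration. The tibbles are converted with as.data.frame() so that all.equal() compares plain columns.

```r
library(dplyr)
library(nycflights13)

not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))

# Unweighted count: number of rows per destination
by_count   <- not_cancelled %>% count(dest)
by_summary <- not_cancelled %>% group_by(dest) %>% summarise(n = n())
all.equal(as.data.frame(by_count), as.data.frame(by_summary))

# Weighted count: wt = distance sums distance within each tailnum
wt_count   <- not_cancelled %>% count(tailnum, wt = distance)
wt_summary <- not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))
all.equal(as.data.frame(wt_count), as.data.frame(wt_summary))
```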

3.

Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

Because no flight has a missing dep_delay but a non-missing arr_delay: every flight that never departed also has no arrival delay, so the two conditions overlap. We can verify this using the code below.

flights %>%
  filter(is.na(dep_delay), !is.na(arr_delay))
## # A tibble: 0 x 19
## # ... with 19 variables: year <int>, month <int>, day <int>,
## #   dep_time <int>, sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

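So dep_delay is the most important column: a missing dep_delay means the flight never departed, whereas a flight with a recorded dep_delay but a missing arr_delay did take off and may, for example, have been diverted.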
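
The overlap between the two columns can also be checked from the other direction. The sketch below, assuming dplyr and nycflights13 are installed, tests whether the two-column condition selects the same rows as is.na(arr_delay) alone, and counts the flights that departed but have no arrival delay. The names same_rows and departed_no_arrival are illustrative only.

```r
library(dplyr)
library(nycflights13)

# TRUE if (is.na(dep_delay) | is.na(arr_delay)) flags exactly the rows
# that is.na(arr_delay) flags on its own
same_rows <- with(
  flights,
  identical(is.na(dep_delay) | is.na(arr_delay), is.na(arr_delay))
)
same_rows

# Flights with a recorded departure delay but no arrival delay
departed_no_arrival <- flights %>%
  filter(!is.na(dep_delay), is.na(arr_delay))
nrow(departed_no_arrival)
```

If the second count is non-zero, those rows are the ones on which the two definitions of "cancelled" disagree.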

4.

Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

flights %>%
  group_by(year, month, day) %>%
  summarise(cancelled = sum(is.na(dep_delay)),        # cancelled flights per day
            proportion = mean(is.na(dep_delay)),      # share of flights cancelled
            aver_dep = mean(dep_delay, na.rm = TRUE),
            aver_arr = mean(arr_delay, na.rm = TRUE)
            ) %>%
  ggplot(mapping = aes(x = proportion)) +
  # blue: average departure delay; red: average arrival delay
  geom_point(mapping = aes(y = aver_dep), color = 'blue', alpha = 0.5) +
  geom_point(mapping = aes(y = aver_arr), color = 'red', alpha = 0.5) + 
  ylab('average delay (min)')
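
The plot can be supplemented with a single number. As a rough sketch (assuming dplyr and nycflights13 are installed, and using the illustrative names daily and r), the Pearson correlation between the daily cancellation proportion and the daily average departure delay summarises how strongly the two move together. It is only a linear summary and does not replace looking at the scatter plot.

```r
library(dplyr)
library(nycflights13)

# One row per day: cancellation proportion and average departure delay
daily <- flights %>%
  group_by(year, month, day) %>%
  summarise(
    proportion = mean(is.na(dep_delay)),
    aver_dep = mean(dep_delay, na.rm = TRUE),
    .groups = "drop"
  )

# Pearson correlation; complete.obs skips any day with a missing average
r <- cor(daily$proportion, daily$aver_dep, use = "complete.obs")
r
```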

5.

Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

Let’s find the carrier with the worst delays, measured here by the largest single departure and arrival delay.

flights %>%
  group_by(carrier) %>%
  summarise(max_dep = max(dep_delay, na.rm = TRUE),
            max_arr = max(arr_delay, na.rm = TRUE)) %>%
  # desc() takes a single column, so each sort key is wrapped separately
  arrange(desc(max_dep), desc(max_arr)) %>% 
  slice(1)
## # A tibble: 1 x 3
##   carrier max_dep max_arr
##   <chr>     <dbl>   <dbl>
## 1 HA         1301    1272

Following the hint, count the flights for each carrier and destination pair:

flights %>% 
  group_by(carrier, dest) %>%
  summarise(n())
## # A tibble: 314 x 3
## # Groups:   carrier [?]
##    carrier dest  `n()`
##    <chr>   <chr> <int>
##  1 9E      ATL      59
##  2 9E      AUS       2
##  3 9E      AVL      10
##  4 9E      BGR       1
##  5 9E      BNA     474
##  6 9E      BOS     914
##  7 9E      BTV       2
##  8 9E      BUF     833
##  9 9E      BWI     856
## 10 9E      CAE       3
## # ... with 304 more rows
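
One possible way to use these counts, offered as a hedged sketch and not as a full answer to the challenge, is to compare each carrier only with the other carriers flying to the same destination. A destination served by a single carrier cannot separate the carrier effect from the airport effect, so those are dropped. The sketch assumes dplyr (1.0.0 or later, for .groups) and nycflights13; the names carrier_dest, relative and carrier_effect are illustrative only.

```r
library(dplyr)
library(nycflights13)

# Average arrival delay and flight count for each carrier on each destination
carrier_dest <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest, carrier) %>%
  summarise(mean_arr = mean(arr_delay), n = n(), .groups = "drop")

# Delay of each carrier relative to the flight-weighted average for that destination
relative <- carrier_dest %>%
  group_by(dest) %>%
  filter(n() > 1) %>%   # keep destinations served by at least two carriers
  mutate(diff_from_dest = mean_arr - weighted.mean(mean_arr, n)) %>%
  ungroup()

# Flight-weighted average of those differences per carrier, worst first
carrier_effect <- relative %>%
  group_by(carrier) %>%
  summarise(avg_diff = weighted.mean(diff_from_dest, n), routes = n()) %>%
  arrange(desc(avg_diff))
carrier_effect
```

Carrier and destination pairs with very few flights, such as several of those in the output above, give noisy averages, so any ranking from this sketch should be read with caution.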

6.

What does the sort argument to count() do. When might you use it?

If sort = TRUE, count() sorts the output in descending order of n. This is useful when you want to see the largest groups first.
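
A small illustration of the difference, using a hypothetical toy tibble (toy, and the result names unsorted and sorted, are made up for this example and are not part of nycflights13):

```r
library(dplyr)

# Hypothetical toy data: ORD appears 3 times, BOS twice, ATL once
toy <- tibble(dest = c("ORD", "BOS", "ORD", "ATL", "ORD", "BOS"))

# Default: rows come back ordered by the grouping variable (ATL, BOS, ORD)
unsorted <- toy %>% count(dest)
unsorted

# sort = TRUE: rows come back ordered by n, largest first (ORD, BOS, ATL)
sorted <- toy %>% count(dest, sort = TRUE)
sorted
```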