This is an analysis of rides data from from “Cyclistic”, a fictional bike-share company. Users remove a bike from one of the company’s many stations around Chicago, and then return it to the same station or any other station. This is a “ride”.

The company has two kinds of users, “casual” users and “members”. Members pay a yearly fee instead of paying by the ride or by the day.

The Question

How do annual members and casual riders use Cyclistic bikes differently?

The Data

The data is really data shared by a real bike share company. CSV files of data about individual trips was downloaded from here: https://divvy-tripdata.s3.amazonaws.com/index.html

Data is available under this licence: https://ride.divvybikes.com/data-license-agreement

We will use data from the last twelve complete months, December 2020 through November 2021.

To run this notebook:

  1. Create a subdirectory “data”
  2. Download zip files for December 2020 through November 2021 into data/1_raw_downloads
  3. Unzip files into data/2_unzipped to create files with paths like data/2_unzipped/202012-divvy-tripdata/202012-divvy-tripdata.csv

Load all the .csv’s into a single data frame:

library("tidyverse")
library("lubridate")
library("here")
load_a_month <- function(month_str) {
  read_csv(
    here("data", 
        "2_unzipped", 
         paste(month_str, "-divvy-tripdata", sep=""), 
         paste(month_str, "-divvy-tripdata.csv", sep="")
    ),
    show_col_types = FALSE
  )
}

df <- rbind(load_a_month("202012"), load_a_month("202101"), load_a_month("202102"),
            load_a_month("202103"), load_a_month("202104"), load_a_month("202105"), 
            load_a_month("202106"), load_a_month("202107"), load_a_month("202108"), 
            load_a_month("202109"), load_a_month("202110"), load_a_month("202111"))

print(paste("Total rows: ",nrow(df)))
## [1] "Total rows:  5479096"

Look at the column names.

names(df)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Check for duplicate rows

sum(duplicated(df))
## [1] 0

Check columns for missing values.

for (col in colnames(df)) {
  print(paste(col, sum(is.na(df[,col]))))
}
## [1] "ride_id 0"
## [1] "rideable_type 0"
## [1] "started_at 0"
## [1] "ended_at 0"
## [1] "start_station_name 651445"
## [1] "start_station_id 651442"
## [1] "end_station_name 698909"
## [1] "end_station_id 698909"
## [1] "start_lat 0"
## [1] "start_lng 0"
## [1] "end_lat 4738"
## [1] "end_lng 4738"
## [1] "member_casual 0"

The columns with no missing values and are sufficient for the analysis.Select them. Compute a column of ride durations (in minutes). And compute a day_of_week column.

rides <- df |> 
         select(ride_id, rideable_type, started_at, ended_at, start_lat, start_lng, member_casual) |>
         mutate(duration=(as.numeric(ended_at) - as.numeric(started_at))/60, day_of_week=wday(started_at))
rides

Make sure ride_id’s are unique.

rides$ride_id |> duplicated() |> sum()
## [1] 0

Some rides were recorded as ending before they started and have negative durations:

min(rides$duration)
## [1] -29049.97

Remove rides with negative duration. Customer told us those were due to maintenance activities.

rides <- rides|> filter(duration > 0)

The cleaned data:

rides

Trip Durations

Check for outliers and other strangeneess.

rides |> arrange(desc(duration)) |> select(started_at, ended_at, duration, member_casual)

Some “rides” were weeks long. But the longest were by casual users. We should ask the customer if these bikes maybe were not returned and eventually found, or something like that.

Checking members’ rides only, we see that their longest rides are much shorter. Just a few days.

rides |> arrange(desc(duration)) |> select(started_at, ended_at, duration, member_casual) |> filter(member_casual=='member')

Count rides longer than 4 hours:

nrow(rides |> filter(duration > 4*60))
## [1] 16500

Compare trip durations.

Summarize by Duration and Day of Week

We see that casual riders’ trips have a longer average duration and a longer maximum duration. And the most popular day for casual riders is Saturday. While the most popular day for members is Wednesday.

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
day_names = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

rides |> group_by(member_casual) |> 
         summarize(mean_duration=mean(duration), 
                   max_duration=max(duration), 
                   most_popular_day=day_names[getmode(day_of_week)])

Analyze by Hour of Day and Day of Week

member_trips <- rides|> filter(member_casual=="member")
casual_trips <- rides|> filter(member_casual=="casual")

Number of casual users’ rides started in each hour

For casual rides, week day trips peak at evening commute time. Weekends have a smoother distribution and peak earlier. Ridership stays higher Friday and Saturday night and continues into the small hours of Saturday and Sunday morning.

Number of members’ rides started in each hour

Member trips have clear peaks at morning and evening commute times. However,the evening peak is always higher. Weekends have a smooth distribution which is highest in mid afternoon. Friday and Saturday nights have higher ridership which continues into the small hours of the next morning.

Analyze by Day of Week and by Month

Looking at casual trips by day of week and by month. Ridership is much lower during the cold months. Weekend ridership is higher than weekday ridership.

Member trips, by day of week and month. Members are more likely to ride all year, even in colder months. And in many months they ride more on weekdays than on weekends.

Investigate rideable_type

table(rides |> select(member_casual, rideable_type))
##              rideable_type
## member_casual classic_bike docked_bike electric_bike
##        casual      1258040      312344        918545
##        member      1962688        7801       1018604

Conclusions

Members and casual riders have similar patterns on the weekends. Rides have a smooth distribution over time with the peak in the middle of the afternoon. And members definitely do ride more in months with nicer weather than they do in the winter. So we can conclude that both groups ride recreationally on weekends.

Members weekday riding shows clear peaks at commute times. And our main conclusion is that members are more likely than casual riders to use Cyclistic bikes for commuting.

However, the evening commute peak is higher. This, plus the fact that member usage goes up in nicer months and down in colder months, suggests that members do not strictly depend on the bikes for their commute, and use them more after work when they have more time flexibility and/or are starting their evening’s recreation.

Casual riders’ weekday rides peak at the evening commute time (not at the weekend peak time) and have a small bump around the morning commute. So some casual riders are using Cyclistic bikes for commuting. And they might be a good target for campaigns to convert them into members.

Store summarized data

Summarize by day of week and hour of day and export this much smaller data set.

ride_counts <- 
  rides |>
  select(member_casual, started_at, duration) |>
  group_by(member_casual, start_hour=floor_date(started_at, "1 hour")) |>
  summarize(n=n(), avg_duration=mean(duration)) |>
  mutate(start_hour=as.character(start_hour), day_of_week=wday(start_hour, label=TRUE))
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
ride_counts

Save the result.

write_csv(ride_counts, "counts_by_hour_and_weekday.csv")

Summarize by month and export.

monthly_summary <- 
  rides |>
  mutate(month=month(started_at, label=TRUE)) |>
  select(member_casual, month, duration) |>
  group_by(member_casual, month) |>
  summarize(n=n(), avg_duration=mean(duration))
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
monthly_summary

Save the result.

write_csv(monthly_summary, "monthly_summary.csv")