You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your ecommendations, so they must be backed up with compelling data insights and professional data visualizations.
Guiding questions
What is the problem you are trying to solve?
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
How can your insights drive business decisions?
Identify the business task
Consider key stakeholders
Main stakeholders:
Cyclistic executive team
Lily Moreno
Secundary stakeholder:
Where is your data located?
How is the data organized?
Are there issues with bias or credibility in this data?
Reliable -Yes, the data is reliable. The data is a primary source data based on a fictional company.
Original - Yes, the original public data can be located.
Comprehensive - Yes, no vital information is missing.
Current - Yes, the data base is updated monyhly.
How are you addressing licensing, privacy, security, and accessibility?
How did you verify the data’s integrity?
How does it help you answer your question?
Are there any problems with the data?
library(tidyverse) # used to filter the data
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(lubridate) #used to work with date class.
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(Hmisc)
Carregando pacotes exigidos: lattice
Carregando pacotes exigidos: survival
Carregando pacotes exigidos: Formula
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
# loding the files name and
files <- fs::dir_ls(path = "../database/")
files
../database/202007-divvy-tripdata.csv ../database/202008-divvy-tripdata.csv
../database/202009-divvy-tripdata.csv ../database/202010-divvy-tripdata.csv
../database/202011-divvy-tripdata.csv ../database/202012-divvy-tripdata.csv
../database/202101-divvy-tripdata.csv ../database/202102-divvy-tripdata.csv
../database/202103-divvy-tripdata.csv ../database/202104-divvy-tripdata.csv
../database/202105-divvy-tripdata.csv ../database/202106-divvy-tripdata.csv
../database/202107-divvy-tripdata.csv ../database/202108-divvy-tripdata.csv
../database/202109-divvy-tripdata.csv ../database/202110-divvy-tripdata.csv
../database/202111-divvy-tripdata.csv ../database/202112-divvy-tripdata.csv
bikeshare_data <- vroom::vroom(files, col_names = TRUE)
Rows: 8081804 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(bikeshare_data)
Rows: 8,081,804
Columns: 13
$ ride_id <chr> "762198876D69004D", "BEC9C9FBA0D4CF1B", "D2FD8EA432…
$ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
$ started_at <dttm> 2020-07-09 15:22:02, 2020-07-24 23:56:30, 2020-07-…
$ ended_at <dttm> 2020-07-09 15:25:52, 2020-07-25 00:20:17, 2020-07-…
$ start_station_name <chr> "Ritchie Ct & Banks St", "Halsted St & Roscoe St", …
$ start_station_id <chr> "180", "299", "329", "181", "268", "635", "113", "2…
$ end_station_name <chr> "Wells St & Evergreen Ave", "Broadway & Ridge Ave",…
$ end_station_id <chr> "291", "461", "156", "94", "301", "289", "140", "31…
$ start_lat <dbl> 41.90687, 41.94367, 41.93259, 41.89076, 41.91172, 4…
$ start_lng <dbl> -87.62622, -87.64895, -87.63643, -87.63170, -87.626…
$ end_lat <dbl> 41.90672, 41.98404, 41.93650, 41.91831, 41.90799, 4…
$ end_lng <dbl> -87.63483, -87.66027, -87.64754, -87.63628, -87.631…
$ member_casual <chr> "member", "member", "casual", "casual", "member", "…
bikeshare_data |>
is.na() |>
colSums()
ride_id rideable_type started_at ended_at
0 0 0 0
start_station_name start_station_id end_station_name end_station_id
785465 786088 849162 849623
start_lat start_lng end_lat end_lng
0 0 8137 8137
member_casual
0
The missing data are grouped at the location variable (station name, latitude and longitude)
Filtering and Process the data using the tools in the tidyverse.
In this fase we created the following variables:
trip_duration - the trip duration in minutes;
weekday_day - The day of the week the trip takes place;
is_weekend - Test if the day is a weekend;
date_month - Stores the month the trip takes place;
date_hour - Stores the hour the trip takes place;
date_season - Stores the season of the year;
day_time - Stores the time of the day;
trip_route - Stores the route of the trip (start station to end station).
Then we keep the following variable:
start_station_name;
ride_id;
rideable_type;
and member_casual.
the we exclude the remaning original variables.
then we change the class of the categorical variables to factor.
And finally, we filter the data to contain only trip duration longer than 0 minutes.
I chose not to exclude missing values due to being limited to location variables (station names and geographic markers), as well as excluding trips shorter than two minutes to minimize data collection errors.
# Filterring data.
bikeshare_data <- bikeshare_data |>
mutate(trip_duration = as.numeric(ended_at - started_at)/60, weekday_day = wday(started_at,
label = TRUE), is_weekend = ifelse((wday(started_at) == 7 | wday(started_at) ==
1), "yes", "no"), date_month = month(started_at, label = TRUE), date_hour = hour(started_at),
date_season = case_when(month(started_at) == 1 | month(started_at) == 2 |
month(started_at) == 3 ~ "winter", month(started_at) == 4 | month(started_at) ==
5 | month(started_at) == 6 ~ "spring", month(started_at) == 7 | month(started_at) ==
8 | month(started_at) == 9 ~ "summer", month(started_at) == 10 | month(started_at) ==
11 | month(started_at) == 12 ~ "fall"), day_time = case_when(hour(started_at) <
6 ~ "dawn", hour(started_at) >= 6 & hour(started_at) < 12 ~ "morning",
hour(started_at) >= 12 & hour(started_at) < 18 ~ "afternoon", hour(started_at) >=
18 ~ "night"), trip_route = str_c(start_station_name, end_station_name,
sep = " to ")) |>
relocate(start_station_name, .before = trip_route) |>
select(-(started_at:end_lng)) |>
mutate(is_weekend = factor(is_weekend, levels = c("yes", "no"), ordered = TRUE),
rideable_type = factor(rideable_type, levels = c("docked_bike", "electric_bike",
"classic_bike"), ordered = TRUE), member_casual = factor(member_casual,
levels = c("member", "casual"), ordered = TRUE), date_season = factor(date_season,
levels = c("winter", "spring", "summer", "fall"), ordered = TRUE), date_hour = factor(date_hour,
levels = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23), ordered = TRUE), day_time = factor(day_time,
levels = c("dawn", "morning", "afternoon", "night"), ordered = TRUE)) |>
filter(trip_duration > 2) # Only taking in account trips Higher than 2 minutes
glimpse(bikeshare_data)
Rows: 7,848,936
Columns: 21
$ ride_id <chr> "762198876D69004D", "BEC9C9FBA0D4CF1B", "D2FD8EA432…
$ rideable_type <ord> docked_bike, docked_bike, docked_bike, docked_bike,…
$ started_at <dttm> 2020-07-09 15:22:02, 2020-07-24 23:56:30, 2020-07-…
$ ended_at <dttm> 2020-07-09 15:25:52, 2020-07-25 00:20:17, 2020-07-…
$ trip_route <chr> "Ritchie Ct & Banks St to Wells St & Evergreen Ave"…
$ start_station_name <chr> "Ritchie Ct & Banks St", "Halsted St & Roscoe St", …
$ start_station_id <chr> "180", "299", "329", "181", "268", "635", "113", "2…
$ end_station_name <chr> "Wells St & Evergreen Ave", "Broadway & Ridge Ave",…
$ end_station_id <chr> "291", "461", "156", "94", "301", "289", "140", "31…
$ start_lat <dbl> 41.90687, 41.94367, 41.93259, 41.89076, 41.91172, 4…
$ start_lng <dbl> -87.62622, -87.64895, -87.63643, -87.63170, -87.626…
$ end_lat <dbl> 41.90672, 41.98404, 41.93650, 41.91831, 41.90799, 4…
$ end_lng <dbl> -87.63483, -87.66027, -87.64754, -87.63628, -87.631…
$ member_casual <ord> member, member, casual, casual, member, casual, mem…
$ trip_duration <dbl> 3.833333, 23.783333, 7.250000, 20.933333, 5.133333,…
$ weekday_day <ord> qui, sex, qua, sex, sáb, ter, qui, seg, qui, seg, s…
$ is_weekend <ord> no, no, no, no, yes, no, no, no, no, no, no, no, no…
$ date_month <ord> jul, jul, jul, jul, jul, jul, jul, jul, jul, jul, j…
$ date_hour <ord> 15, 23, 19, 19, 10, 16, 11, 16, 11, 18, 15, 18, 9, …
$ date_season <ord> summer, summer, summer, summer, summer, summer, sum…
$ day_time <ord> afternoon, night, night, night, morning, afternoon,…
# Using Hmisc package
bikeshare_summary <- bikeshare_data |>
select(-c(ride_id, start_station_name, trip_route)) |>
describe(descript = "Statistical Description Summary", tabular = TRUE)
html(bikeshare_summary, exclude1 = TRUE, align = "c", scroll = TRUE, rows = 50)
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 3 |
Value docked_bike electric_bike classic_bike Frequency 2170489 2440349 3238098 Proportion 0.277 0.311 0.413
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 2 |
Value member casual Frequency 4344181 3504755 Proportion 0.553 0.447
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7848936 | 0 | 31695 | 1 | 24.29 | 27.51 | 3.600 | 4.650 | 7.433 | 13.017 | 23.617 | 40.867 | 62.300 |
| lowest : | 2.016667 | 2.033333 | 2.050000 | 2.066667 | 2.083333 |
| highest: | 52701.383333 | 53921.600000 | 54283.350000 | 55691.683333 | 55944.150000 |
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 7 |
Value dom seg ter qua qui sex sáb Frequency 1192589 968361 1010452 1062400 1049751 1157123 1408260 Proportion 0.152 0.123 0.129 0.135 0.134 0.147 0.179
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 2 |
Value yes no Frequency 2600849 5248087 Proportion 0.331 0.669
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 12 |
Value jan fev mar abr mai jun jul ago set
Frequency 93893 48012 222701 328578 516940 709441 1340055 1387323 1252437
Proportion 0.012 0.006 0.028 0.042 0.066 0.090 0.171 0.177 0.160
Value out nov dez
Frequency 986096 598027 365433
Proportion 0.126 0.076 0.047
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 24 |
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 4 |
Value winter spring summer fall Frequency 364606 1554959 3979815 1949556 Proportion 0.046 0.198 0.507 0.248
| n | missing | distinct |
|---|---|---|
| 7848936 | 0 | 4 |
Value dawn morning afternoon night Frequency 335805 1791759 3505742 2215630 Proportion 0.043 0.228 0.447 0.282
size <- nrow(bikeshare_data)
bikeshare_data |>
group_by(date_hour) |>
summarise(percent = round((n() / size) * 100, 2)) |>
arrange(desc(percent)) |>
kable(caption = "Total percent usage by the hour of the day") |>
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center",
fixed_thead = TRUE
)|>
scroll_box(width = "100%", height = "500px")
| date_hour | percent |
|---|---|
| 17 | 10.03 |
| 18 | 8.80 |
| 16 | 8.39 |
| 15 | 7.13 |
| 19 | 6.50 |
| 14 | 6.48 |
| 13 | 6.39 |
| 12 | 6.25 |
| 11 | 5.32 |
| 20 | 4.52 |
| 10 | 4.23 |
| 8 | 4.14 |
| 9 | 3.71 |
| 7 | 3.49 |
| 21 | 3.41 |
| 22 | 2.87 |
| 23 | 2.12 |
| 6 | 1.94 |
| 0 | 1.41 |
| 1 | 0.96 |
| 5 | 0.72 |
| 2 | 0.58 |
| 3 | 0.32 |
| 4 | 0.29 |
Analyzing the data generated by the “describe” function we can infer that:
Regarding the type of bikes, “classic_bike” is more than 41% of all trips, followed by “eletric_bike” with 31% and “docked_bike” with 28%;
Regarding to the type of user, “member” represents 55% while “casual” represents 45%;
Regarding to the day, “weekend” represents 33% of the races with a peak on Saturday and a minimum on Monday;
Regarding the time of day, it can be observed that the peak occurs at 17, 18 and 16 hours. The races decrease from afternoon, night, morning, until dawn.
Regarding to the month and season, the values decrease from summer,fall, spring to winter. With the busiest months being August, July, September and Octuber and the least busy months being February, January and March;
Regarding to the duration of the trip, the average duration is 24.26 minutes.
bikeshare_skim_member <- bikeshare_data |>
select(-c(ride_id, start_station_name, trip_route)) |>
group_by(member_casual) |>
skimr::skim() |>
as_tibble()
bikeshare_skim_member |>
skimr::yank("numeric") |>
as_tibble() |>
kable(caption = "Trip duration difference between Casual users and members") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center", fixed_thead = TRUE) |>
scroll_box(width = "100%", height = "200px")
| skim_variable | member_casual | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| trip_duration | member | 0 | 1 | 14.71053 | 45.9871 | 2.016667 | 6.183333 | 10.43333 | 17.91667 | 33421.37 | ▇▁▁▁▁ |
| trip_duration | casual | 0 | 1 | 36.15625 | 306.0679 | 2.016667 | 10.000000 | 17.55000 | 32.28333 | 55944.15 | ▇▁▁▁▁ |
bikeshare_skim_member |>
skimr::yank("factor") |>
as_tibble() |>
kable(caption = "Behavior difference of Members and Casual users") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center", fixed_thead = TRUE) |>
scroll_box(width = "100%", height = "500px")
| skim_variable | member_casual | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|---|
| rideable_type | member | 0 | 1 | TRUE | 3 | cla: 1982937, ele: 1318011, doc: 1043233 |
| rideable_type | casual | 0 | 1 | TRUE | 3 | cla: 1255161, doc: 1127256, ele: 1122338 |
| weekday_day | member | 0 | 1 | TRUE | 7 | qua: 672545, qui: 645504, sex: 641688, ter: 640662 |
| weekday_day | casual | 0 | 1 | TRUE | 7 | sáb: 782770, dom: 655855, sex: 515435, qui: 404247 |
| is_weekend | member | 0 | 1 | TRUE | 2 | no: 3181957, yes: 1162224 |
| is_weekend | casual | 0 | 1 | TRUE | 2 | no: 2066130, yes: 1438625 |
| date_month | member | 0 | 1 | TRUE | 12 | ago: 700245, set: 670754, jul: 642136, out: 593567 |
| date_month | casual | 0 | 1 | TRUE | 12 | jul: 697919, ago: 687078, set: 581683, out: 392529 |
| date_hour | member | 0 | 1 | TRUE | 24 | 17: 456215, 18: 390430, 16: 368660, 15: 292211 |
| date_hour | casual | 0 | 1 | TRUE | 24 | 17: 331026, 18: 300233, 16: 289602, 15: 267719 |
| date_season | member | 0 | 1 | TRUE | 4 | sum: 2013135, fal: 1269120, spr: 807456, win: 254470 |
| date_season | casual | 0 | 1 | TRUE | 4 | sum: 1966680, spr: 747503, fal: 680436, win: 110136 |
| day_time | member | 0 | 1 | TRUE | 4 | aft: 1894379, mor: 1167354, nig: 1143103, daw: 139345 |
| day_time | casual | 0 | 1 | TRUE | 4 | aft: 1611363, nig: 1072527, mor: 624405, daw: 196460 |
Regarding the difference in usage between members and casual users, we can observe the following:
The trip duration is 2,45 times longer for casual users than members. Averaging 36.15 min for casual users and 14.71 min for members;
Regarding the type of bicycle, the most used for members, in descending order, are “classic”, “eletric”, and “docked”. For casual users they are “classic”, “docked” and “eletric”;
Regarding the time of year, both users follow the general average with a peak in summer and less use in winter, however for casual users spring is busier than fall;
The busiest member months are August, September and July. For casual users, the busiest months are July, August and September;
Regarding the day of the week, the busiest days for members, in descending order, are Wednesday, Thursday and Friday. For casual users, the busiest days are Saturday, Sunday and Friday. With greater usage of the service on weekends for casual members compared to members;
Regarding the time of day both types of users have more runs in the afternoon, however in casual members the night is busier than in the morning.
The Stations and routes more often used are the following:
bikeshare_data |>
group_by(start_station_name) |>
summarise(number_of_trips = n()) |>
arrange(-number_of_trips) |>
drop_na(start_station_name) |>
slice(1:10) |>
kable(caption = "Top 10 most popular station") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center", fixed_thead = TRUE)
| start_station_name | number_of_trips |
|---|---|
| Streeter Dr & Grand Ave | 108220 |
| Clark St & Elm St | 62730 |
| Theater on the Lake | 60991 |
| Michigan Ave & Oak St | 60550 |
| Wells St & Concord Ln | 60330 |
| Millennium Park | 59264 |
| Wells St & Elm St | 52592 |
| Clark St & Armitage Ave | 49025 |
| Clark St & Lincoln Ave | 48212 |
| Lake Shore Dr & Monroe St | 48087 |
bikeshare_data |>
group_by(trip_route) |>
summarise(number_of_trips = n()) |>
arrange(-number_of_trips) |>
drop_na(trip_route) |>
slice(1:10) |>
kable(caption = "Top 10 most popular route") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center", fixed_thead = TRUE)
| trip_route | number_of_trips |
|---|---|
| Streeter Dr & Grand Ave to Streeter Dr & Grand Ave | 16190 |
| Lake Shore Dr & Monroe St to Lake Shore Dr & Monroe St | 9421 |
| Millennium Park to Millennium Park | 9291 |
| Michigan Ave & Oak St to Michigan Ave & Oak St | 8816 |
| Buckingham Fountain to Buckingham Fountain | 6818 |
| Theater on the Lake to Theater on the Lake | 6147 |
| Indiana Ave & Roosevelt Rd to Indiana Ave & Roosevelt Rd | 6136 |
| Ellis Ave & 60th St to Ellis Ave & 55th St | 5871 |
| Fort Dearborn Dr & 31st St to Fort Dearborn Dr & 31st St | 5839 |
| Ellis Ave & 55th St to Ellis Ave & 60th St | 5339 |