Cyclistic is a bike-share company in Chicago. Since 2016, it has operated a growing fleet of 5,800+ bikes across 692 stations. The company offers flexible pricing plans: single-ride, full-day passes, and annual memberships.
Customers fall into two categories:
Casual Riders: Purchase single or daily passes
Annual Members: Subscribe for long-term use
The company has found that annual members are significantly more
profitable. Instead of focusing solely on acquiring new customers, the
marketing team wants to convert casual riders into annual
members. To support this, the data analytics team has been
asked to identify usage differences between the two rider types.
Goal: Identify how casual riders and annual members use Cyclistic bikes differently, in order to inform targeted marketing strategies aimed at converting casual riders into members.
The dataset is based on real-world data provided by Motivate
International Inc., and made available under an open license. For
this case study, it offers a reliable foundation to explore usage trends
and differences between casual riders and annual members. The data covers
June 2024 through May 2025.
The data we have follows the ROCCC criteria:
Reliable – Data is collected directly from Divvy’s automated bike tracking system, ensuring accuracy and consistency.
Original – It is first-party data sourced directly from the official bike-sharing company.
Comprehensive – Covers a full year of rides with detailed trip, time, location, and rider type information.
Current – Updated monthly, with the data used in this study extending to May 2025.
Cited – Officially provided by Divvy (Lyft Bikes and Scooters) through their public data portal.
This section outlines the steps taken to clean and prepare the
dataset for analysis.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# lubridate, ggplot2, dplyr, and stringr are already attached by tidyverse 2.0.0;
# the explicit calls below are kept for clarity
library("lubridate")
library("ggplot2")
library("dplyr")
library("stringr")
dataset <- read.csv("C:\\Users\\saran\\Desktop\\data analysis projects\\case study\\Cyclistic\\Raw data\\202406-divvy-tripdata.csv")
nrow(dataset)
## [1] 710721
duplicates <- dataset[duplicated(dataset),]
duplicates
## [1] ride_id rideable_type started_at ended_at
## [5] start_station_name start_station_id end_station_name end_station_id
## [9] start_lat start_lng end_lat end_lng
## [13] member_casual
## <0 rows> (or 0-length row.names)
n_distinct(dataset$rideable_type)
## [1] 2
unique(dataset$rideable_type)
## [1] "electric_bike" "classic_bike"
n_distinct(dataset$member_casual)
## [1] 2
unique(dataset$member_casual)
## [1] "casual" "member"
Trim any leading or trailing whitespace from character columns.
dataset <- dataset %>%
  mutate(across(where(is.character), str_trim))
We filter out rows with missing or empty values in important fields like station names and IDs.
dataset <- dataset %>%
  filter(
    !(start_station_name == "" | is.na(start_station_name)),
    !(start_station_id == "" | is.na(start_station_id)),
    !(end_station_name == "" | is.na(end_station_name)),
    !(end_station_id == "" | is.na(end_station_id))
  ) %>%
  mutate(
    start_lat = round(as.numeric(start_lat), 6),
    start_lng = round(as.numeric(start_lng), 6),
    end_lat = round(as.numeric(end_lat), 6),
    end_lng = round(as.numeric(end_lng), 6)
  )
We convert station IDs to characters, remove any trailing decimals, and pad them with leading zeros for a consistent length.
dataset <- dataset %>%
  mutate(
    start_station_id = as.character(start_station_id),
    end_station_id = as.character(end_station_id),
    start_station_id = gsub("\\.0$", "", start_station_id),
    end_station_id = gsub("\\.0$", "", end_station_id),
    start_station_id = str_pad(start_station_id, width = 12, side = "left", pad = "0"),
    end_station_id = str_pad(end_station_id, width = 12, side = "left", pad = "0")
  )
We convert the started_at and ended_at columns to proper date-time format, calculate ride durations in seconds, and convert them to a readable HH:MM:SS format.
dataset <- dataset %>%
  mutate(
    started_at = ymd_hms(started_at),
    ended_at = ymd_hms(ended_at),
    start_date = as.Date(started_at),
    end_date = as.Date(ended_at),
    start_hour = hour(started_at),
    end_hour = hour(ended_at),
    ride_duration_secs = as.numeric(difftime(ended_at, started_at, units = "secs"))
  ) %>%
  filter(ride_duration_secs > 0) %>%
  mutate(
    ride_duration = sprintf(
      "%02d:%02d:%02d",
      as.integer(ride_duration_secs) %/% 3600,
      (as.integer(ride_duration_secs) %% 3600) %/% 60,
      as.integer(ride_duration_secs) %% 60
    )
  ) %>%
  select(-ride_duration_secs, -started_at, -ended_at)
After the cleaning process, we save the cleaned dataset to a CSV file for further analysis.
write_csv(dataset, "data_cleaned/2024_06_cleaned.csv")
We combine the cleaned data from multiple months to form quarterly datasets. This is useful for creating time-based analysis and comparisons.
q1_files <- c("data_cleaned/2024_06_cleaned.csv",
              "data_cleaned/2024_07_cleaned.csv",
              "data_cleaned/2024_08_cleaned.csv")
q2_files <- c("data_cleaned/2024_09_cleaned.csv",
              "data_cleaned/2024_10_cleaned.csv",
              "data_cleaned/2024_11_cleaned.csv")
q3_files <- c("data_cleaned/2024_12_cleaned.csv",
              "data_cleaned/2025_01_cleaned.csv",
              "data_cleaned/2025_02_cleaned.csv")
q4_files <- c("data_cleaned/2025_03_cleaned.csv",
              "data_cleaned/2025_04_cleaned.csv",
              "data_cleaned/2025_05_cleaned.csv")
combine_quarter <- function(file_list) {
  file_list %>%
    map_dfr(read_csv)
}
q1 <- combine_quarter(q1_files)
q2 <- combine_quarter(q2_files)
q3 <- combine_quarter(q3_files)
q4 <- combine_quarter(q4_files)
write_csv(q1, "data_quarters/Q1_2024_cleaned.csv")
write_csv(q2, "data_quarters/Q2_2024_cleaned.csv")
write_csv(q3, "data_quarters/Q3_2024_2025_cleaned.csv")
write_csv(q4, "data_quarters/Q4_2025_cleaned.csv")
q1 <- read.csv("C:/Users/saran/Desktop/data analysis projects/case study/Cyclistic/data_quarters/Q1_2024_cleaned.csv")
q1_cleaned <- q1 %>%
  filter(ride_duration != "00:00:00")

# String comparison works here because HH:MM:SS is fixed-width
q1_cleaned <- q1_cleaned[!(q1_cleaned$ride_duration > "00:50:00" | q1_cleaned$ride_duration < "00:02:00"), ]

q1_cleaned$day_of_week <- weekdays(as.Date(q1_cleaned$start_date))
q1_cleaned$day_of_week <- factor(
  q1_cleaned$day_of_week,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
)
An initial look at the quarter data showed several outliers on both ends: some rides lasting 0 seconds and others stretching to around 25 hours. We pick a reasonable range of 2 to 50 minutes for the exploration, and introduce columns for the day of the week and the month.
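The month column mentioned above is not shown in the chunk; it can be derived from start_date with lubridate. A minimal sketch on a toy frame (the column name month is an assumption):

```r
library(lubridate)

# Toy stand-in for q1_cleaned with the same start_date column
toy <- data.frame(start_date = as.Date(c("2024-06-15", "2024-08-01")))

# label = TRUE returns an ordered factor, so plots sort chronologically
toy$month <- month(toy$start_date, label = TRUE, abbr = FALSE)
```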
We start by analyzing the number of rides based on member type (i.e., whether the rider is a “member” or a “casual” user). This gives us an overview of the distribution of rides across these two categories.
q1_cleaned %>%
  count(member_casual)
## member_casual n
## 1 casual 583300
## 2 member 873977
Next, we explore the number of rides based on the type of bike used (rideable_type). This helps us understand the distribution of bike usage across different types.
q1_cleaned %>%
  count(rideable_type)
## rideable_type n
## 1 classic_bike 971142
## 2 electric_bike 486135
Classic bikes appear to be the more popular choice among riders.
We can further investigate how each member type uses the different types of bikes.
q1_cleaned %>%
  count(rideable_type, member_casual)
## rideable_type member_casual n
## 1 classic_bike casual 382067
## 2 classic_bike member 589075
## 3 electric_bike casual 201233
## 4 electric_bike member 284902
We now analyze the ride duration for both members and casual users. To do this, we calculate the average and median ride duration in minutes for each group.
q1_cleaned %>%
  mutate(ride_duration_secs = as.numeric(hms(ride_duration))) %>%
  group_by(member_casual) %>%
  summarise(
    avg_ride_duration = mean(ride_duration_secs, na.rm = TRUE) / 60,
    median_ride_duration = median(ride_duration_secs, na.rm = TRUE) / 60
  )
## # A tibble: 2 × 3
## member_casual avg_ride_duration median_ride_duration
## <chr> <dbl> <dbl>
## 1 casual 16.5 13.4
## 2 member 12.5 9.77
The gap between the average and the median suggests a few long outliers are pulling the average upward. The median gives a better picture of typical trip durations.
Next, we analyze the number of rides that occurred on each day of the week.
q1_cleaned %>%
  group_by(day_of_week) %>%
  summarise(rides = n())
## # A tibble: 7 × 2
## day_of_week rides
## <fct> <int>
## 1 Monday 185912
## 2 Tuesday 192848
## 3 Wednesday 218415
## 4 Thursday 205329
## 5 Friday 213696
## 6 Saturday 241968
## 7 Sunday 199109
We also explore how the number of rides is distributed across different days of the week for each member type.
q1_cleaned %>%
  group_by(day_of_week, member_casual) %>%
  summarise(rides = n()) %>%
  pivot_wider(names_from = member_casual, values_from = rides) %>%
  rename_with(~ paste0(.x, "_rides"), -day_of_week)
## # A tibble: 7 × 3
## # Groups: day_of_week [7]
## day_of_week casual_rides member_rides
## <fct> <int> <int>
## 1 Monday 62199 123713
## 2 Tuesday 61256 131592
## 3 Wednesday 75541 142874
## 4 Thursday 73176 132153
## 5 Friday 88844 124852
## 6 Saturday 124490 117478
## 7 Sunday 97794 101315
The number of rides increases around the weekend for casual riders, whereas for members it stays roughly level across the week. Member trips outnumber casual trips on every day except Saturday, where casual rides pull ahead.
Understanding the distribution of rides across different hours of the day can provide insights into peak usage times.
q1_cleaned %>%
  group_by(start_hour) %>%
  summarise(rides = n()) %>%
  arrange(desc(rides))
## # A tibble: 24 × 2
## start_hour rides
## <int> <int>
## 1 17 153096
## 2 16 131362
## 3 18 124729
## 4 15 100213
## 5 19 93998
## 6 14 86482
## 7 13 85259
## 8 12 83941
## 9 8 76121
## 10 11 72384
## # ℹ 14 more rows
The busiest hours fall in the late afternoon and evening (15:00 to 19:00), with a smaller morning peak around 8:00, which lines up with typical commute times.
Next, we look at the ride distribution based on start station names, which helps in identifying the most popular stations.
q1_cleaned %>%
  group_by(start_station_name) %>%
  summarise(rides = n()) %>%
  arrange(desc(rides))
## # A tibble: 1,440 × 2
## start_station_name rides
## <chr> <int>
## 1 Streeter Dr & Grand Ave 24164
## 2 DuSable Lake Shore Dr & North Blvd 18834
## 3 Michigan Ave & Oak St 15334
## 4 DuSable Lake Shore Dr & Monroe St 15166
## 5 Theater on the Lake 12872
## 6 Kingsbury St & Kinzie St 12115
## 7 Clark St & Elm St 11265
## 8 Millennium Park 10602
## 9 Wells St & Concord Ln 10419
## 10 Shedd Aquarium 9449
## # ℹ 1,430 more rows
We also filter the data to analyze the distribution of rides based on stations specifically for members.
q1_cleaned %>%
  filter(member_casual == 'member') %>%
  group_by(start_station_name) %>%
  summarise(rides = n()) %>%
  arrange(desc(rides))
## # A tibble: 1,314 × 2
## start_station_name rides
## <chr> <int>
## 1 Kingsbury St & Kinzie St 8562
## 2 DuSable Lake Shore Dr & North Blvd 7764
## 3 Clark St & Elm St 7315
## 4 Clinton St & Washington Blvd 6924
## 5 Clinton St & Madison St 6399
## 6 Wells St & Concord Ln 6279
## 7 Streeter Dr & Grand Ave 5885
## 8 Dearborn St & Erie St 5878
## 9 State St & Chicago Ave 5817
## 10 Wells St & Elm St 5793
## # ℹ 1,304 more rows
Similarly, we analyze the ride distribution based on stations specifically for casual users.
q1_cleaned %>%
  filter(member_casual == 'casual') %>%
  group_by(start_station_name) %>%
  summarise(rides = n()) %>%
  arrange(desc(rides))
## # A tibble: 1,313 × 2
## start_station_name rides
## <chr> <int>
## 1 Streeter Dr & Grand Ave 18279
## 2 DuSable Lake Shore Dr & Monroe St 11260
## 3 DuSable Lake Shore Dr & North Blvd 11070
## 4 Michigan Ave & Oak St 9667
## 5 Shedd Aquarium 7619
## 6 Theater on the Lake 7356
## 7 Millennium Park 6986
## 8 Dusable Harbor 6664
## 9 Adler Planetarium 4625
## 10 Montrose Harbor 4625
## # ℹ 1,303 more rows
A similar process was followed for the remaining quarters. Combining all of them into one dataset, we can now start exploring patterns over the year.
The graph below shows that across the entire year, members made around 2.3 million trips compared to around 1.27 million for casual riders, which is roughly 81% more.
Checking the number of trips across the year, the summer season is favored by casual riders and members alike: usage rises starting in May and stays high through the end of fall.
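The seasonal trend described above could be plotted along these lines; a sketch on toy data, where all_rides is an assumed name for the combined yearly dataset:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy stand-in for the combined yearly data
all_rides <- data.frame(
  start_date    = as.Date(c("2024-06-10", "2024-07-04", "2024-07-20", "2025-01-15")),
  member_casual = c("casual", "member", "casual", "member")
)

monthly <- all_rides %>%
  mutate(month = month(start_date, label = TRUE)) %>%
  count(month, member_casual, name = "rides")

# Dodged bars: one pair of columns per month, split by rider type
p <- ggplot(monthly, aes(x = month, y = rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Rides per month by rider type", x = "Month", y = "Rides")
```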
The ride counts show that members ride far more on weekdays, while casual ridership increases on weekends. This is likely because members use the bikes for their daily commute to work, whereas casual riders use them for recreation.
This notion is further supported by the average ride lengths: duration stays nearly constant for members across the week, while casual riders ride for noticeably longer periods.
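The duration comparison can be reproduced with the same hms() parsing used in the quarterly analysis; a sketch on toy data:

```r
library(dplyr)
library(lubridate)

rides <- data.frame(
  day_of_week   = c("Monday", "Monday", "Saturday", "Saturday"),
  member_casual = c("member", "casual", "member", "casual"),
  ride_duration = c("00:10:00", "00:15:00", "00:11:00", "00:25:00")
)

# Average ride length in minutes per weekday and rider type
avg_by_day <- rides %>%
  mutate(ride_duration_secs = as.numeric(hms(ride_duration))) %>%
  group_by(day_of_week, member_casual) %>%
  summarise(avg_mins = mean(ride_duration_secs) / 60, .groups = "drop")
```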
During this visualization I came across a third bike category, the electric scooter. Since this category was present in only one of the months, I treated it as a data anomaly and cleaned it up. This showed that cleaning is an iterative process: visualization can surface issues that were missed in the first pass.
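The stray category can be dropped with a simple filter; a sketch on toy rows (in the real analysis this would be applied to the combined dataset):

```r
library(dplyr)

rides <- data.frame(
  rideable_type = c("classic_bike", "electric_bike", "electric_scooter"),
  member_casual = c("member", "casual", "casual")
)

# Keep only the two expected bike types
rides_clean <- rides %>%
  filter(rideable_type %in% c("classic_bike", "electric_bike"))
```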
Classic bikes are clearly preferred among riders. This may be because they cost less per ride than their electric counterparts.
The plot above shows the top 10 stations among all riders. Splitting it further by member type reveals an interesting finding.
The top stations for casual riders lie along the coastline near tourist spots, whereas members' top stations are within the city, likely near residential and business areas.
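The per-group station ranking behind this comparison can be computed with a grouped slice_max; a sketch on toy data:

```r
library(dplyr)

rides <- data.frame(
  start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Kingsbury St & Kinzie St", "Clark St & Elm St"),
  member_casual = c("casual", "casual", "member", "member")
)

# Top stations within each rider group (n = 10 in the real analysis)
top_stations <- rides %>%
  count(member_casual, start_station_name, name = "rides") %>%
  group_by(member_casual) %>%
  slice_max(rides, n = 10) %>%
  ungroup()
```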
To convert casual riders into members, some steps that can be taken:
Targeted promotional campaigns aimed at frequent casual riders, especially around weekends and the summer months.
Seasonal or short-term membership plans that match casual riders' warm-weather usage.
Messaging that highlights the cost savings of an annual membership over repeated single-ride and day passes.
The analysis and visualization of a year’s Cyclistic ride data revealed distinct patterns between casual and member riders. Members use the service primarily for commuting, while casual riders tend to engage in longer, recreational rides. By implementing targeted promotional campaigns, seasonal plans, and highlighting cost savings, Cyclistic can effectively convert casual riders into loyal members.
Gained practical experience in cleaning, transforming, and analyzing real-world datasets using R.
Learned how to create new time-based variables (e.g., ride duration, start hours, days of the week) to better understand user behavior.
Improved my ability to perform exploratory data analysis (EDA) to extract actionable insights.
Enhanced skills in translating raw data into meaningful business strategies, such as creating targeted membership offers.
Strengthened my proficiency in using tidyverse, lubridate, and ggplot2 for data manipulation, summarization, and visualization.