User health analysis

Role: Data Analyst

Tools: R (programming language)

Timeline: 2 weeks

Project start: August 2024

Introduction

This documentation is part of the final project for the Google Data Analytics Course and focuses on the Bellabeat case study. The project serves as a demonstration of the key concepts and skills I have learned throughout the course, particularly the six fundamental steps of data analysis: Ask, Prepare, Process, Analyze, Share, and Act.

These steps will guide the structure and content of this project, providing a comprehensive overview of the analytical approach taken to address the case study.

Scenario

I am a junior data analyst at Bellabeat, a company that makes health-focused smart devices for women. Bellabeat has growth potential in the global smart device market. Urška Sršen, the cofounder, believes that analyzing smart fitness device data can help the company grow. My task is to analyze this data, gain insights into user behavior, and use these findings to guide the company’s marketing strategy. Once I reach conclusions, I will present my analysis and recommendations to the executive team.

The company: Bellabeat

Bellabeat, founded by Urška Sršen and Sando Mur in 2013, is a health-focused tech company for women, offering smart products that track activity, sleep, stress, and reproductive health. The company has grown rapidly, expanding globally and selling through online retailers and their website. Bellabeat heavily invests in digital marketing, including Google Search, social media, and video ads. Sršen believes analyzing smart device usage data will uncover new growth opportunities, and she has tasked the marketing analytics team to analyze this data and provide recommendations to inform their marketing strategy.

Products

Bellabeat offers a range of smart wellness products for women, all connected through the Bellabeat app, which tracks health data like activity, sleep, stress, and mindfulness.

Leaf: A wearable wellness tracker (bracelet, necklace, or clip) that monitors activity, sleep, and stress.

Time: A wellness watch combining a classic timepiece design with smart tracking for activity, sleep, and stress.

Spring: A smart water bottle that tracks daily hydration and syncs with the app to monitor water intake.

Membership Program: A subscription service offering personalized 24/7 guidance on nutrition, activity, sleep, beauty, and mindfulness based on individual goals.

These products work together to provide users with a holistic view of their health and wellness.

1: Ask Phase

Business Task

The objective of this analysis is to identify key trends in the usage of non-Bellabeat smart devices and apply these insights to a specific Bellabeat product. The insights will help guide Bellabeat’s marketing strategy and highlight how current smart device trends can influence customer behavior and product use. The product selected for applying these insights will be the Bellabeat app, given its central role in connecting all Bellabeat devices.

Stakeholders

The key stakeholders for this analysis include:

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer, who is interested in data insights to guide the company’s growth and product development strategies.

Sando Mur: Bellabeat’s cofounder and key member of the executive team, who focuses on the company’s long-term strategy and alignment with market trends.

Bellabeat marketing analytics team: The team responsible for leveraging data to guide marketing efforts, increase user engagement, and align product features with consumer needs.

Audience for the Analysis

The audience for this analysis is primarily the Bellabeat executive team, including Sršen, Mur, and other decision-makers within the company. Given this, the analysis must be:

Concise and actionable: Focused on high-level trends and insights that the executives can act on.

Data-driven, with clear recommendations: Presenting key findings with strong visualizations and supporting data, while also providing clear recommendations for marketing strategy and product improvements.

Strategic: Addressing both immediate marketing opportunities and long-term product development possibilities.

How will this data help stakeholders make decisions?

1. Product Development: Insights into the most-used features (e.g., calorie tracking, hydration reminders) will help Sršen and Mur decide which features to improve or promote in future product updates.

2. Marketing Strategy: By identifying trends in smart device usage, the marketing team can adjust campaigns to emphasize popular features (e.g., stress monitoring, hydration tracking) and highlight Bellabeat’s holistic health solutions.

2: Prepare Phase

Data Integrity

To ensure the integrity of the data used for analysis, I’ve selected the Fitbit Fitness Tracker Data, which is a publicly available dataset on Kaggle provided through Mobius. This dataset contains personal fitness tracker information from thirty consenting Fitbit users, including minute-level data on physical activity, heart rate, and sleep monitoring.

Data Bias

While this dataset offers valuable insights into users’ daily habits, it’s important to acknowledge its limitations. For instance, the small sample size (30 users) may not represent the broader population of smart device users, potentially skewing the findings. Additionally, since the data is self-reported and collected from a specific group of users, there may be biases in the data collection process.

Data organization

For this project I will be using 8 CSV files; together they store data for 8 main variables:

  1. Daily activity

  2. Daily sleep

  3. Daily steps

  4. Hourly steps

  5. Hourly calories

  6. Hourly intensities

  7. Weight

  8. Heart rate

3: Process Phase

My toolkit

I’m using R for my analysis because it’s great at handling large datasets, like the multiple CSV files I have for this project. It can easily load, combine, and manage all the data without issues. With packages like dplyr and tidyr, cleaning and organizing the data is straightforward, and ggplot2 makes it easy to create clear visualizations, so I can quickly turn large amounts of data into insights that stakeholders can understand.

File storage

First things first, I will load the different CSV files into my RStudio environment with the following code:

daily_activity <- read.csv("dailyActivity_merged.csv")
daily_sleep <- read.csv("sleepDay_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
hourly_steps <- read.csv("hourlySteps_merged.csv")
hourly_calories <- read.csv("hourlyCalories_merged.csv")
hourly_intensities <- read.csv("hourlyIntensities_merged.csv")
weight <- read.csv("weightLogInfo_merged.csv")
heart_rate <- read.csv("heartrate_seconds_merged.csv")

I have stored the files in variables with simpler names to make them easier to use later.

After running the code we can see the data in the environment panel:

Data preview

After loading all the files I needed for the investigation, I started digging into the data. To do that, I began by looking at the columns each of these tables had, using the colnames() function; for this example I chose the daily activity table:

colnames(daily_activity)

Output:

This gives me a quick idea of the kind of data I will be working with. From the 15 total columns, “Calories” and “Total Steps” looked like useful data at first sight.

Next, to go deeper into the data, I have used the View function to see the type of values that were stored in the cells of the tables:

View(daily_activity)
View(daily_sleep)
View(daily_steps)
View(hourly_steps)
View(hourly_calories)
View(hourly_intensities)
View(weight)
View(heart_rate)

After running the code R creates a different tab to visualize the data more clearly and with more detail:

We can see data types such as integers, dates, and floats.
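For a lighter-weight check of the column types without opening a separate tab, base R’s str() prints each column’s type and a few sample values; a minimal sketch:

# Print each column's type (int, chr, num, ...) and a preview of its values
str(daily_activity)
str(weight)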

Packages

Before analyzing the data more deeply, I wanted to make sure I had the necessary tools to work with; those tools are the packages I installed below:

install.packages("tidyverse")
install.packages("here")
install.packages("skimr")
install.packages("janitor")
install.packages("lubridate")
library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(lubridate)

These are some of the packages that I have worked with in the course, their functions are the following:

tidyverse: A collection of essential packages (like dplyr, ggplot2, tidyr, etc.) designed for data manipulation, visualization, and general workflow in R. It simplifies data handling and makes working with data more intuitive.

here: Helps manage file paths in a more reliable way, making it easier to organize and access project files, especially when working with multiple CSVs.

skimr: Provides a fast, clear summary of data, offering key statistics at a glance, making it easier to understand the structure of your datasets.

janitor: Useful for cleaning messy data, particularly for tasks like cleaning column names, removing duplicates, and handling common data issues.

lubridate: Simplifies working with dates and times, making it easier to parse, manipulate, and analyze date-time data across your files.

Cleaning process

I’ll be doing a cleaning process because before I can dive into analyzing the data, I need to make sure everything’s organized and consistent. This means dealing with messy column names, missing values, duplicates or any weird formatting that might throw things off. By cleaning the data first, I’ll be able to work with it smoothly and get more accurate results when I start analyzing and visualizing it.

Duplicates

I chose the janitor package to start the cleaning process. I will use pipes and the get_dupes() function to find duplicate rows and then visualize them.

daily_activity %>% get_dupes() %>% View()


I did this process with every variable and ended up finding 6 duplicates in the Daily Sleep table:
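For reference, the same duplicate check can be run over every table in one pass. This is just a sketch, assuming the eight data frames loaded earlier (it can be slow on the larger heart_rate table):

# Count the rows flagged as duplicates by get_dupes() in each table
tables <- list(
  daily_activity = daily_activity, daily_sleep = daily_sleep,
  daily_steps = daily_steps, hourly_steps = hourly_steps,
  hourly_calories = hourly_calories, hourly_intensities = hourly_intensities,
  weight = weight, heart_rate = heart_rate
)
sapply(tables, function(df) nrow(get_dupes(df)))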

The next step would be to update the variable without the duplicates on it.

I was not going to mention the mistake I made during this step, but I think it is essential to show that making mistakes is part of the process.

My first try looked like this:

daily_sleep <- daily_sleep %>% distinct() %>% View()


What’s wrong? Well, View() is just meant to display the data, not to modify it. So, when I assign the result of the whole line back to daily_sleep, it ends up taking the return value of View(), which is NULL. Essentially, I was saving a NULL value into daily_sleep and wiping out all the data.

To fix it, I stored the distinct values first and then called View() separately, like this:

daily_sleep <- daily_sleep %>% distinct()
View(daily_sleep)

The output:

As you can see, the entries (rows) went down from 414 to 410, removing all the duplicates from the daily_sleep variable.

Missing values (NA)

For missing values I am going to use the skimr package, making use of the skim() function:

skim(daily_activity)
skim(daily_sleep)
skim(daily_steps)
skim(hourly_steps)
skim(hourly_calories)
skim(hourly_intensities)
skim(weight)
skim(heart_rate)

This function will output a summary of the data in the variable, including the missing values, which is what I am looking for.

After running the code, I see that from all variables, only 1 variable contains missing values, the weight variable, which has 65 missing values in the “Fat” column.
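As a quick cross-check that doesn’t rely on skimr, the missing values per column can also be counted with base R; a minimal sketch:

# Number of NA values in each column of the weight table
colSums(is.na(weight))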

Next, I will remove the rows with missing values and update the weight variable using the drop_na() function:

weight <- weight %>% drop_na()

After I ran the code, I used skim() again to make sure the missing values were gone, and they were:

Trimming

I thought that trimming possible white spaces with the str_trim() function might be a good idea, but there are no string columns that need it in this dataset, so the step is unnecessary.
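For completeness, if trimming were needed, it could be applied to every character column in one call; a sketch only, since the step is skipped here:

library(stringr) # str_trim() lives in stringr, which is loaded with the tidyverse

# Trim leading and trailing whitespace in all character columns of a table
daily_activity <- daily_activity %>%
  mutate(across(where(is.character), str_trim))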

Consistency

The next thing I will do is to use the Janitor package to clean the names of the columns to ensure consistency throughout all my dataset.

I have used the clean_names() function for this; it converts all column names to lowercase snake_case.

Columns before:

daily_activity <- daily_activity %>% clean_names()
daily_sleep <- daily_sleep %>% clean_names()
daily_steps <- daily_steps %>% clean_names()
hourly_steps <- hourly_steps %>% clean_names()
hourly_calories <- hourly_calories %>% clean_names()
hourly_intensities <- hourly_intensities %>% clean_names()
weight <- weight %>% clean_names()
heart_rate <- heart_rate %>% clean_names()

Columns now:

Merging

For future analysis I need to merge the Daily Activity table with the Daily Sleep table, to investigate variables like time slept and number of steps taken.

I use the rename() function to make the date column names consistent, renaming both to “date”. Then I use mutate() to convert the dates to a common format so the two tables are compatible:

# Clean the daily_activity date (after clean_names, the column is activity_date)
daily_activity <- daily_activity %>%
  rename(date = activity_date) %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y"))

# Clean the daily_sleep date (after clean_names, the column is sleep_day)
daily_sleep <- daily_sleep %>%
  rename(date = sleep_day) %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y %I:%M:%S %p"))

Next, I use the columns “id” and “date” to merge them, then I store the merge in a variable called “daily_activity_sleep”.

# Merge the two datasets by date
daily_activity_sleep <- merge(daily_activity, daily_sleep, by = c("id", "date"), all = TRUE)
# View the merged dataset
View(daily_activity_sleep)

4 & 5: Analyze Phase & Share Phase

In the analyze and share phase, I’m going to check out the merged dataset to see if I can spot any patterns and trends to tell a story. I’ll be looking at how different factors connect and any changes over time. I’ll use some visuals to make the findings clearer, helping me figure out what it all means to share recommendations for future strategic decisions.

Dates

I need to confirm the range of the sample dates, so I use the range() function:

date_range <- range(daily_activity_sleep$date)
print(date_range)

Output:

This dataset covers a range of only one month and is eight years old; such a small, outdated sample could potentially lead to inaccurate conclusions.

Users

First, I will pull together some basic information about the users in the dataset.

Quantity of users:

unique_users <- daily_activity_sleep %>%
 distinct(id) %>%
 nrow()
print(unique_users)

Output:

There are 33 users in total, which could be a hint that the quantity of data might not be enough to make accurate conclusions.

Average total steps:

average_steps <- mean(daily_activity_sleep$total_steps, na.rm = TRUE)
print(average_steps)

Output:

Average sleep time (in minutes):

average_sleep <- mean(daily_activity_sleep$total_minutes_asleep, na.rm = TRUE)
print(average_sleep)

Output:

Users Summary

  • There are only 33 users in the total dataset.

  • Users take an average of 7637 steps per day.

  • Users sleep an average of 419 minutes per day, which is approximately 7 hours per day.
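For reference, the same three figures can be reproduced in a single summarise() call on the merged table; a minimal sketch:

daily_activity_sleep %>%
  summarise(
    users = n_distinct(id),
    avg_steps = mean(total_steps, na.rm = TRUE),
    avg_sleep_minutes = mean(total_minutes_asleep, na.rm = TRUE)
  )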

User Behavior

Total steps & Calories burnt

I am going to start plotting using ggplot() from the ggplot2 package to get a better understanding of what the collected data means:

ggplot(daily_activity_sleep, aes(x = total_steps , y = calories)) +
 geom_point(color = "steelblue", alpha = 0.6) + # Scatter points with some transparency
 geom_smooth(method = "lm", col = "red", se = FALSE) + # Add a trend line using linear regression
 labs(title = "Correlation between Total Steps and Calories burnt",
 x = "Total Steps",
 y = "Calories") +
 theme_minimal()

Output:

There is a positive correlation of 0.59 between Total Steps and Calories burnt.
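A coefficient like this can be computed directly with cor(); a sketch using only complete pairs, since the outer merge can introduce rows with missing values:

# Pearson correlation between daily steps and calories, ignoring incomplete pairs
cor(daily_activity_sleep$total_steps,
    daily_activity_sleep$calories,
    use = "complete.obs")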

Most productive days of the week

The first thing I have to do is use the wday() function from the lubridate package inside mutate() to convert each date into a weekday and add it as a new column in the dataset:

daily_activity_sleep_wdays <- daily_activity_sleep %>%
 mutate(day_of_week = wday(date, label = TRUE, abbr = TRUE))

Output:

With that column, I can now visualize data through a weekday point of view.

Which Day of the Week Do Users Walk the Most?

I chose to use a bar chart because it clearly shows the data for each day of the week, making it easy to identify trends and patterns.

ggplot(daily_activity_sleep_wdays, aes(x = day_of_week, y = total_steps, fill = day_of_week)) +
  stat_summary(fun = "mean", geom = "bar") + # Plot the average steps per day
  labs(title = "Which Day of the Week Do Users Walk the Most?",
       x = "Day of the Week",
       y = "Average Total Steps",
       fill = "Day of the week") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") # Use a color palette to differentiate days

Output:

We can see that Sunday is the day users walk the least, with fewer than 7,000 steps on average, while Tuesday and Saturday are the days people walk the most, averaging above 8,000 steps per day.

Considering that Sharp HealthCare recommends 7,000 daily steps, I will add a line to the graph to show which days fall below that threshold.
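One way to add that reference line is a geom_hline() on top of the same bar chart; a sketch, reusing the weekday data frame built above:

ggplot(daily_activity_sleep_wdays, aes(x = day_of_week, y = total_steps, fill = day_of_week)) +
  stat_summary(fun = "mean", geom = "bar") +
  geom_hline(yintercept = 7000, linetype = "dashed", color = "red") + # 7,000-step recommendation
  labs(title = "Average Steps per Weekday vs. the 7,000-Step Recommendation",
       x = "Day of the Week",
       y = "Average Total Steps",
       fill = "Day of the week") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")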

It seems like Sunday is below the recommended steps per day.

Most productive hours of the day

It is important to know the hours where users make the most steps because it lets me know their routine, their walking or running habits, which gives me useful information about users.

Additionally, knowing peak activity times allows app developers to tailor notifications and reminders for workouts or walking breaks when users are more likely to engage. It also provides insights into how lifestyle factors, like work schedules or social habits, influence physical activity. Ultimately, this knowledge can lead to better health outcomes and a more engaging user experience in fitness apps.

First, I want to clean up those messy timestamps. Instead of seeing “4/12/2016 12:00:00 AM”, I just want to see “12:00:00”. This will make it way easier for me to compare activity across different hours of the day.

hourly_steps <- hourly_steps %>%
  mutate(
    # Convert activity_hour (the cleaned column name) to datetime
    activity_hour = mdy_hms(activity_hour),
    # Extract just the hour in 24-hour format
    hour_of_day = format(activity_hour, format = "%H:%M:%S")
  )

Then I’ll calculate the average steps for each hour — this will show me exactly when users are most active throughout the day.

hourly_averages <- hourly_steps %>%
  group_by(hour_of_day) %>%
  summarise(
    avg_steps = mean(step_total, na.rm = TRUE),
    n_observations = n()
  ) %>%
  arrange(hour_of_day)

Finally, I’ll make a bar chart to visualize it, so I can quickly spot peak activity times:

ggplot(hourly_averages, aes(x = hour_of_day, y = avg_steps, fill = avg_steps)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "red", high = "steelblue") +
  theme_minimal() +
  labs(
    title = "Average Steps by Hour of Day",
    x = "Hour of Day",
    y = "Average Steps",
    caption = paste("Based on", nrow(hourly_steps), "observations"),
    fill = "Avg Steps"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )

Output:

Looking at the graph I can say it paints a pretty clear picture of when people are most active during the day. There’s hardly any movement in those early morning hours when most people are sleeping — we’re talking from midnight until about 5 AM. Then things start picking up around 6 AM as people wake up and start their day.

The really interesting part is that peak activity happens between 5–6 PM, hitting about 600 steps on average. This makes sense since it’s probably capturing both people heading home from work and maybe squeezing in some evening exercise. The whole afternoon period, from noon to 7 PM, stays pretty active overall.

You can see a nice little bump around lunchtime too, which is likely people moving around during their lunch breaks. As the evening rolls in after 7 PM, people start winding down, and the steps gradually decrease until we’re back to those quiet nighttime hours.

With over 22,000 observations backing this up, it’s giving us a really good look at typical daily movement patterns. Pretty much exactly what you’d expect from most people’s daily routine.
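The peak hours can also be pulled out of the table directly instead of being read off the chart; a small sketch using slice_max() from dplyr:

# Top three hours of the day by average step count
hourly_averages %>%
  slice_max(avg_steps, n = 3)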

Sleeping patterns

It is important to analyze the sleeping patterns of the users, as sleep has an impact on overall mental and physical health. I will create a histogram to see the distribution:

hist(daily_activity_sleep$total_minutes_asleep / 60,
 col = "lightgreen",
 main = "Distribution of Sleep Duration (Hours)",
 xlab = "Total Hours Asleep",
 ylab = "Frequency",
 breaks = 15)

Output:

The histogram of total hours asleep shows that most users tend to sleep between 6 and 9 hours, with a peak around 7 hours. The distribution appears to be slightly skewed to the left, indicating that some users are getting less sleep than recommended. Notably, a few outliers sleep significantly less than 2 hours, which may indicate potential health issues. Overall, understanding these sleep patterns can help in promoting healthier habits among users, as sleep quality is crucial for recovery and overall fitness performance.
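To put a number on those short sleepers, the share of recorded nights under six hours can be computed from the same column; a minimal sketch:

daily_activity_sleep %>%
  filter(!is.na(total_minutes_asleep)) %>%
  summarise(
    nights = n(),
    nights_under_6h = sum(total_minutes_asleep < 360),
    share_under_6h = mean(total_minutes_asleep < 360)
  )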


6: Act Phase

Conclusions & Recommendations

Based on the data analysis, it’s clear that there are specific moments where Bellabeat can engage with users more effectively.

Events & Challenges in the app

The average number of steps on Sundays is consistently below the recommended level, making Sundays a perfect target for Bellabeat’s notifications. These notifications could promote special challenges or campaigns to encourage more activity on weekends, bringing users closer to the brand and strengthening the relationship.

Optimal notifications

Additionally, workout reminders should be strategically timed for maximum impact. There’s a slight peak in activity just after lunch (12–14h), making this an ideal time to push notifications for quick workouts or stretching exercises. The highest peak of activity occurs between 17–19h, so a final reminder in the late afternoon would capitalize on users’ natural movement patterns.

Improving users’ well-being

Lastly, promoting healthier habits, especially for those users who sleep less than six hours, could be an important focus. By encouraging better sleep hygiene and educating users about the benefits of adequate rest, Bellabeat can help improve users’ overall well-being, further aligning the product with a holistic health approach.

Next Steps

There are some recommendations I have for future data collection:

1. Increase the User Base Sample

To get more accurate and reliable insights, we need to expand our dataset. The current sample size of 33 users is too small to represent a broader audience. Collecting data from a larger user base, like 5000 users, will give us a more comprehensive understanding of behavior trends and allow us to draw stronger conclusions.

2. Gather Information on Sports and Activities

Collecting detailed data on the types of sports or exercises users are engaging in would enable us to provide more personalized exercise recommendations. This way, Bellabeat could suggest workouts based on what users are already doing and track the effectiveness of these exercises, further enhancing the user experience.

3. Use More Recent Data

The current data from 2016 is outdated. To stay relevant and provide actionable insights, we need to collect data that reflects users’ current habits. Using more recent data will help us align our strategies with today’s market conditions and trends, allowing Bellabeat to stay competitive and responsive to user needs.

4. Extend the Data Collection Period

Right now, the data only spans a single month, which limits our understanding of long-term user behaviors. Expanding the sample to cover several months, or even an entire year, would allow us to analyze trends across different seasons. Seasonal changes can greatly affect user activity, and understanding these patterns would help us tailor our marketing and product strategies more effectively.

That is all, thank you very much for reading.

If you want to execute the code yourself, you can do so through this Kaggle link