Big Data Analysis of NYC Taxi Trip Records
This project analyzes large-scale New York City Taxi & Limousine Commission (TLC) trip data with distributed big data tools. The goal was to process millions of taxi trip records and uncover patterns in ride demand, trip behavior, and revenue trends.
The project demonstrates my ability to work with real-world big data on cloud-based analytics platforms, and highlights my skills in:
Big data processing
Distributed computing with Spark
Data cleaning at scale
PySpark and SQL analytics
Exploratory data analysis
Cloud-based data science workflows
The analysis was performed on publicly available NYC TLC trip records, which contain detailed information about:
Pickup and drop-off locations
Trip distance and duration
Fares and tips
Time and date of travel
Ride types and vehicle categories
Because the dataset is too large for traditional single-machine tools to handle efficiently, the project used:
Apache Spark and Databricks for distributed data processing
PySpark SQL for querying and transforming data
Python for additional analysis and visualization
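To give a feel for the kind of SQL involved, here is a minimal sketch of a clean-then-aggregate query. It is not the project's actual code: Python's built-in sqlite3 stands in for Spark SQL so the example runs without a cluster, the sample rows are made up, and the column names are borrowed from the public TLC yellow taxi schema on the assumption that yellow taxi records were among those analyzed. The validity rule (positive distance and fare) is also an assumption.

```python
import sqlite3

# Hypothetical sample rows, not real TLC records.
# Columns: pickup time, trip distance (miles), fare, tip.
SAMPLE_TRIPS = [
    ("2023-01-02 08:15:00", 2.1, 12.5, 2.0),
    ("2023-01-02 08:40:00", 0.0, 3.0, 0.0),   # zero distance: treated as invalid
    ("2023-01-02 08:55:00", 5.4, 24.0, 5.0),
    ("2023-01-02 18:05:00", 1.3, 9.0, 1.5),
    ("2023-01-02 18:30:00", 3.0, -4.0, 0.0),  # negative fare: treated as invalid
]


def hourly_summary(rows):
    """Return (pickup_hour, trips, avg_fare) for rows that pass basic cleaning."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trips ("
        "tpep_pickup_datetime TEXT, trip_distance REAL, "
        "fare_amount REAL, tip_amount REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", rows)
    # In Spark SQL the hour would typically come from hour(tpep_pickup_datetime);
    # strftime is the SQLite equivalent used here.
    query = """
        SELECT CAST(strftime('%H', tpep_pickup_datetime) AS INTEGER) AS pickup_hour,
               COUNT(*) AS trips,
               ROUND(AVG(fare_amount), 2) AS avg_fare
        FROM trips
        WHERE trip_distance > 0 AND fare_amount > 0
        GROUP BY pickup_hour
        ORDER BY pickup_hour
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for hour, trips, avg_fare in hourly_summary(SAMPLE_TRIPS):
        print(hour, trips, avg_fare)
```

The same filter and GROUP BY shape carries over to a distributed engine, where the work is split across partitions instead of running in one process.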
The analysis surfaced several patterns:
Peak ride demand occurs during specific hours and days
Trip distances and fares vary significantly by location
Revenue patterns reflect commuting and tourism behavior
Seasonal trends impact overall taxi usage
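The peak-demand finding comes down to counting trips per day-of-week and hour slot. The sketch below shows that logic in plain Python on a handful of made-up timestamps. It illustrates the idea only and is not the project's Spark code, and the function name `peak_slot` is hypothetical. On the full dataset the same grouping would be run as a distributed aggregation.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical pickup timestamps, not real TLC records.
SAMPLE_PICKUPS = [
    "2023-01-02 08:15:00",  # Monday
    "2023-01-02 08:40:00",  # Monday
    "2023-01-02 18:05:00",  # Monday
    "2023-01-07 23:30:00",  # Saturday
    "2023-01-09 08:20:00",  # Monday
]


def peak_slot(pickups):
    """Return ((weekday, hour), trip_count) for the busiest slot."""
    counts = Counter()
    for ts in pickups:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # Bucket each trip by weekday name and hour of day.
        counts[(WEEKDAYS[dt.weekday()], dt.hour)] += 1
    return counts.most_common(1)[0]


if __name__ == "__main__":
    slot, trips = peak_slot(SAMPLE_PICKUPS)
    print(slot, trips)
```

Bucketing by both weekday and hour, rather than by hour alone, keeps weekday commuting peaks separate from weekend late-night demand.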
These findings illustrate how large transportation datasets can be used to support urban planning, business strategy, and operational decision-making.