Data Science

Big Data Analysis of NYC Taxi Trip Records

Project Overview

This project analyzed large-scale New York City Taxi & Limousine Commission (TLC) trip data using big data technologies. The goal was to process millions of taxi trip records to uncover patterns in ride demand, trip behavior, and revenue, applying distributed computing techniques to handle the data volume.

The project demonstrates my ability to work with real-world “big data” using cloud-based analytics platforms and modern data science tools.

Skills Demonstrated

This project highlights my abilities in:

Big data processing
Distributed computing with Spark
Data cleaning at scale
PySpark and SQL analytics
Exploratory data analysis
Cloud-based data science workflows

Data and Methodology

The analysis was performed on publicly available NYC TLC trip records, which contain detailed information about:

Pickup and drop-off locations
Trip distance and duration
Fares and tips
Time and date of travel
Ride types and vehicle categories

Because the dataset is too large for traditional single-machine tools to handle, the project used:

Apache Spark and Databricks for distributed data processing
PySpark SQL for querying and transforming data
Python for additional analysis and visualization
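
As a rough illustration of this workflow, the sketch below shows how a cleaning filter and an hourly demand summary might be expressed in PySpark SQL. It is not the project's actual code. The column names (tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, fare_amount, total_amount) are assumed from the TLC yellow taxi schema, the Parquet input format is an assumption, and the function name, view name, and path argument are hypothetical.

```python
# Minimal sketch: clean trip records and summarize demand by pickup hour.
# Assumption: columns follow the TLC yellow taxi schema; other trip types
# (green, for-hire) use different column names.
HOURLY_DEMAND_SQL = """
SELECT
    hour(tpep_pickup_datetime)   AS pickup_hour,
    COUNT(*)                     AS trips,
    ROUND(AVG(trip_distance), 2) AS avg_distance,
    ROUND(SUM(total_amount), 2)  AS revenue
FROM trips
WHERE trip_distance > 0
  AND fare_amount > 0
  AND tpep_dropoff_datetime > tpep_pickup_datetime
GROUP BY hour(tpep_pickup_datetime)
ORDER BY pickup_hour
"""


def hourly_demand(spark, path):
    """Load trip records, expose them to SQL, and aggregate by pickup hour.

    spark: an existing SparkSession.
    path:  hypothetical location of the trip files (assumed to be Parquet).
    """
    trips = spark.read.parquet(path)
    # Register the DataFrame under the (hypothetical) view name used in the query.
    trips.createOrReplaceTempView("trips")
    # The WHERE clause drops rows with non-positive distance or fare and
    # trips whose drop-off time is not after the pickup time.
    return spark.sql(HOURLY_DEMAND_SQL)
```

In a Databricks notebook a SparkSession named spark is already defined, so calling hourly_demand(spark, path).show() would print one row per pickup hour. Keeping the filter and the aggregation in a single SQL statement lets Spark plan both steps together across the cluster.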

Key Findings

Through this analysis, several important insights were uncovered:

Peak ride demand occurs during specific hours and days
Trip distances and fares vary significantly by location
Revenue patterns reflect commuting and tourism behavior
Seasonal trends impact overall taxi usage

These findings illustrate how large transportation datasets can be used to support urban planning, business strategy, and operational decision-making.

Project Links

Jupyter Notebook & Code: GitHub
