What are pandas and NumPy?
NumPy is the numerical array library that most of the Python ML stack is built on. pandas sits on top of it for tabular data: loading CSVs, cleaning columns, grouping, and joining. Together they are how you wrangle raw data into the clean arrays a model can learn from.
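As a rough illustration of how the two libraries fit together, the sketch below builds a small DataFrame and hands its numeric columns to NumPy as plain arrays. The table and its column names (`height_cm`, `weight_kg`, `label`) are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the column names are illustrative only.
df = pd.DataFrame(
    {
        "height_cm": [170.0, 182.5, 165.0],
        "weight_kg": [65.0, 80.0, 58.5],
        "label": [0, 1, 0],
    }
)

# Select the feature columns in pandas, then convert to NumPy arrays.
X = df[["height_cm", "weight_kg"]].to_numpy()  # 2-D array, one row per record
y = df["label"].to_numpy()                     # 1-D array of targets

# From here on, ordinary NumPy operations apply.
column_means = X.mean(axis=0)
print(X.shape, y.shape)
print(column_means)
```

pandas does the labelled, tabular work, and `to_numpy()` marks the point where the data becomes the unlabelled arrays that numerical code typically expects.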
Why it matters
Real ML projects spend most of their time on data, not models. The unglamorous work of loading, cleaning, and reshaping is where projects succeed or fail. Working fluently in pandas and NumPy is the day-to-day reality of the job, far more than designing novel architectures.
What to learn
- NumPy arrays and vectorized operations
- pandas Series and DataFrames
- Reading and writing CSV, JSON, and Parquet
- Selecting, filtering, and transforming columns
- Handling missing data
- Grouping and aggregating
- Merging and joining datasets
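The topics above can be sketched in one short pass. This is a minimal tour under stated assumptions, not a recommended pipeline: the `orders` and `customers` tables and all their column names are made up, an in-memory string stands in for a CSV file on disk, and filling missing amounts with `0.0` is just one possible policy (dropping rows or imputing a median are others).

```python
import io

import numpy as np
import pandas as pd

# NumPy: arithmetic applies to the whole array at once.
prices = np.array([10.0, 20.0, 30.0])
with_tax = prices * 1.2

# Reading CSV: an in-memory string stands in for a real file path.
csv_text = """order_id,customer_id,amount
1,a,10.0
2,b,
3,a,30.0
4,c,5.0
"""
orders = pd.read_csv(io.StringIO(csv_text))

# Handling missing data: count the gaps, then fill them (assumed policy).
n_missing = int(orders["amount"].isna().sum())
orders["amount"] = orders["amount"].fillna(0.0)

# Transforming: derive a new column from an existing one, no loop needed.
orders["amount_with_tax"] = orders["amount"] * 1.2

# Selecting and filtering: a boolean mask keeps only matching rows.
large = orders[orders["amount"] >= 10.0]

# Grouping and aggregating: total amount per customer.
totals = orders.groupby("customer_id")["amount"].sum()

# Merging: a left join keeps every order, even without a matching customer.
customers = pd.DataFrame(
    {"customer_id": ["a", "b"], "region": ["north", "south"]}
)
joined = orders.merge(customers, on="customer_id", how="left")

# Writing CSV: a StringIO buffer stands in for an output file.
out = io.StringIO()
joined.to_csv(out, index=False)

print(totals)
print(joined)
```

JSON and Parquet follow the same read/write pattern through `pd.read_json` / `DataFrame.to_json` and `pd.read_parquet` / `DataFrame.to_parquet`. The Parquet functions need an extra engine such as `pyarrow` installed.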
Common pitfall
Writing Python loops over rows of a DataFrame instead of using vectorized operations. Row-by-row loops are often orders of magnitude slower on large tables, and they are harder to read. pandas and NumPy are built to operate on whole columns and arrays at once, so reach for vectorized methods and treat an explicit row loop as a sign you probably missed a better way.
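To make the contrast concrete, here is a small sketch that computes the same column twice, once with a row loop and once with a vectorized expression, and times both. The data is synthetic and the column names (`price`, `quantity`) and function names are hypothetical. Actual timings depend on the machine and the pandas version, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data; a seeded generator keeps the example reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "price": rng.uniform(1.0, 100.0, size=10_000),
        "quantity": rng.integers(1, 10, size=10_000),
    }
)


def revenue_loop(frame: pd.DataFrame) -> pd.Series:
    """The pitfall: visit every row in Python and build the result by hand."""
    values = []
    for _, row in frame.iterrows():
        values.append(row["price"] * row["quantity"])
    return pd.Series(values, index=frame.index)


def revenue_vectorized(frame: pd.DataFrame) -> pd.Series:
    """The alternative: one expression over whole columns."""
    return frame["price"] * frame["quantity"]


start = time.perf_counter()
slow = revenue_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = revenue_vectorized(df)
vectorized_seconds = time.perf_counter() - start

# Both approaches produce the same numbers; only the cost differs.
print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")

# Conditional logic can be vectorized too, here with np.where.
df["order_size"] = np.where(df["quantity"] >= 5, "bulk", "single")
```

An `if`/`else` inside a row loop is a common reason people reach for `iterrows`. `np.where` (or boolean masks) usually covers that case without leaving column-at-a-time code.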
Resources
Primary (free):
- pandas — Getting started · docs
- NumPy — Absolute beginners · docs
- Kaggle — Pandas course · course
Practice
Load a real CSV dataset into pandas, handle its missing values, create a new column from existing ones without a loop, and compute a grouped summary. Join it with a second small table. Done when every transformation is vectorized rather than a row-by-row loop.
Outcomes
- Load and save data across common formats.
- Clean missing data and transform columns.
- Group, aggregate, and join datasets.
- Prefer vectorized operations over row loops.