Foundations · Beginner · 10h

pandas & NumPy.

Loading, cleaning, and reshaping data the ML way.

What are pandas and NumPy?

NumPy is the numerical array library that most of the Python ML stack is built on. pandas sits on top of it for tabular data: loading CSVs, cleaning columns, grouping, and joining. Together they are how you wrangle raw data into the clean arrays a model can learn from.

Why it matters

Real ML projects spend most of their time on data, not models. The unglamorous work of loading, cleaning, and reshaping is where projects succeed or fail. Fluency in pandas and NumPy is the day-to-day reality of the job, far more than designing novel architectures.

What to learn

  • NumPy arrays and vectorized operations
  • pandas Series and DataFrames
  • Reading and writing CSV, JSON, and Parquet
  • Selecting, filtering, and transforming columns
  • Handling missing data
  • Grouping and aggregating
  • Merging and joining datasets
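To make the first few topics concrete, here is a minimal sketch of a vectorized NumPy operation, a CSV loaded into a DataFrame, and a boolean-mask filter. The column names and values are invented for illustration, and the CSV is read from an in-memory string so the snippet runs without any files; in practice you would pass a file path.

```python
import io

import numpy as np
import pandas as pd

# NumPy: arithmetic applies to the whole array at once, with no Python loop.
heights_cm = np.array([150.0, 160.0, 170.0])
heights_m = heights_cm / 100.0

# pandas: a DataFrame is a table of named columns; each column is a Series.
# This CSV text is made-up sample data (note the missing score for "ben").
csv_text = "name,team,score\nana,red,10\nben,blue,\ncy,red,30\n"
df = pd.read_csv(io.StringIO(csv_text))

# Select a column (a Series) and filter rows with a boolean mask.
scores = df["score"]
red_rows = df[df["team"] == "red"]

# Write back out: to_csv with no path returns the CSV as a string.
round_trip = pd.read_csv(io.StringIO(df.to_csv(index=False)))
```

The empty score cell is read as NaN, which is why the score column ends up as floating point rather than integer.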

Common pitfall

Writing Python loops over rows of a DataFrame instead of using vectorized operations. Row-by-row loops are dramatically slower on large tables and harder to read. pandas and NumPy are built to operate on whole columns and arrays at once, so reach for vectorized methods, and treat an explicit row loop as a sign you missed a better way.
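As a small illustration of the difference, the sketch below computes the same column twice: once with a row loop and once with a single column expression. The table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "qty": [3, 2, 5]})

# Row-by-row version: works, but runs a Python-level loop over every row.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Vectorized version: one expression over whole columns.
df["total"] = df["price"] * df["qty"]

# Conditional logic can stay vectorized too, for example with np.where.
df["size"] = np.where(df["qty"] >= 3, "bulk", "single")
```

Both versions give the same totals. The vectorized one is shorter and pushes the loop into compiled code instead of the Python interpreter.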

Resources

Primary (free):

Practice

Load a real CSV dataset into pandas, handle its missing values, create a new column from existing ones without a loop, and compute a grouped summary. Join it with a second small table. Done when every transformation is vectorized rather than a row-by-row loop.
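One possible shape for this exercise is sketched below, using a tiny invented orders table in place of a real CSV. The column names, the fill strategy (median price, quantity of 1), and the customers table are all assumptions chosen for illustration; the right way to handle missing values depends on your data.

```python
import io

import pandas as pd

# Hypothetical data standing in for a real CSV file on disk.
orders_csv = (
    "order_id,customer_id,price,qty\n"
    "1,a,10.0,2\n"
    "2,b,,1\n"
    "3,a,5.0,4\n"
    "4,c,8.0,\n"
)
orders = pd.read_csv(io.StringIO(orders_csv))

# Handle missing values (assumed strategy: median price, default qty of 1).
orders["price"] = orders["price"].fillna(orders["price"].median())
orders["qty"] = orders["qty"].fillna(1).astype(int)

# New column from existing ones, computed on whole columns at once.
orders["total"] = orders["price"] * orders["qty"]

# Grouped summary: order count and revenue per customer.
summary = orders.groupby("customer_id", as_index=False).agg(
    n_orders=("order_id", "count"),
    revenue=("total", "sum"),
)

# Join with a second small table (also made up).
customers = pd.DataFrame(
    {"customer_id": ["a", "b", "c"], "region": ["north", "south", "north"]}
)
report = summary.merge(customers, on="customer_id", how="left")
```

Every step here works on whole columns, so there is no explicit row loop anywhere in the pipeline.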

Outcomes

  • Load and save data across common formats.
  • Clean missing data and transform columns.
  • Group, aggregate, and join datasets.
  • Prefer vectorized operations over row loops.