Python & R Resources

For anyone dipping their toes into the world of code-first data analysis, the best thing you could do for yourself is to learn SQL. This is the most widely used tool in the corporate world for interacting with system data stored in enterprise data warehouses. If you master SQL, you’ve mastered most of the transformations that are possible for tabular, rectangular data. That said, I will not cover SQL resources here as I rarely write raw SQL anymore - It’s just not needed when Python and R can establish data warehouse connections easily and it allows me to stay within the world of Python or R for end-to-end data analysis including data collection, manipulation, machine learning, GenAI, etc.

Python and R are open-source programming languages for data analysis, applied statistical programming, machine learning, working with REST APIs, developing ultra-fast web apps, and generalized software development. One of the things that I like most about both languages is the thousands of packages available making almost everything you want to do in Python or R just a little easier from ETL, to method chaining, to creating interactive content. I certainly welcome any suggestions that you might have for the lists below, thanks!

Python Books

The Quick Python Book (3e): This book by Naomi Ceder is a few years old now (2018) but it’s the best end-to-end intro on Python that I’ve yet read taking you from basic classes / structures to function writing to working with modules
Python Data Science Handbook: Introduction to the core libraries essential for working with data in Python
Effective Pandas 2: Opinionated Patterns for Data Manipulation: After a wildly successful 1st edition, Matt Harrison is back with the revised 2nd edition with easy to follow tutorials for mastering the popular Pandas library
Tidy Finance with Python: This is one of my favorite newer books covering complex financial modeling, valuation, and pricing and represents “an opinionated approach to empirical research in financial economics [with an] open-source code base”

Python Packages

NumPy: Brings the computational power of C and Fortran to Python programmers for applying high-level mathematical functions to arrays and more
Pandas: This is the most popular package for data manipulation and analysis with extended operations available for tabular and time series data
uv: A modern, high-performance Python package manager / installer written in Rust that serves as a drop-in replacement for traditional Python package management tools like pip
Narwhals: Extremely lightweight compatibility layer between dataframe libraries with support for pandas, Polars, PyArrow, DuckDB, Ibis, Modin, Vaex, and anything which implements the DataFrame Interchange Protocol
scikit-learn: Built on top of NumPy, SciPy, and matplotlib, “sklearn” makes the development of predictive analysis workflows a simple and reproducible process
langchain: Simplify the creation of applications powered by LLMs, offering modular components for prompt management, memory, and chaining operations
huggingface: A comprehensive platform and community dedicated to advancing state-of-the-art machine learning, offering tools, datasets, and models for seamless integration into applications
chromadb: Chroma is the open-source AI application database that makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs

R Books: Classics

R for Data Science (2e): Phenomenal introduction to R, the RStudio IDE, and the tidyverse collection of packages
Advanced R (2e): Covers concepts, methods, and advanced object-oriented structures for R
Mastering Shiny: Designed to teach the foundations of Shiny for web development and more advanced concepts such as the introduction of modules to the Shiny framework
R Packages (2e): The definitive reference point for R package development “covering workflow and process, alongside the presentation of all the important moving parts that make up an R package”

R Books: Applied Resources

Tidy Modeling with R: Over the last few months, I’ve learned a lot from this A to Z resource on predictive modeling workflows using the tidymodels framework
Supervised Machine Learning for Text Analysis in R: This book is a masterclass in natural language processing taking you from the basics of NLP to real-life applications including inference and prediction
Forecasting Principles and Practice (3e): Said best by the author, “The book is written for three audiences: (1) people finding themselves doing forecasting in business when they may not have had any formal training in the area; (2) undergraduate students studying business; (3) MBA students doing a forecasting elective”
Deep Learning with R (2e): In-depth introduction to artificial intelligence and deep learning applications with R using the Keras library (Note: This book hasn’t been updated since the release of Keras3, but the concepts found in the book are still deep learning foundations)
Tidy Finance with R: This is one of my favorite newer books covering complex financial modeling, valuation, and pricing and represents “an opinionated approach to empirical research in financial economics [with an] open-source code base in multiple programming languages”
Feature Engineering A-Z: A reference guide to nearly all feature engineering methods you will encounter with examples in R and Python

R Packages

tidyverse: A collection of packages for data manipulation and functional programming (I use dplyr, stringr, and purrr on a daily basis)
tidymodels: Hands-down my preferred collection of packages for building reproducible machine learning recipes, workflows, model tuning, model stacking, and cross-validation
DT: This is an R implementation of the popular DataTables JavaScript library that lets you build polished, configurable tables for use in web reports, slides, and Shiny apps
bslib: A modern UI toolkit for Shiny and Quarto based on Bootstrap that can be easily customized with Bootswatch theming and Sass variables.
leaflet: R implementation of the popular Leaflet JavaScript library for developing interactive maps
embed: One of my go-to machine learning packages for data pre-processing with extra steps that are intended to be used with the recipes framework for embedding categorical and numeric predictors into one or more numeric columns
renv: This package helps you create reproducible environments for your R projects to more easily manage package dependencies on your operating system vs your project directory

Language Agnostic Frameworks

Arrow: Apache Arrow is a columnar memory format for flat and hierarchical data, organized for efficient analytic operations, supporting zero-copy reads for lightning-fast data access without serialization overhead
DuckDB: DuckDB is an in-process SQL OLAP database management system capable of larger than memory tabular data processing and with a growing list of community extensions
Polars: Polars is a lightning fast DataFrame library/in-memory query engine written in Rust and built upon the Arrow specification - It’s a great tool for efficient data wrangling, data pipelines, snappy APIs and much more
Quarto: An open-source scientific and technical publishing system that can be used with Python, R, Julia, and Observable to develop and deploy reproducible content such as articles, presentations, dashboards, websites, blogs, and e-books
Shiny: Easily create highly interactive web apps with Python or R without the need for formal web app development skills
Spark: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters