Synopsis
This series aims to teach you the high-level fundamentals of machine learning from A to Z. I'll teach you the basic intuition, algorithms, and math. We'll discuss languages and frameworks, deep learning, and more. Audio may be an inferior medium for the task, but with all the hours we spend on exercise, commutes, and chores, not having an audio supplement to your education would be a missed opportunity. And where your other resources provide the machine learning trees, I'll provide the forest. Additionally, consider me your syllabus: at the end of every episode I'll provide the best-of-the-best resources, curated from around the web, for you to learn each episode's details.
Episodes
-
MLA 013 Tech Stack for Customer-Facing Machine Learning Products
03/01/2021 Duration: 47min
Primary technology recommendations for building a customer-facing machine learning product include React and React Native for the front end, serverless platforms like AWS Amplify or GCP Firebase for authentication and basic server/database needs, and Postgres as the relational database of choice. Serverless approaches are encouraged for scalability and security, with traditional server frameworks and containerization recommended only for advanced custom backend requirements. When serverless options are inadequate, use Node.js with Express or FastAPI in Docker containers, and consider adding Redis for in-memory sessions and RabbitMQ or SQS for job queues, though many of these functions can be handled by Postgres. The machine learning server itself, including deployment strategies, will be discussed separately.
Links: Notes and resources at ocdevel.com/mlg/mla-13. Try a walking desk: stay healthy & sharp while you learn & code.
Client Applications: React is recommended as the primary web front-end fram
-
MLA 012 Docker for Machine Learning Workflows
09/11/2020 Duration: 31min
Docker enables efficient, consistent machine learning environment setup across local development and cloud deployment, avoiding many pitfalls of virtual machines and manual dependency management. It streamlines system reproduction, resource allocation, and GPU access, supporting portability and simplified collaboration for ML projects. Machine learning engineers benefit from using pre-built Docker images tailored for ML, allowing seamless project switching, host OS flexibility, and straightforward deployment to cloud platforms like AWS ECS and Batch, resulting in reproducible and maintainable workflows.
Links: Notes and resources at ocdevel.com/mlg/mla-12.
Traditional Environment Setup Challenges: Traditional machine learning development often requires configuring operating systems, GPU drivers (CUDA, cuDNN), and specific package versions directly on the host machine. Manual setup can lead to version conflicts, resource allocation issue
-
MLG 032 Cartesian Similarity Metrics
08/11/2020 Duration: 41min
Try a walking desk while studying ML or working on your projects! Show notes at ocdevel.com/mlg/32.
L1/L2 norm, Manhattan, Euclidean, cosine distances, dot product.
Normed distances: A norm is a function that assigns a non-negative length to each vector in a vector space (strictly positive for every non-zero vector). Minkowski distance is the generalized form: (sum_i |x_i - y_i|^p)^(1/p), with p = 1, 2, ... giving the cases below.
L1: Manhattan/city-block/taxicab. abs(x2-x1) + abs(y2-y1). Grid-like distance (triangle legs). Preferred for high-dim space.
L2: Euclidean. sqrt((x2-x1)^2 + (y2-y1)^2), i.e. the square root of the difference vector's dot product with itself. Straight-line distance; min distance (Pythagorean triangle edge).
Others: Mahalanobis, Chebyshev (p=inf), etc.
Dot product: A type of inner product. Outer-product: lies outside the involved planes. Inner-product: dot product lies inside the planes/axes involved. Dot product: inner product on a finite dimensional Euclidean space.
Cosine (normalized dot)
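The metrics listed in these notes can be sketched in a few lines of plain Python. This is only an illustrative sketch, not code from the episode: the function names (`minkowski`, `manhattan`, `euclidean`, `chebyshev`, `dot`, `cosine_similarity`) are hypothetical helpers chosen for this example, and vectors are assumed to be equal-length sequences of numbers.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L1 distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """L2 distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)

def chebyshev(x, y):
    """Limit of Minkowski as p goes to infinity: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))

def dot(x, y):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Dot product divided by the product of the two L2 norms."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

origin, point = (0.0, 0.0), (3.0, 4.0)
print(manhattan(origin, point))   # sum of the triangle legs
print(euclidean(origin, point))   # straight-line distance
print(chebyshev(origin, point))   # largest single-axis difference
print(cosine_similarity((1.0, 0.0), (2.0, 0.0)))  # same direction
```

Note that cosine similarity ignores vector length: the last call compares two vectors of different magnitude that point the same way.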
-
MLA 011 Practical Clustering Tools
08/11/2020 Duration: 34min
Primary clustering tools for practical applications include K-means using scikit-learn or Faiss, agglomerative clustering leveraging cosine similarity with scikit-learn, and density-based methods like DBSCAN or HDBSCAN. For determining the optimal number of clusters, silhouette score is generally preferred over inertia-based visual heuristics, and it natively supports pre-computed distance matrices.
Links: Notes and resources at ocdevel.com/mlg/mla-11.
K-means Clustering: K-means is the most widely used clustering algorithm and is typically the first method to try for general clustering tasks. The scikit-learn KMeans implementation is suitable for small to medium-sized datasets, while Faiss's kmeans is more efficient and accurate for very large datasets. K-means requires the number of clusters to be specified in advance and relies on the Euclidean distance metric, which performs poorly in high-dimensional spaces. When document embedding
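To make the silhouette idea concrete, here is a minimal pure-Python sketch that scores a labelling directly from a pre-computed distance matrix. It is not scikit-learn's implementation (in practice `sklearn.metrics.silhouette_score` with `metric="precomputed"` does this job); the function name, the toy 1-D points, and both labellings are assumptions made up for illustration.

```python
def silhouette_from_distances(dist, labels):
    """Mean silhouette coefficient, given a square distance matrix and labels."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)

# Toy data: four points on a line, forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_from_distances(dist, [0, 0, 1, 1])  # groups match the gap
bad = silhouette_from_distances(dist, [0, 1, 0, 1])   # groups straddle the gap
print(good, bad)
```

Scores near 1 indicate tight, well-separated clusters and negative scores indicate points assigned to the wrong cluster, so comparing the score across candidate cluster counts is one way to pick the number of clusters.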
-
MLA 010 NLP packages: transformers, spaCy, Gensim, NLTK
28/10/2020 Duration: 26min
The landscape of Python natural language processing tools has evolved from broad libraries like NLTK toward more specialized packages such as Gensim for topic modeling, spaCy for linguistic analysis, and Hugging Face Transformers for advanced tasks, with Sentence Transformers extending transformer models to enable efficient semantic search and clustering. Each library occupies a distinct place in the NLP workflow, from fundamental text preprocessing to semantic document comparison and large-scale language understanding.
Links: Notes and resources at ocdevel.com/mlg/mla-10.
Historical Foundation: NLTK. NLTK ("Natural Language Toolkit") was one of the earliest and most popular Python libraries for natural language processing, covering tasks from tokenization and stemming to document classification and syntax parsing. NLTK remains a catch-all "Swiss Army knife" for NLP, but many of its functions have been supplemented or superseded by newe
-
MLA 009 Charting and Visualization Tools for Data Science
06/11/2018 Duration: 24min
Covers the Python charting libraries Matplotlib, Seaborn, and Bokeh, explaining their strengths from quick EDA to interactive, HTML-exported visualizations, and clarifies where D3.js fits as a JavaScript alternative for end-user applications. It also evaluates major software solutions like Tableau, Power BI, QlikView, and Excel, detailing how modern BI tools now integrate drag-and-drop analytics with embedded machine learning, potentially allowing business users to automate entire workflows without coding.
Links: Notes and resources at ocdevel.com/mlg/mla-9.
Core Phases in Data Science Visualization. Exploratory Data Analysis (EDA): EDA occupies an early stage in the Business Intelligence (BI) pipeline, positioned just before or sometimes merged with the data cleaning (“munging”) phase. The outputs of EDA (e.g., correlation matrices, histograms) often serve as inputs to subsequent machine learning steps.
Python Visualization Libraries 1. M
-
MLA 008 Exploratory Data Analysis (EDA)
26/10/2018 Duration: 25min
Exploratory data analysis (EDA) sits at the critical pre-modeling stage of the data science pipeline, focusing on uncovering missing values, detecting outliers, and understanding feature distributions through both statistical summaries and visualizations, such as Pandas' info(), describe(), histograms, and box plots. Visualization tools like Matplotlib, along with processes including imputation and feature correlation analysis, allow practitioners to decide how best to prepare, clean, or transform data before it enters a machine learning model.
Links: Notes and resources at ocdevel.com/mlg/mla-8.
EDA in the Data Science Pipeline. Position in Pipeline: EDA is an essential pre-processing step in the business intelligence (BI) or data science pipeline, occurring after data acquisition but before model training. Purpose: The goal of EDA is to understand the data by identifying: Missing values (nulls); Outliers; Feature distributions; Relation
-
MLA 007 Jupyter Notebooks
16/10/2018 Duration: 16min
Jupyter Notebooks, originally conceived as IPython Notebooks, enable data scientists to combine code, documentation, and visual outputs in an interactive, browser-based environment supporting multiple languages like Python, Julia, and R. This episode details how Jupyter Notebooks structure workflows into executable cells - mixing markdown explanations and inline charts - which is essential for documenting, demonstrating, and sharing data analysis and machine learning pipelines step by step.
Links: Notes and resources at ocdevel.com/mlg/mla-7.
Overview of Jupyter Notebooks. Historical Context and Scope: Jupyter Notebooks began as IPython Notebooks focused solely on Python. The project was renamed Jupyter to support additional languages - namely Julia ("JU"), Python ("PY"), and R ("R") - broadening its applicability for data science and machine learning across multiple languages. Interactive, Narrative-Driven Coding: Jupyter Notebooks
-
MLA 006 Salaries for Data Science & Machine Learning
19/07/2018 Duration: 19min
O'Reilly's 2017 Data Science Salary Survey finds that location is the most significant salary determinant for data professionals, with median salaries ranging from $134,000 in California to under $30,000 in Eastern Europe, and highlights that negotiation skills can lead to salary differences as high as $45,000. Other key factors impacting earnings include company age and size, job title, industry, and education, while popular tools and languages such as Python, SQL, and Spark do not strongly influence salary despite widespread use.
Links: Notes and resources at ocdevel.com/mlg/mla-6.
Global and Regional Salary Differences. Median Global Salary: $90,000 USD, up from $85,000 the previous year. Regional Breakdown: United States: $112,000 median; California leads at $134,000. Western Europe: $57,000, about half the US median. Australia & New Zealand: Second after the US. Eastern Europe: Below $30,000. Asia: Wide interquartile salary ran
-
MLA 005 Shapes and Sizes: Tensors and NDArrays
09/06/2018 Duration: 27min
Explains the fundamental differences between tensor dimensions, size, and shape, clarifying frequent misconceptions (such as the distinction between the number of features, or “columns”, and true data dimensions) while also demystifying reshaping operations like expand_dims, squeeze, and transpose in NumPy. Through practical examples from images and natural language processing, listeners learn how to manipulate tensors to match model requirements, including scenarios like adding dummy dimensions for grayscale images or reordering axes for sequence data.
Links: Notes and resources at ocdevel.com/mlg/mla-5.
Definitions. Tensor: A general term for an array of any number of dimensions. 0D Tensor (Scalar): A single number (e.g., 5). 1D Tensor (Vector): A simple list of numbers. 2D Tensor (Matrix): A grid of numbers (rows and columns). 3D+ Tensors: Higher-dimensional arrays, such as images or batches of images. NDArray (NumPy): Stands for "N-
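The reshaping operations named in this summary can be tried out with a short NumPy sketch. The array sizes below (a 100x5 table, a 28x28 grayscale image, a 32x10x300 sequence batch) and the axis orderings are assumptions picked for illustration, not values taken from the episode.

```python
import numpy as np

# A table with 100 rows and 5 feature columns is still only 2-dimensional.
table = np.zeros((100, 5))
print(table.ndim, table.shape, table.size)   # dimensions, shape, element count

# Grayscale image: add a dummy channel axis, then a dummy batch axis.
image = np.zeros((28, 28))
with_channel = np.expand_dims(image, axis=-1)   # shape (28, 28, 1)
batch = np.expand_dims(with_channel, axis=0)    # shape (1, 28, 28, 1)

# squeeze drops every axis of length 1, recovering the original shape.
restored = np.squeeze(batch)                    # shape (28, 28)

# Sequence data: reorder (batch, time, features) to (time, batch, features).
sequences = np.zeros((32, 10, 300))
time_major = np.transpose(sequences, (1, 0, 2))  # shape (10, 32, 300)

print(with_channel.shape, batch.shape, restored.shape, time_major.shape)
```

None of these calls changes the number of elements; they only change how the same values are indexed, which is why `size` stays constant while `shape` and `ndim` vary.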
-
MLA 003 Storage: HDF, Pickle, Postgres
24/05/2018 Duration: 17min
Covers a practical workflow for loading, cleaning, and storing large datasets for machine learning, moving from ingesting raw CSVs or JSON files with pandas to saving processed datasets and neural network weights using HDF5 for efficient numerical storage. It distinguishes among storage options, explaining when to use HDF5, pickle files, or SQL databases, while highlighting how libraries like pandas, TensorFlow, and Keras interact with these formats and why these choices matter for production pipelines.
Links: Notes and resources at ocdevel.com/mlg/mla-3.
Data Ingestion and Preprocessing. Data Sources and Formats: Datasets commonly originate as CSV (comma-separated values), TSV (tab-separated values), fixed-width files (FWF), JSON from APIs, or directly from databases. Typical applications include structured data (e.g., real estate features) or unstructured data (e.g., natural language corpora for sentiment analysis). Pandas as the Cor
-
MLA 002 Numpy & Pandas
24/05/2018 Duration: 18min
NumPy enables efficient storage and vectorized computation on large numerical datasets in RAM by leveraging contiguous memory allocation and low-level C/Fortran libraries, drastically reducing memory footprint compared to native Python lists. Pandas, built on top of NumPy, introduces labelled, flexible tabular data manipulation, facilitating intuitive row and column operations, powerful indexing, and seamless handling of missing data through tools like alignment, reindexing, and imputation.
Links: Notes and resources at ocdevel.com/mlg/mla-2.
NumPy: Efficient Numerical Arrays and Vectorized Computation. Purpose and Design: NumPy ("Numerical Python") is the foundational library for handling large numerical datasets in RAM. It introduces the ndarray (n-dimensional array), which is synonymous with a tensor, enabling storage of vectors, matrices, or higher-dimensional data. Memory Efficiency: NumPy arrays are homogeneous: all elements sha
-
MLA 001 Degrees, Certificates, and Machine Learning Careers
24/05/2018 Duration: 11min
While industry-respected credentials like Udacity Nanodegrees help build a practical portfolio for machine learning job interviews, they remain insufficient stand-alone qualifications: most roles require a Master’s degree as a near-hard requirement, especially compared to more flexible web development fields. A Master’s, such as Georgia Tech’s OMSCS, not only greatly increases employability but is strongly recommended for those aiming for entry into machine learning careers, while a PhD is more appropriate for advanced, research-focused roles with significant time investment.
Links: Notes and resources at ocdevel.com/mlg/mla-1.
Online Certificates: Usefulness and Limitations. Udacity Nanodegree: Provides valuable hands-on experience and a practical portfolio of machine learning projects. Demonstrates self-motivation and the ability to self-teach. Not industry-recognized as a formal qualification; does not by itself suffice for job placement in most companies. Best used as a supplement to demonstrate applied
-
MLG 029 Reinforcement Learning Intro
05/02/2018 Duration: 43min Introduction to reinforcement learning concepts. ocdevel.com/mlg/29 for notes and resources.
-
MLG 028 Hyperparameters 2
04/02/2018 Duration: 51min Hyperparameters part 2: hyper-search, regularization, SGD optimizers, scaling. ocdevel.com/mlg/28 for notes and resources
-
MLG 027 Hyperparameters 1
28/01/2018 Duration: 47min Hyperparameters part 1: network architecture. ocdevel.com/mlg/27 for notes and resources
-
MLG 026 Project Bitcoin Trader
27/01/2018 Duration: 38min Community project & intro to Bitcoin/crypto + trading. ocdevel.com/mlg/26 for notes and resources
-
MLG 025 Convolutional Neural Networks
30/10/2017 Duration: 44min Convnets or CNNs. Filters, feature maps, window/stride/padding, max-pooling. ocdevel.com/mlg/25 for notes and resources
-
MLG 024 Tech Stack
07/10/2017 Duration: 01h01min TensorFlow, Pandas, Numpy, Scikit-Learn, Keras, TensorForce. ocdevel.com/mlg/24 for notes and resources
-
MLG 023 Deep NLP 2
20/08/2017 Duration: 43min RNN review, bi-directional RNNs, LSTM & GRU cells. ocdevel.com/mlg/23 for notes and resources