
Volume Forms and Probability Density Functions Under Change of Variables

From elementary probability theory, it is well known that a probability density function (pdf) is not invariant under an arbitrary change of variables (reparametrization). In this article, we'll see that a pdf is actually invariant when viewed in its entirety, as a volume form and a Radon-Nikodym derivative in differential geometry.

The Invariance of the Hessian and Its Eigenvalues, Determinant, and Trace

In deep learning, the Hessian and its downstream quantities are observed not to be invariant under reparametrization. This makes the Hessian a poor proxy for flatness and makes Newton's method non-invariant. In this post, we shall see that the Hessian and the quantities derived from it are actually invariant under reparametrization.

Convolution of Gaussians and the Probit Integral

Gaussian distributions are very useful in Bayesian inference due to their (many!) convenient properties. In this post we take a look at two of them: the convolution of two Gaussian pdfs and the integral of the probit function w.r.t. a Gaussian measure.

The Last Mile of Creating Publication-Ready Plots

In machine learning papers, plots are often treated as an afterthought: authors often simply use the default Matplotlib style, resulting in an out-of-place look when the paper is viewed as a whole. In this post, I'm sharing how I make my publication-ready plots using TikZ.

Modern Arts of Laplace Approximations

The Laplace approximation (LA) is a simple yet powerful class of methods for approximating intractable posteriors. Yet, it is largely forgotten in the Bayesian deep learning community. Here, we review the LA, and highlight a recent software library for applying LA to deep nets.

Chentsov's Theorem

The Fisher information is often the default choice of the Riemannian metric for manifolds of probability distributions. In this post, we study Chentsov's theorem, which justifies this choice. It says that the Fisher information is the unique Riemannian metric (up to a scaling constant) that is invariant under sufficient statistics. This fact makes the Fisher metric stand out from other choices.

The Curvature of the Manifold of Gaussian Distributions

The Gaussian probability distribution is central in statistics and machine learning. As it turns out, by equipping the set of all Gaussian pdfs with a Riemannian metric given by the Fisher information, we can see it as a Riemannian manifold. In this post, we will prove that this manifold can be covered by a single coordinate chart and has a constant negative curvature.

Hessian and Curvatures in Machine Learning: A Differential-Geometric View

In machine learning, especially in neural networks, the Hessian matrix is often treated synonymously with curvature. But, from calculus alone, it is not clear why one can say so. Here, we will view the loss landscape of a neural network as a hypersurface and apply a differential-geometric analysis to it.

Optimization and Gradient Descent on Riemannian Manifolds

One of the most ubiquitous applications of differential geometry is optimization. In this article, we will discuss the familiar optimization problem on Euclidean spaces, focusing on the gradient descent method, and generalize it to Riemannian manifolds.

Notes on Riemannian Geometry

This article is a collection of small notes on Riemannian geometry that I find useful as references. It is largely based on Lee's books on smooth and Riemannian manifolds.

Minkowski's, Dirichlet's, and Two Squares Theorem

Applications of Minkowski's Theorem to geometry problems, Dirichlet's Approximation Theorem, and the Two Squares Theorem.

Reduced Betti number of sphere: Mayer-Vietoris Theorem

A proof of the reduced homology of the sphere using the Mayer-Vietoris sequence.

Brouwer's Fixed Point Theorem: A Proof with Reduced Homology

A proof of a special case (the ball) of Brouwer's Fixed Point Theorem using reduced homology.

Natural Gradient Descent

Intuition and derivation of natural gradient descent.

Fisher Information Matrix

An introduction to, and intuition for, the Fisher Information Matrix.

Introduction to Annealed Importance Sampling

An introduction and implementation of Annealed Importance Sampling (AIS).

Gibbs Sampler for LDA

Implementation of a Gibbs sampler for inference in Latent Dirichlet Allocation (LDA).

Boundary Seeking GAN

Training a GAN by moving the generated samples toward the decision boundary.

Least Squares GAN

2017 is the year GAN lost its logarithm. First, it was Wasserstein GAN, and now, it's LSGAN's turn.

CoGAN: Learning joint distribution with GAN

The original GAN and the Conditional GAN are for learning the marginal and conditional distributions of data, respectively. But how can we extend them to learn a joint distribution instead?

Wasserstein GAN implementation in TensorFlow and Pytorch

Wasserstein GAN comes with the promise of stabilizing GAN training and eliminating the mode collapse problem in GANs.

InfoGAN: unsupervised conditional GAN in TensorFlow and Pytorch

Adding a mutual information regularizer to a GAN turns out to give us a very nice effect: learning a data representation and its properties in an unsupervised manner.

Maximizing likelihood is equivalent to minimizing KL-Divergence

We will show that doing MLE is equivalent to minimizing the KL-Divergence between the estimator and the true distribution.

Variational Autoencoder (VAE) in Pytorch

With all of those bells and whistles surrounding Pytorch, let's implement a Variational Autoencoder (VAE) using it.

Generative Adversarial Networks (GAN) in Pytorch

Pytorch is a new Python Deep Learning library, derived from Torch. In contrast to Theano's and TensorFlow's symbolic operations, Pytorch uses an imperative programming style, which makes its implementation more "Numpy-like".

Theano for solving Partial Differential Equation problems

We all know Theano as a forefront library for Deep Learning research. However, it should be noted that Theano is a general purpose numerical computing library, like Numpy. Hence, in this post, we will look at the implementation of PDE simulation in Theano.

Linear Regression: A Bayesian Point of View

You know the drill: apply mean squared error, then descend those gradients. But what is the intuition behind that process from a Bayesian point of view?

MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation

In this post, we will see what the difference is between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.

Conditional Generative Adversarial Nets in TensorFlow

Having seen the GAN, VAE, and CVAE models, it is only proper to study the Conditional GAN model next!

KL Divergence: Forward vs Reverse?

KL Divergence is a measure of how different two probability distributions are. It is not symmetric, and each ordering of its arguments has its own interesting properties, especially when we use it in optimization settings, e.g. the Variational Bayes method.

Conditional Variational Autoencoder: Intuition and Implementation

An extension of the Variational Autoencoder (VAE), the Conditional Variational Autoencoder (CVAE) enables us to learn a conditional distribution of our data, which makes the VAE more expressive and applicable to many interesting things.

Variational Autoencoder: Intuition and Implementation

Variational Autoencoder (VAE) (Kingma et al., 2013) is a new perspective in the autoencoding business. It views the Autoencoder as a Bayesian inference problem: modeling the underlying probability distribution of data.

Deriving Contractive Autoencoder and Implementing it in Keras

The Contractive Autoencoder is a more sophisticated kind of Autoencoder than those in the last post. Here, we will dissect the loss function of the Contractive Autoencoder and derive it so that we can implement it in Keras.

Many flavors of Autoencoder

Autoencoder is a family of methods that answers the problem of data reconstruction using a neural net. There are several variations of Autoencoder: sparse, multilayer, and convolutional. In this post, we will look at those different kinds of Autoencoders and learn how to implement them with Keras.

Level Set Method Part II: Image Segmentation

The Level Set Method is an interesting classical (pre-deep-learning) Computer Vision method, based on Partial Differential Equations (PDEs), for image segmentation. In this post, we will look at its application in image segmentation.

Level Set Method Part I: Introduction

The Level Set Method is an interesting classical (pre-deep-learning) Computer Vision method, based on Partial Differential Equations (PDEs), for image segmentation. In this post, we will look at the intuition behind it.

Residual Net

In this post, we will look into the record-breaking convnet model of 2015: the Residual Net (ResNet).

Generative Adversarial Nets in TensorFlow

Let's try to implement Generative Adversarial Nets (GAN), first introduced by Goodfellow et al., 2014, with TensorFlow. We'll use MNIST data to train the GAN!

How to Use Specific Image and Description when Sharing Jekyll Post to Facebook

Normally, a random subset of pictures and the site's description will be picked when we share our Jekyll blog post URL on Facebook. This is how to force Facebook to use a specific image and description for our blog post!

Deriving LSTM Gradient for Backpropagation

Deriving a neural net's gradient is a great exercise for understanding backpropagation and computational graphs better. In this post, we will walk through the process of deriving the LSTM net gradient so that we can use it in backpropagation.

Convnet: Implementing Maxpool Layer with Numpy

Another important building block in a convnet is the pooling layer. Nowadays, the most widely used is the max pool layer. Let's dissect its Numpy implementation!

Convnet: Implementing Convolution Layer with Numpy

Convnet is dominating the world of computer vision right now. What makes it special is, of course, the convolution layer, hence the name. Let's study it further by implementing it from scratch using Numpy!

Implementing BatchNorm in Neural Net

BatchNorm is a relatively new technique for training neural nets. It gives us a lot of leeway when initializing the network and accelerates training.

Implementing Dropout in Neural Net

Dropout is one simple way to regularize a neural net model. This is one of the recent advancements in Deep Learning that makes training deeper and deeper neural nets tractable.

Beyond SGD: Gradient Descent with Momentum and Adaptive Learning Rate

There are many attempts to improve Gradient Descent: some add momentum, some add an adaptive learning rate. Let's see what's out there in the realm of neural net optimization.

Implementing Minibatch Gradient Descent for Neural Networks

Let's use Python and Numpy to implement the Minibatch Gradient Descent algorithm for a simple 3-layer neural network.

Parallelizing Monte Carlo Simulation in Python

Monte Carlo simulation is all about quantity. It can take a long time to complete. Here's how to speed it up with the amazing Python multiprocessing module!

Scrapy as a Library in Long Running Process

Scrapy is a great web crawler framework, but it's tricky to make it run as a library in a long-running process. Here's how!

Gaussian Anomaly Detection

In both the Frequentist and the Bayesian way.

Slice Sampling

An implementation example of Slice Sampling for a special case: a unimodal distribution with a known inverse PDF.

Rejection Sampling

Rejection is always painful, but it's for the greater good! You can sample from a complicated distribution by rejecting samples!


Metropolis-Hastings

An implementation example of the Metropolis-Hastings algorithm in Python.

Gibbs Sampling

An example of a Gibbs Sampling implementation in Python to sample from a Bivariate Gaussian.

Twitter Authentication with Tweepy and Flask

A tutorial on how to do Twitter OAuth authentication in a Flask web application.

Deploying Wagtail App

In this post, I'll show you how to deploy our blog and how to solve some common problems when deploying a Wagtail app.

Developing Blog with Wagtail

My experience of building this blog using Wagtail CMS, with zero Django knowledge. Let’s code our blog!

Setting Up Wagtail Development Environment

My experience of building a blog using Wagtail CMS, with zero Django knowledge. I’ll walk you through it from scratch up until the blog is live.