Data Preprocessing: Whitening or Sphering in Python

Outline of the post:

  • What is Whitening or Sphering? Why?
  • Steps to Whiten a dataset
  • Mathematical intuition
  • Implementation of Whitening in Python

What is Whitening or Sphering? Why?

“A whitening transformation or sphering transformation is a linear transformation that transforms a vector of random variables with a known covariance matrix into a set of new variables whose covariance is the identity matrix, meaning that they are uncorrelated and each have variance 1. The transformation is called “whitening” because it changes the input vector into a white noise vector.”

Whitening Transformation, Wikipedia, 2020

Whitening or Sphering is a data pre-processing step. It can be used to remove correlation or dependencies between features in a dataset, which may help a machine learning model train better. Figure 1 below shows the effect of decorrelation and whitening on a dataset with correlated features (orange dots).

Figure 1: Example plots showing correlated features (left), decorrelated features (middle) and Whitening or Sphering (right).

Steps to Whiten a dataset

Eigenvalues and eigenvectors are first calculated from the covariance matrix of a zero-centered dataset. These are then used to whiten the data with either the PCA (principal component analysis) or ZCA (zero-phase component analysis) method. A compact sketch of these steps follows the list below; full worked examples appear in the implementation section.

  • Step # 1: Determine whether the data has one feature per row or one feature per column
  • Step # 2: Zero-center the dataset
  • Step # 3: Calculate the covariance matrix of the zero-centered dataset
  • Step # 4: Calculate the Eigenvalues and Eigenvectors of the covariance matrix
  • Step # 5: Apply the Eigenvalues and Eigenvectors to the data to perform the whitening transform
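As a compact preview, the five steps can be condensed into a small NumPy helper like the sketch below. This is a minimal sketch, assuming one feature per row; the function name whiten, the method argument, and the eps regularizer are illustrative, not part of the original examples.

import numpy as np

def whiten(x, method='pca', eps=1e-5):
    # Step 1 (assumed): 'x' has one feature per row, one observation per column
    # Step 2: zero-center each feature
    xc = x - x.mean(axis=1, keepdims=True)
    # Step 3: covariance matrix of the zero-centered data
    xcov = np.cov(xc, rowvar=True, bias=True)
    # Step 4: Eigenvalues (w) and Eigenvectors (v) of the covariance matrix
    w, v = np.linalg.eigh(xcov)
    # Step 5: scale by the inverse square root of the Eigenvalues
    # ('eps' is a hypothetical guard against division by zero)
    diagw = np.diag(1.0 / np.sqrt(w + eps))
    wpca = diagw @ v.T @ xc                        # PCA whitening
    return v @ wpca if method == 'zca' else wpca   # ZCA rotates back to the original axes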

Mathematical Intuition

Zero-center data:

$$x_{centered} = x - x_{mean}$$

Equation 1: Zero center using mean

where,

  • x_centered: the centered value for a feature
  • x: an individual data point for a feature
  • x_mean: the mean value of the feature
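For example (a minimal illustration):

import numpy as np

x = np.array([1., 2., 3., 4., 5.])  # one feature
xc = x - x.mean()                   # Equation 1
print(xc)                           # [-2. -1.  0.  1.  2.]
print(xc.mean())                    # 0.0 -- the centered feature has zero mean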

Covariance matrix:

$$x_{variance} = \frac{1}{n}\sum_{i=1}^{n}(x_i - x_{mean})^2$$

Equation 2: Variance (biased)

$$x_{covariance} = \frac{1}{n}\sum_{i=1}^{n}(x_i - x_{mean})(y_i - y_{mean})$$

Equation 3: Covariance (biased)

$$x_{variance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - x_{mean})^2$$

Equation 4: Variance (unbiased)

$$x_{covariance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - x_{mean})(y_i - y_{mean})$$

Equation 5: Covariance (unbiased)

where:

  • x_variance: the variance of a feature
  • x_covariance: the covariance between features x and y
  • x_i and y_i: individual data points for features x and y
  • x_mean and y_mean: the mean values of features x and y
  • Σ: indicates a sum over all observations
  • n: the number of observations for a feature
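As a quick check of Equations 2–5, NumPy's bias flag selects between the 1/n and 1/(n−1) versions (the toy values here are illustrative):

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 4., 6., 8., 10.])
n = len(x)

# Biased covariance (Equation 3): divide by n
manual = np.sum((x - x.mean()) * (y - y.mean())) / n
print(manual)                          # 4.0
print(np.cov(x, y, bias=True)[0, 1])   # 4.0
# Unbiased covariance (Equation 5): divide by n-1
print(np.cov(x, y, bias=False)[0, 1])  # 5.0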

Eigenvalues and Eigenvectors:

$$\det(A - \lambda I) = 0$$

Equation 6: Substitute the covariance matrix for ‘A’ and solve this characteristic equation for ‘λ’, i.e. the Eigenvalues.

$$(A - \lambda I)v = 0$$

Equation 7: Substitute each Eigenvalue ‘λ’ from Equation 6 along with the covariance matrix ‘A’ to calculate the corresponding Eigenvector ‘v’.

where:

  • A: covariance matrix of the features
  • λ: an Eigenvalue of A (the Eigenvalues are later collected in the diagonal matrix Λ)
  • I: the identity matrix
  • v: the Eigenvector corresponding to λ (the Eigenvectors form the columns of U)
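A quick numerical check of Equations 6 and 7, using the covariance matrix that appears in Example 1 below:

import numpy as np

A = np.array([[2., 2.],
              [2., 2.]])  # covariance matrix from Example 1 below
w, v = np.linalg.eig(A)   # Eigenvalues in w, Eigenvectors in the columns of v
print(w.real)             # 4. and 0. (the matrix is singular)
# Verify A·v = λ·v for the first Eigenpair (Equation 7)
print(np.allclose(A @ v[:, 0], w[0] * v[:, 0]))  # True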

Whitening using PCA:

$$W_{PCA} = \Lambda^{-1/2}\, U^T X$$

Equation 8: Whitening transformation using PCA

where:

  • W_PCA: the PCA-whitened data
  • X: the zero-centered feature data
  • Λ: diagonal matrix of Eigenvalues (Λ^(−1/2) is its inverse square root)
  • U^T: transpose of the matrix U whose columns are the Eigenvectors
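Applying Equation 8 to synthetic data and checking that the whitened covariance is (nearly) the identity matrix; a minimal sketch, with arbitrary seed, mean, and covariance values:

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[6, 5], [5, 6]], size=500).T  # features x observations
Xc = X - X.mean(axis=1, keepdims=True)                             # zero-center
w, U = np.linalg.eigh(np.cov(Xc, rowvar=True, bias=True))          # eigendecomposition
W_pca = np.diag(w ** -0.5) @ U.T @ Xc                              # Equation 8
print(np.cov(W_pca, rowvar=True, bias=True).round(4))              # ~identity matrix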

Whitening using ZCA:

$$W_{ZCA} = U\, \Lambda^{-1/2}\, U^T X$$

Equation 9: Whitening transformation using ZCA

where:

  • W_ZCA: the ZCA-whitened data
  • X: the zero-centered feature data
  • Λ: diagonal matrix of Eigenvalues (Λ^(−1/2) is its inverse square root)
  • U and U^T: the matrix of Eigenvectors (as columns) and its transpose
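Note that Equation 9 is just Equation 8 followed by a rotation back with U, so the ZCA result can be obtained directly from the PCA result (same assumptions as the previous sketch):

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[6, 5], [5, 6]], size=500).T
Xc = X - X.mean(axis=1, keepdims=True)
w, U = np.linalg.eigh(np.cov(Xc, rowvar=True, bias=True))
W_pca = np.diag(w ** -0.5) @ U.T @ Xc         # Equation 8
W_zca = U @ W_pca                             # Equation 9: rotate the PCA result back with U
print(np.cov(W_zca, rowvar=True, bias=True).round(4))  # also ~identity matrix

ZCA-whitened data stays as close as possible to the original data in a least-squares sense, which is why it is often preferred when the whitened features should remain interpretable.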

Implementation of Whitening in Python

Example # 1: (with 5 data points)

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg

# Create data
x = np.array([[1,2,3,4,5],  # Feature-1: Height
              [11,12,13,14,15]]) # Feature-2: Weight
print('x.shape:', x.shape)

# Center data by subtracting the mean of each feature (row)
xc = x - np.mean(x, axis=1, keepdims=True)
print('xc.shape:', xc.shape, '\n')

# Calculate covariance matrix
xcov = np.cov(xc, rowvar=True, bias=True)
print('Covariance matrix: \n', xcov, '\n')

# Calculate Eigenvalues and Eigenvectors
w, v = linalg.eig(xcov)
# Note: Use w.real.round(4) to (1) drop the imaginary part ('j') and (2) round to 4 decimal places
print("Eigenvalues:\n", w.real.round(4), '\n')
print("Eigenvectors:\n", v, '\n')

# Calculate the inverse square root of the Eigenvalues
# Optional: add a small epsilon (e.g. 1e-5) to avoid division by zero
# Create a diagonal matrix
diagw = np.diag(1/(w**0.5)) # or np.diag(1/((w + 1e-5)**0.5))
diagw = diagw.real.round(4) # convert to real and round off
print("Diagonal matrix for inverse square root of Eigenvalues:\n", diagw, '\n')

# Calculate rotation (optional)
# Note: projecting onto the Eigenvectors (U^T X) rotates/decorrelates the data
xrot = np.dot(v.T, xc)

# Whitening transform using PCA (Principal Component Analysis)
wpca = np.dot(np.dot(diagw, v.T), xc)

# Whitening transform using ZCA (Zero Component Analysis)
wzca = np.dot(np.dot(np.dot(v, diagw), v.T), xc)

# Plot zero-centered, rotated and whitened data
plt.scatter(xc[0,:], xc[1,:], s=150, label='centered', alpha=0.5)
plt.scatter(xrot[0,:], xrot[1,:], s=150, label='rotated')
plt.scatter(wpca[0,:], wpca[1,:], s=150, marker='+', label='wpca')
plt.scatter(wzca[0,:], wzca[1,:], s=150, marker='x', label='wzca')
plt.xlabel('Height', fontsize=16)
plt.ylabel('Weight', fontsize=16)
plt.rc('xtick',labelsize=16)
plt.rc('ytick',labelsize=16)
plt.legend()
plt.tight_layout()
plt.savefig('whiten_1.png', dpi=300)
Output:

x.shape: (2, 5)
xc.shape: (2, 5) 

Covariance matrix: 
 [[2. 2.]
 [2. 2.]] 

Eigenvalues:
 [4. 0.] 

Eigenvectors:
 [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]] 

Diagonal matrix for inverse square root of Eigenvalues:
 [[5.00000000e-01 0.00000000e+00]
 [0.00000000e+00 4.74531328e+07]] 
Figure 2: The plot shows the original centered data, the rotated data, and the whitening-transformed data using PCA and ZCA.
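A note on the output above: the two toy features are perfectly correlated (Weight = Height + 10), so the covariance matrix is singular and the second Eigenvalue is zero up to floating-point noise. That is why the second diagonal entry of the inverse-square-root matrix blows up to roughly 4.7e+07. In such degenerate cases the commented-out epsilon variant is the safer choice:

diagw = np.diag(1/((w + 1e-5)**0.5)).real.round(4)  # the small epsilon keeps the division finite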

Example # 2: (1000 data points)

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg

# Create data
np.random.seed(1)
mu = [0,0]
sigma = [[6,5], [5,6]]
n = 1000
x = np.random.multivariate_normal(mu, sigma, size=n)
print('x.shape:', x.shape, '\n')

# Zero center data
xc = x - np.mean(x, axis=0)
print(xc.shape)
xc = xc.T
print('xc.shape:', xc.shape, '\n')

# Calculate Covariance matrix
# Note: 'rowvar=True' because each row is considered a feature
# Note: 'bias=True' to divide the sum of squared deviations by 'n' instead of 'n-1'
xcov = np.cov(xc, rowvar=True, bias=True)
print('Covariance matrix: \n', xcov, '\n')

# Calculate Eigenvalues and Eigenvectors
w, v = linalg.eig(xcov) # linalg.eigh() could also be used since xcov is symmetric
# Note: Use w.real.round(4) to (1) drop the imaginary part ('j') and (2) round to 4 decimal places
print("Eigenvalues:\n", w.real.round(4), '\n')
print("Eigenvectors:\n", v, '\n')

# Calculate the inverse square root of the Eigenvalues
# Optional: add a small epsilon (e.g. 1e-5) to avoid division by zero
# Create a diagonal matrix
diagw = np.diag(1/(w**0.5)) # or np.diag(1/((w + 1e-5)**0.5))
diagw = diagw.real.round(4) # convert to real and round off
print("Diagonal matrix for inverse square root of Eigenvalues:\n", diagw, '\n')

# Calculate rotation (optional)
# Note: projecting onto the Eigenvectors (U^T X) rotates/decorrelates the data
xrot = np.dot(v.T, xc)

# Whitening transform using PCA (Principal Component Analysis)
wpca = np.dot(np.dot(diagw, v.T), xc)

# Whitening transform using ZCA (Zero Component Analysis)
wzca = np.dot(np.dot(np.dot(v, diagw), v.T), xc)


fig = plt.figure(figsize=(10,3))

plt.subplot(1,4,1)
plt.scatter(x[:,0], x[:,1], label='original', alpha=0.15)
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
plt.legend()

plt.subplot(1,4,2)
plt.scatter(xc[0,:], xc[1,:], label='centered', alpha=0.15)
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
plt.legend()

plt.subplot(1,4,3)
plt.scatter(wpca[0,:], wpca[1,:], label='wpca', alpha=0.15)
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
plt.legend()

plt.subplot(1,4,4)
plt.scatter(wzca[0,:], wzca[1,:], label='wzca', alpha=0.15)
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
plt.legend()

plt.tight_layout()
plt.savefig('whiten_2.png', dpi=300)

Output:

x.shape: (1000, 2)

(1000, 2)
xc.shape: (2, 1000) 

Eigenvalues:
 [ 1.005  11.1862] 

Eigenvectors:
 [[-0.71582163 -0.69828318]
 [ 0.69828318 -0.71582163]] 

Diagonal matrix for inverse square root of Eigenvalues:
 [[0.9975 0.    ]
 [0.     0.299 ]] 
# Overlay plot
plt.scatter(xc[0,:], xc[1,:], label='centered', alpha=0.15)
plt.scatter(wpca[0,:], wpca[1,:], marker='+',label='wpca', alpha=0.5)
plt.scatter(wzca[0,:], wzca[1,:], marker='x', label='wzca', alpha=0.5)
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
plt.legend()

plt.tight_layout()
plt.savefig('whiten_3.png', dpi=300)


Check out the explanation in the YouTube video!
