Why Do We Need to Look at Distributions Before Setting Up a Model?¶
As many data scientists know, unless you are an "algorithm data scientist" who feeds the data straight into an ML algorithm and takes the results, the modelling process is dominated by preparation: checking nulls and outliers, imputing or discarding them (depending on your data), checking statistical assumptions (tests etc.), plotting distributions, feature engineering, feature selection and other work. On average, these steps take about 70% of the time.
So apart from the other data steps mentioned above, I think the crucial first step is to look at your data: plot your variables and examine their density plots.
Why?
Because once you have an idea of what your data is telling you, you can prepare it (with your business problem in mind) for modelling with ML algorithms, whether linear or non-linear (provided you also know the assumptions of those algorithms).
So let's dive into a dataset that I created artificially below and interpret what it tells us.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro
from scipy.stats import boxcox
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.neighbors import KernelDensity
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
# Seed for reproducibility
np.random.seed(42)
# Create correlated variables
# Base Gaussian variable
base_gaussian = np.random.normal(loc=0, scale=1, size=500)
# Targets
# Right skewed target correlated with base_gaussian (add noise)
target_skewed = np.exp(base_gaussian + np.random.normal(scale=0.5, size=500))
# Gaussian target correlated with base_gaussian (add noise)
target_gaussian = base_gaussian + np.random.normal(scale=0.5, size=500)
# Independent variables
# One Gaussian correlated with base_gaussian
X1 = base_gaussian + np.random.normal(scale=0.5, size=500)
# Two more random features from other parametric distributions (less correlated with the targets)
X2 = np.random.uniform(low=-3, high=3, size=500)
X3 = np.random.beta(a=2, b=5, size=500) * 6 - 3
# Create an arbitrary empirical distribution (non-parametric)
empirical_data = np.concatenate([
np.random.normal(0, 1, 200),
np.random.uniform(3, 5, 150),
np.random.beta(2, 5, 150) * 5
])
# Sample from this empirical distribution with replacement
X4 = np.random.choice(empirical_data, size=500, replace=True)
# Combine features into a matrix
X = np.vstack([X1, X2, X3, X4]).T
# Plot distributions of target variables and features
fig, axs = plt.subplots(2, 4, figsize=(18, 10))
sns.histplot(target_skewed, kde=True, ax=axs[0, 0], color='orange')
axs[0, 0].set_title('Right Skewed Target Distribution')
sns.histplot(target_gaussian, kde=True, ax=axs[0, 1], color='blue')
axs[0, 1].set_title('Gaussian Target Distribution')
sns.histplot(X1, kde=True, ax=axs[0, 2], color='green')
axs[0, 2].set_title('Gaussian Feature X1 Distribution')
sns.histplot(X2, kde=True, ax=axs[0, 3], color='purple')
axs[0, 3].set_title('Uniform Feature X2 Distribution')
sns.histplot(X3, kde=True, ax=axs[1, 0], color='brown')
axs[1, 0].set_title('Beta Feature X3 Distribution')
sns.histplot(X4, kde=True, ax=axs[1, 1], color='magenta')
axs[1, 1].set_title('Non Parametric Feature X4 Distribution')
# Hide unused subplots
axs[1, 2].axis('off')
axs[1, 3].axis('off')
plt.tight_layout()
plt.show()
Looking at the plots: I created two target variables, one right-skewed and one Gaussian. The four independent variables (predictors) have a mix of Gaussian and non-Gaussian forms.
- So can I apply a simple linear regression model to them?
- What are the assumptions of linear regression?
- I will not mention all of the assumptions here, only two:
- 1-) Homoscedasticity, or homogeneity of variances, is the assumption of equal or similar variances across the groups being compared (a quick visual check is sketched after this list).
- 2-) The other is the linear relationship between the target variable and the predictors; once we fit the model, we should expect the residuals to be approximately Gaussian. This is the important part.
- What do these graphs tell us?
- Transformation (of the right-skewed target variable)?
- Outlier handling?
- Imputation and other ideas?
- Discarding the uniform variable?
You have to answer all of this yourself before applying your model.
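As a quick visual check of the homoscedasticity point above, one can plot residuals against fitted values and look for a roughly constant vertical spread. This is only a minimal sketch, reusing the data generated above and the LinearRegression class already imported; the check_ variable names are just illustrative:
# Sketch: residuals vs. fitted values for the Gaussian target (a constant spread suggests homoscedasticity)
check_model = LinearRegression().fit(X, target_gaussian)
check_fitted = check_model.predict(X)
check_resid = target_gaussian - check_fitted
plt.figure(figsize=(6, 4))
plt.scatter(check_fitted, check_resid, alpha=0.4)
plt.axhline(0, color='red', lw=1)
plt.title('Residuals vs Fitted Values (Gaussian target)')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()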
Some of these distributions are parametric (for example X1, X2, X3) and one is non-parametric (X4). Why am I talking about this? Because in real-life scenarios we rarely deal with known parametric distributions; we mostly deal with highly non-parametric ones.
So how do we use these graphs?
This is where "Kernel Density Estimation" comes into play.
So what is "Kernel Density Estimation"?
1-) What is a Non-Parametric Distribution?
- A non-parametric distribution means we don’t assume any predefined functional form (like Gaussian, exponential, etc.) for the distribution of the variable. Instead, we let the data speak for itself.
2-) How to Estimate the Density Function Non-Parametrically?
- Using Kernel Density Estimation (KDE).
- KDE is a smoothing technique that estimates the probability density function (PDF) from data without assuming a parametric form.
The KDE at a point ( x ) is defined as:
$$ \hat{f}(x) = \frac{1}{n h} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right) $$ where:
- $n$ is the number of data points,
- $h$ is the bandwidth (smoothing parameter),
- $K$ is the kernel function (e.g., the Gaussian kernel),
- $x_i$ are the observed data points.
3-) What is $x$ here?
- Let's say you have some data: data = [2.1, 2.4, 2.6, 3.0, 3.2].
- If you want to estimate the density at x = 2.5, you plug x = 2.5 into the KDE formula (see the small numeric example after this list).
- The kernel function measures how close each data point $x_i$ is to your target point $x$, scaled by the bandwidth $h$.
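As a tiny worked example of the formula above (the data points and the evaluation point x = 2.5 are the illustrative values from this list; the bandwidth h = 0.5 is an arbitrary choice):
# Toy example: evaluate the KDE formula by hand at x = 2.5 (h = 0.5 is an arbitrary choice)
toy_data = np.array([2.1, 2.4, 2.6, 3.0, 3.2])
x_point, h = 2.5, 0.5
u = (x_point - toy_data) / h
kernel_vals = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
f_hat = kernel_vals.sum() / (len(toy_data) * h)
print(f'Estimated density at x = {x_point}: {f_hat:.3f}')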
4-) Intuition:
- If many data points lie close to $x$, the estimated density $\hat{f}(x)$ will be high; if few data points are near $x$, it will be low.
- The bandwidth $h$ controls the smoothing: a large $h$ gives a smoother curve, a small $h$ keeps more detail.
5-) If we are not assuming a parametric form for the data, why do we use, for example, a Gaussian kernel function, even though we already know that our data, or some of our variables, are not Gaussian? (There are many kernel functions; I mention the Gaussian kernel because it is the easiest for most people to understand.)
We don’t assume the data follows a Gaussian distribution, but we use a Gaussian-shaped weighting function (kernel) as a mathematical tool to estimate the density. It's not assuming the underlying data is Gaussian — it’s using a Gaussian-shaped curve to “spread” each data point’s influence locally.
When doing non-parametric estimation, like KDE, the goal is: Estimate the shape of the underlying density without assuming a global functional form (like normal, exponential, etc.). But to compute the estimate, we still need some local smoothing mechanism — that’s what the kernel function is for. The kernel is just a window function used to spread the influence of each data point. It could be Gaussian, Epanechnikov, Uniform, or others. The choice of kernel has minor impact — what matters most is the bandwidth h.
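To make that concrete, here is a minimal sketch (using the KernelDensity class already imported from sklearn; the bandwidths 0.2 and 1.0 are arbitrary illustrative values) that estimates the density of X4 with two different kernels and two different bandwidths. Swapping the kernel barely changes the curve, while changing the bandwidth visibly does:
# Sketch: effect of kernel choice vs. bandwidth on the KDE of X4 (bandwidth values are arbitrary)
grid = np.linspace(X4.min() - 1, X4.max() + 1, 300).reshape(-1, 1)
plt.figure(figsize=(10, 4))
for kernel_name, bw, style in [('gaussian', 0.2, '-'), ('epanechnikov', 0.2, '--'),
                               ('gaussian', 1.0, '-'), ('epanechnikov', 1.0, '--')]:
    kde_tmp = KernelDensity(kernel=kernel_name, bandwidth=bw).fit(X4.reshape(-1, 1))
    plt.plot(grid.ravel(), np.exp(kde_tmp.score_samples(grid)), style, label=f'{kernel_name}, h={bw}')
plt.title('Kernel choice changes little, bandwidth changes a lot')
plt.legend()
plt.show()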
6-) Think of it like this:
- Imagine you place a little “bump” (bell curve) centered at each data point.
- Those bumps can be Gaussian, triangular, etc.
- Then you add up all those bumps → this gives you a smooth approximation of the PDF.
- The bump shape (i.e., kernel) is not a claim about your data's distribution, just a tool for estimating the density.
7-) KDE-Based Algorithms:
- Kernel density-based anomaly detection: KDE can be used to estimate the density of normal data points. Points falling in low-density regions (low KDE values) are flagged as anomalies or outliers (sketched below).
- Mean shift clustering: a clustering algorithm that uses KDE to find modes (peaks) in the data distribution. It iteratively shifts points towards the nearest high-density region estimated by KDE.
- Non-parametric Bayesian methods: some Bayesian non-parametric models use KDE-like approaches to estimate distributions without assuming parametric forms.
- Density-based classification: KDE can be used to estimate class-conditional densities in a non-parametric way, which can then be used for classification (e.g., Parzen window classifiers).
- Generative models: KDE can be used as a simple generative model to sample new data points from the estimated density.
While KDE itself is not a standalone "algorithm" like decision trees or SVMs, it is a fundamental tool used within many algorithms for density estimation, clustering, anomaly detection, and more.
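As a small illustration of the first idea (density-based anomaly detection), one could flag the lowest-density values of X4. This is only a sketch: the bandwidth of 0.5 and the 1% quantile threshold are arbitrary choices, and KernelDensity is the sklearn class already imported above:
# Sketch: KDE-based anomaly flagging on X4 (bandwidth and 1% threshold are arbitrary choices)
kde_ad = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X4.reshape(-1, 1))
density_ad = np.exp(kde_ad.score_samples(X4.reshape(-1, 1)))
threshold = np.quantile(density_ad, 0.01)  # flag the lowest-density 1% of points
anomalies = X4[density_ad < threshold]
print(f'Flagged {len(anomalies)} low-density points out of {len(X4)}')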
Let's show it¶
# Define Gaussian kernel function
def gaussian_kernel(x, xi, bandwidth):
    return (1 / (bandwidth * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
# Function to plot the KDE of a variable together with some of the
# individual Gaussian "bumps" (kernels) that are summed to form it
def plot_kde_with_bumps(data, bandwidth=0.3):
    x_vals = np.linspace(min(data) - 1, max(data) + 1, 500)
    # Select 50 random points to draw individual bumps for
    np.random.seed(42)  # for reproducibility
    selected_points = np.random.choice(data, size=50, replace=False)
    bumps = np.array([gaussian_kernel(x_vals, xi, bandwidth) for xi in selected_points])
    # The KDE is the average of the kernels placed on every data point
    kde = np.array([gaussian_kernel(x_vals, xi, bandwidth) for xi in data]).sum(axis=0) / len(data)
    plt.figure(figsize=(12, 6))
    plt.plot(x_vals, kde, color='blue', lw=2, label='KDE')
    max_kde = kde.max()
    for bump in bumps:
        scaled_bump = bump * max_kde * 0.4  # scale bumps for visibility
        plt.fill_between(x_vals, 0, scaled_bump, color='gray', alpha=0.3)
    sns.rugplot(data, color='black')
    plt.title('KDE with 50 Gaussian Bumps Inside for Non-Parametric Variable X4')
    plt.xlabel('Value')
    plt.ylabel('Density')
    plt.legend()
    plt.show()
plot_kde_with_bumps(X4)
Using a Gaussian kernel in KDE does not mean you're assuming the data is Gaussian — you're just using a Gaussian-shaped function to help estimate the unknown density in a flexible, non-parametric way.
If a variable has a non-parametric distribution, and we estimate its density using KDE with a Gaussian kernel, how does that help us use it in linear regression as an independent variable? What’s the benefit?¶
- KDE is for estimating the distribution of a variable — not for transforming or modeling the variable directly in regression.
- So you don't need to apply KDE to an independent variable before using it in linear regression. But KDE can still be useful in two key ways:
1-) Understanding the Marginal Distribution of the Variable
- If you're unsure of your predictor’s distribution (e.g., skewed, multimodal), KDE helps visualize or test this.
- This can inform: whether to apply transformations (e.g., log, Box-Cox), and whether linear regression assumptions (like homoscedasticity and linearity) might be violated (a small screening sketch follows this list).
2-) Feature Engineering via Density Estimates (Advanced Use): one could use KDE as a feature generator. For example, if a variable X has a strange distribution, you could compute the estimated density at each point and use that as an additional input in your regression.
This can be useful if:
- You want to give the model access to how common or rare a value is.
- You're doing anomaly detection, probability modeling, or semi-parametric modeling.
3-) What About Linear Regression? Linear regression doesn’t care if your independent variable has a non-parametric distribution. It only assumes:
- The relationship between independent and dependent variables is linear
- The residuals (errors) are normally distributed (for valid inference)
So:
- A skewed or weirdly-shaped independent variable is fine.
- You don’t need to apply KDE to that variable before regression.
Instead, KDE helps you analyze, explore, or preprocess the variable before putting it in the model.
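For point 1 above, here is a minimal sketch of such a screening step. It uses scipy.stats.skew, which is an extra import beyond the ones at the top, and the |skew| > 1 cut-off is only a rough rule of thumb, not a hard rule:
# Sketch: quick skewness screen to decide which variables might need a transform
# (scipy.stats.skew is an additional import; |skew| > 1 is only a rough rule of thumb)
from scipy.stats import skew
for name, values in [('target_skewed', target_skewed), ('target_gaussian', target_gaussian),
                     ('X1', X1), ('X2', X2), ('X3', X3), ('X4', X4)]:
    s = skew(values)
    flag = 'consider a transform' if abs(s) > 1 else 'roughly symmetric'
    print(f'{name:>16}: skew = {s:+.2f} ({flag})')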
Let's Do the Modelling¶
Let's first check the correlation heatmap to assess linearity¶
# Combine all variables into a DataFrame
data = pd.DataFrame({
'target_skewed': target_skewed,
'target_gaussian': target_gaussian,
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4
})
# Calculate correlation matrix
corr_matrix = data.corr()
# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Targets and Features')
plt.show()
# Linear regression and residual analysis function
def linear_regression_residuals(X, y):
    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)
    residuals = y - y_pred
    # Calculate metrics
    r2 = r2_score(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    rmse = np.sqrt(mse)
    print(f'R² Score: {r2:.4f}')
    print(f'Mean Squared Error (MSE): {mse:.4f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
    # Plot residuals histogram
    plt.figure(figsize=(6, 4))
    sns.histplot(residuals, kde=True, color='red')
    plt.title('Residuals Distribution')
    plt.xlabel('Residuals')
    plt.ylabel('Frequency')
    plt.show()
    # Normality test (Shapiro-Wilk)
    stat, p_value = shapiro(residuals)
    print(f'Shapiro-Wilk Test: Statistic={stat:.4f}, p-value={p_value:.4f}')
    if p_value > 0.05:
        print('Residuals appear to be normally distributed (fail to reject H0)')
    else:
        print('Residuals do not appear to be normally distributed (reject H0)')
    return residuals
# Apply linear regression and residual analysis for right skewed target
print('Right Skewed Target Model Residuals Analysis:')
residuals_skewed = linear_regression_residuals(X, target_skewed)
# Apply linear regression and residual analysis for gaussian target
print('\nGaussian Target Model Residuals Analysis:')
residuals_gaussian = linear_regression_residuals(X, target_gaussian)
Right Skewed Target Model Residuals Analysis:
R² Score: 0.1562
Mean Squared Error (MSE): 24.7569
Root Mean Squared Error (RMSE): 4.9756
Shapiro-Wilk Test: Statistic=0.2556, p-value=0.0000
Residuals do not appear to be normally distributed (reject H0)

Gaussian Target Model Residuals Analysis:
R² Score: 0.6307
Mean Squared Error (MSE): 0.4278
Root Mean Squared Error (RMSE): 0.6540
Shapiro-Wilk Test: Statistic=0.9972, p-value=0.5608
Residuals appear to be normally distributed (fail to reject H0)
Comment: When you apply linear regression to the two target variables, you can see that one satisfies the assumptions of linear regression and the other (the first one above) does not. So let's play with the skewed target variable: its KDE plot tells me a transformation is needed, so I will apply a Box-Cox transformation (since it is highly right-skewed), plot it, and then check the results.
# Apply Box-Cox transformation (Box-Cox requires strictly positive data)
target_skewed_positive = target_skewed + 1e-6  # small shift as a safeguard (the exponentiated values are already positive)
transformed, lambda_bc = boxcox(target_skewed_positive)
# Plot distributions
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(target_skewed, kde=True, ax=axs[0], color='orange')
axs[0].set_title('Original Right Skewed Target Distribution')
sns.histplot(transformed, kde=True, ax=axs[1], color='blue')
axs[1].set_title(f'Box-Cox Transformed Target Distribution (lambda={lambda_bc:.3f})')
plt.tight_layout()
plt.show()
# Create DataFrame for correlation
corr_df = pd.DataFrame({
'Original_Skewed': target_skewed,
'BoxCox_Transformed': transformed,
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4
})
# Correlation matrix
corr_matrix = corr_df.corr()
# Plot heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap: Original vs Box-Cox Transformed')
plt.show()
# Apply linear regression and residual analysis for right skewed target
print('Right Skewed Target Model Residuals Analysis:')
residuals_skewed = linear_regression_residuals(X, target_skewed)
# Apply linear regression and residual analysis for the Box-Cox transformed target
print('\nBox-Cox Transformed Target Model Residuals Analysis:')
residuals_box_cox = linear_regression_residuals(X, transformed)
Right Skewed Target Model Residuals Analysis:
R² Score: 0.1562
Mean Squared Error (MSE): 24.7569
Root Mean Squared Error (RMSE): 4.9756
Shapiro-Wilk Test: Statistic=0.2556, p-value=0.0000
Residuals do not appear to be normally distributed (reject H0)

Box-Cox Transformed Target Model Residuals Analysis:
R² Score: 0.6372
Mean Squared Error (MSE): 0.4049
Root Mean Squared Error (RMSE): 0.6363
Shapiro-Wilk Test: Statistic=0.9978, p-value=0.7636
Residuals appear to be normally distributed (fail to reject H0)
Comment: As you can see, with the Box-Cox transformation we again obtain a linear regression model that satisfies the assumptions. In other words, we used KDE to understand what our data was telling us, how to deal with it, and how to use it for our purposes. Also compare carefully the regression results for the Gaussian target and the Box-Cox transformed target.
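One practical note: when the model is fitted on the Box-Cox scale, predictions usually have to be mapped back to the original target scale before reporting. A minimal sketch, using scipy.special.inv_boxcox (an extra import not used elsewhere in this notebook; bc_model is just an illustrative refit):
# Sketch: back-transform predictions from the Box-Cox scale to the original target scale
# (inv_boxcox is an additional import; bc_model is just an illustrative refit)
from scipy.special import inv_boxcox
bc_model = LinearRegression().fit(X, transformed)
pred_bc_scale = bc_model.predict(X)
pred_original_scale = inv_boxcox(pred_bc_scale, lambda_bc) - 1e-6  # undo the small shift added earlier
print('First 5 predictions on the original scale:', np.round(pred_original_scale[:5], 3))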
Let's also play with the non-parametric X4 variable and see how we could use it in our linear model:
- Estimate the density values at each data point:
Use KDE to compute the estimated density (probability density function value) at each X4 data point. This density value reflects how common or rare that value is in the distribution.
- Use the KDE density values as a new feature:
Add these density estimates as an additional feature in your model. Points in dense regions get higher values, and points in sparse regions get lower values, which can help the model understand the data distribution better.
- Optionally, evaluate KDE at new points:
If you want to predict on new data, you can evaluate the KDE at those new X4 values to get their density feature (a small example follows the code below).
# X4 is your 1D non-parametric variable, shape (n_samples,)
X4_reshaped = X4.reshape(-1, 1)
# Fit KDE model on X4
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X4_reshaped)
# Compute log density for each point
log_density = kde.score_samples(X4_reshaped)
# Convert log density to density
density_feature = np.exp(log_density)
# Plot distribution of KDE feature
plt.figure(figsize=(8, 5))
sns.histplot(density_feature, kde=True, color='teal')
plt.title('Distribution of KDE Feature from X4')
plt.xlabel('KDE Density Value')
plt.ylabel('Frequency')
plt.show()
# Now density_feature can be added as a new feature to your dataset
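As mentioned in the third bullet above, the same fitted kde object can also score previously unseen X4 values, so the density feature can be computed for new data at prediction time. The sample values below are arbitrary illustrative numbers:
# Sketch: evaluate the fitted KDE at new, unseen X4 values (the values are arbitrary examples)
new_x4_values = np.array([-2.0, 0.0, 1.5, 4.0]).reshape(-1, 1)
new_density_feature = np.exp(kde.score_samples(new_x4_values))
print('Density feature for new X4 values:', np.round(new_density_feature, 4))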
# Create DataFrame for correlation
corr_df = pd.DataFrame({
'Target_Skewed': target_skewed,
'Target_Gaussian': target_gaussian,
'Target_BoxCox': transformed,
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4,
'X4_KDE_Feature': density_feature
})
# Correlation matrix
corr_matrix = corr_df.corr()
# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap: Targets and Predictors (including KDE feature)')
plt.show()
Comment: Let's fit the linear regression model again, discarding X4 and adding the X4_KDE_Feature variable instead.
# Combine features into a matrix
X_kde = np.vstack([X1, X2, X3, density_feature]).T
# Apply linear regression and residual analysis for right skewed target
print('Right Skewed Target Model Residuals Analysis:')
residuals_skewed = linear_regression_residuals(X_kde, target_skewed)
# Apply linear regression and residual analysis for the Box-Cox transformed target
print('\nBox-Cox Transformed Target Model Residuals Analysis:')
residuals_box_cox = linear_regression_residuals(X_kde, transformed)
Right Skewed Target Model Residuals Analysis:
R² Score: 0.1598
Mean Squared Error (MSE): 24.6512
Root Mean Squared Error (RMSE): 4.9650
Shapiro-Wilk Test: Statistic=0.2593, p-value=0.0000
Residuals do not appear to be normally distributed (reject H0)

Box-Cox Transformed Target Model Residuals Analysis:
R² Score: 0.6372
Mean Squared Error (MSE): 0.4050
Root Mean Squared Error (RMSE): 0.6364
Shapiro-Wilk Test: Statistic=0.9978, p-value=0.7640
Residuals appear to be normally distributed (fail to reject H0)
Comment: The results are the same, or very slightly improved. Let's apply XGBoost to finalize the study and check the feature importances.
XGBoost Regression Assumptions
XGBoost is a powerful gradient boosting algorithm with the following key assumptions and characteristics:
- Additive model: builds an ensemble of weak learners (usually decision trees) sequentially, each correcting the errors of the previous one.
- No strict distribution assumptions: unlike linear regression, XGBoost does not assume linearity or normality of residuals.
- Feature independence: while not strictly required, uncorrelated features often help model performance.
- Sufficient data: works best with enough data to learn complex patterns.
- Hyperparameter tuning: performance depends on tuning parameters like learning rate, tree depth, and regularization (a minimal tuning sketch follows this list).
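For the last point, a minimal tuning sketch could look like the following before settling on the fixed settings used below. GridSearchCV is an extra sklearn import, and the parameter grid values are arbitrary illustrative choices, not recommendations:
# Sketch: small hyperparameter grid search for XGBRegressor
# (GridSearchCV is an additional import; the grid values are arbitrary examples)
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
}
grid_search = GridSearchCV(XGBRegressor(random_state=42), param_grid, cv=3, scoring='r2')
grid_search.fit(np.column_stack([X1, X2, X3, density_feature]), target_gaussian)
print('Best parameters:', grid_search.best_params_)
print('Best CV R² score:', round(grid_search.best_score_, 4))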
# Prepare feature matrix and targets
X = np.column_stack([X1, X2, X3, density_feature])
targets = {
'Gaussian': target_gaussian,
'Skewed': target_skewed,
'BoxCox': transformed
}
# Function to train, predict and evaluate XGBoost model
def train_evaluate_xgb(X, y, target_name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = XGBRegressor(random_state=42, n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print(f'--- {target_name} Target ---')
    print(f'R² Score: {r2:.4f}')
    print(f'Mean Squared Error (MSE): {mse:.4f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.4f}\n')
    # Feature importance plot
    plt.figure(figsize=(8, 4))
    sns.barplot(x=model.feature_importances_, y=['X1', 'X2', 'X3', 'X4_KDE'])
    plt.title(f'Feature Importances for {target_name} Target')
    plt.xlabel('Importance')
    plt.show()
# Run for each target
for name, y in targets.items():
    train_evaluate_xgb(X, y, name)
--- Gaussian Target ---
R² Score: 0.5490
Mean Squared Error (MSE): 0.5951
Root Mean Squared Error (RMSE): 0.7714

--- Skewed Target ---
R² Score: 0.0723
Mean Squared Error (MSE): 110.3246
Root Mean Squared Error (RMSE): 10.5036

--- BoxCox Target ---
R² Score: 0.3718
Mean Squared Error (MSE): 0.6480
Root Mean Squared Error (RMSE): 0.8050
Comment: It is instructive to see that the XGBoost model underperforms the linear regression model. Even though XGBoost is a powerful algorithm, our data told us that the relationships are linear or near-linear, so applying a non-linear algorithm is unnecessary in this case. On the other hand, look at the X4_KDE variable from the non-linear perspective: its importance increased. Still, for this dataset it is not needed.
Final Result: Always look at, check and play with your data; do not just trust the algorithms themselves. After this, you will not find it strange when plotting a KDE plot with seaborn or another library, and do not forget that it shows us the underlying distribution of our variable. Thanks