Why Do We Need to Look at Distributions Before Setting Up a Model?¶
As many data scientists know, unless you are an "algorithm data scientist" who feeds the data straight into an ML algorithm and takes the results, the modelling process is dominated by preparation: checking nulls and outliers, imputing or discarding them (depending on your data), checking statistical assumptions (tests etc.), plotting distributions, feature engineering, feature selection and other work. On average, these steps take about 70% of the time.
So apart from the other data steps mentioned above, I think the crucial first step is to look at your data: plot your variables and examine their density plots.
Why?
Because once you have an idea of what your data is telling you, you can prepare it (with your business problem in mind) for modelling with ML algorithms, whether linear or non-linear (provided you also know the assumptions of those algorithms).
So let's dive into a dataset that I created artificially below and interpret what it tells us.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro
from scipy.stats import boxcox
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.neighbors import KernelDensity
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
# Seed for reproducibility
np.random.seed(42)
# Create correlated variables
# Base Gaussian variable
base_gaussian = np.random.normal(loc=0, scale=1, size=500)
# Targets
# Right skewed target correlated with base_gaussian (add noise)
target_skewed = np.exp(base_gaussian + np.random.normal(scale=0.5, size=500))
# Gaussian target correlated with base_gaussian (add noise)
target_gaussian = base_gaussian + np.random.normal(scale=0.5, size=500)
# Independent variables
# One Gaussian correlated with base_gaussian
X1 = base_gaussian + np.random.normal(scale=0.5, size=500)
# Two more random features from other parametric distributions (less correlated with the targets)
X2 = np.random.uniform(low=-3, high=3, size=500)
X3 = np.random.beta(a=2, b=5, size=500) * 6 - 3
# Create an arbitrary empirical distribution (non-parametric)
empirical_data = np.concatenate([
np.random.normal(0, 1, 200),
np.random.uniform(3, 5, 150),
np.random.beta(2, 5, 150) * 5
])
# Sample from this empirical distribution with replacement
X4 = np.random.choice(empirical_data, size=500, replace=True)
# Combine features into a matrix
X = np.vstack([X1, X2, X3, X4]).T
# Plot distributions of target variables and features
fig, axs = plt.subplots(2, 4, figsize=(18, 10))
sns.histplot(target_skewed, kde=True, ax=axs[0, 0], color='orange')
axs[0, 0].set_title('Right Skewed Target Distribution')
sns.histplot(target_gaussian, kde=True, ax=axs[0, 1], color='blue')
axs[0, 1].set_title('Gaussian Target Distribution')
sns.histplot(X1, kde=True, ax=axs[0, 2], color='green')
axs[0, 2].set_title('Gaussian Feature X1 Distribution')
sns.histplot(X2, kde=True, ax=axs[0, 3], color='purple')
axs[0, 3].set_title('Uniform Feature X2 Distribution')
sns.histplot(X3, kde=True, ax=axs[1, 0], color='brown')
axs[1, 0].set_title('Beta Feature X3 Distribution')
sns.histplot(X4, kde=True, ax=axs[1, 1], color='magenta')
axs[1, 1].set_title('Non Parametric Feature X4 Distribution')
# Hide unused subplots
axs[1, 2].axis('off')
axs[1, 3].axis('off')
plt.tight_layout()
plt.show()
Looking at the plots: I created two target variables, one right-skewed and one Gaussian. The four independent variables (predictors) have a mix of Gaussian and non-Gaussian forms.
- So can I apply a simple linear regression model to them?
- What are the assumptions of linear regression?
- I will not mention all of the assumptions here, only two:
- 1-) Homoscedasticity, or homogeneity of variances, is the assumption of equal or similar variances across the groups being compared (a quick visual check is sketched after this list).
- 2-) The other is the linear relationship between the target variable and the predictors; once we fit the model, we should expect the residuals to be approximately Gaussian. This is the important part.
- What do these graphs tell us?
- Transformation (of the right-skewed target variable)?
- Outlier handling?
- Imputation and other ideas?
- Discarding the uniform variable?
You have to answer all of this yourself before applying your model.
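As a quick visual check of the homoscedasticity point above, one can plot residuals against fitted values and look for a roughly constant vertical spread. This is only a minimal sketch, reusing the data generated above and the LinearRegression class already imported; the check_ variable names are just illustrative:
# Sketch: residuals vs. fitted values for the Gaussian target (a constant spread suggests homoscedasticity)
check_model = LinearRegression().fit(X, target_gaussian)
check_fitted = check_model.predict(X)
check_resid = target_gaussian - check_fitted
plt.figure(figsize=(6, 4))
plt.scatter(check_fitted, check_resid, alpha=0.4)
plt.axhline(0, color='red', lw=1)
plt.title('Residuals vs Fitted Values (Gaussian target)')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()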
Some of these distributions are parametric (for example X1, X2, X3) and one is non-parametric (X4). Why am I talking about this? Because in real-life scenarios we rarely deal with known parametric distributions; we mostly deal with highly non-parametric ones.
So how do we use these graphs?
This is where "Kernel Density Estimation" comes into play.
So what is "Kernel Density Estimation"?
1-) What is a Non-Parametric Distribution?
- A non-parametric distribution means we don’t assume any predefined functional form (like Gaussian, exponential, etc.) for the distribution of the variable. Instead, we let the data speak for itself.
2-) How to Estimate the Density Function Non-Parametrically?
- Using Kernel Density Estimation (KDE).
- KDE is a smoothing technique that estimates the probability density function (PDF) from data without assuming a parametric form.
The KDE at a point ( x ) is defined as:
$$ \hat{f}(x) = \frac{1}{n h} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right) $$ where:
- $n$ is the number of data points,
- $h$ is the bandwidth (smoothing parameter),
- $K$ is the kernel function (e.g., the Gaussian kernel),
- $x_i$ are the observed data points.
3-) What is $x$ here?
- Let's say you have some data: data = [2.1, 2.4, 2.6, 3.0, 3.2].
- If you want to estimate the density at x = 2.5, you plug x = 2.5 into the KDE formula (see the small numeric example after this list).
- The kernel function measures how close each data point $x_i$ is to your target point $x$, scaled by the bandwidth $h$.
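As a tiny worked example of the formula above (the data points and the evaluation point x = 2.5 are the illustrative values from this list; the bandwidth h = 0.5 is an arbitrary choice):
# Toy example: evaluate the KDE formula by hand at x = 2.5 (h = 0.5 is an arbitrary choice)
toy_data = np.array([2.1, 2.4, 2.6, 3.0, 3.2])
x_point, h = 2.5, 0.5
u = (x_point - toy_data) / h
kernel_vals = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
f_hat = kernel_vals.sum() / (len(toy_data) * h)
print(f'Estimated density at x = {x_point}: {f_hat:.3f}')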
4-) Intuition:
- If many data points lie close to $x$, the estimated density $\hat{f}(x)$ will be high; if few data points are near $x$, it will be low.
- The bandwidth $h$ controls the smoothing: a large $h$ gives a smoother curve, a small $h$ keeps more detail.
5-) If we are not assuming a parametric form for the data, why do we use, for example, a Gaussian kernel function, even though we already know that our data, or some of our variables, are not Gaussian? (There are many kernel functions; I mention the Gaussian kernel because it is the easiest for most people to understand.)
We don’t assume the data follows a Gaussian distribution, but we use a Gaussian-shaped weighting function (kernel) as a mathematical tool to estimate the density. It's not assuming the underlying data is Gaussian — it’s using a Gaussian-shaped curve to “spread” each data point’s influence locally.
When doing non-parametric estimation, like KDE, the goal is: Estimate the shape of the underlying density without assuming a global functional form (like normal, exponential, etc.). But to compute the estimate, we still need some local smoothing mechanism — that’s what the kernel function is for. The kernel is just a window function used to spread the influence of each data point. It could be Gaussian, Epanechnikov, Uniform, or others. The choice of kernel has minor impact — what matters most is the bandwidth h.
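To make that concrete, here is a minimal sketch (using the KernelDensity class already imported from sklearn; the bandwidths 0.2 and 1.0 are arbitrary illustrative values) that estimates the density of X4 with two different kernels and two different bandwidths. Swapping the kernel barely changes the curve, while changing the bandwidth visibly does:
# Sketch: effect of kernel choice vs. bandwidth on the KDE of X4 (bandwidth values are arbitrary)
grid = np.linspace(X4.min() - 1, X4.max() + 1, 300).reshape(-1, 1)
plt.figure(figsize=(10, 4))
for kernel_name, bw, style in [('gaussian', 0.2, '-'), ('epanechnikov', 0.2, '--'),
                               ('gaussian', 1.0, '-'), ('epanechnikov', 1.0, '--')]:
    kde_tmp = KernelDensity(kernel=kernel_name, bandwidth=bw).fit(X4.reshape(-1, 1))
    plt.plot(grid.ravel(), np.exp(kde_tmp.score_samples(grid)), style, label=f'{kernel_name}, h={bw}')
plt.title('Kernel choice changes little, bandwidth changes a lot')
plt.legend()
plt.show()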
6-) Think of it like this:
- Imagine you place a little “bump” (bell curve) centered at each data point.
- Those bumps can be Gaussian, triangular, etc.
- Then you add up all those bumps → this gives you a smooth approximation of the PDF.
- The bump shape (i.e., kernel) is not a claim about your data's distribution, just a tool for estimating the density.
7-) KDE-Based Algorithms:
- Kernel density-based anomaly detection: KDE can be used to estimate the density of normal data points. Points falling in low-density regions (low KDE values) are flagged as anomalies or outliers (sketched below).
- Mean shift clustering: a clustering algorithm that uses KDE to find modes (peaks) in the data distribution. It iteratively shifts points towards the nearest high-density region estimated by KDE.
- Non-parametric Bayesian methods: some Bayesian non-parametric models use KDE-like approaches to estimate distributions without assuming parametric forms.
- Density-based classification: KDE can be used to estimate class-conditional densities in a non-parametric way, which can then be used for classification (e.g., Parzen window classifiers).
- Generative models: KDE can be used as a simple generative model to sample new data points from the estimated density.
While KDE itself is not a standalone "algorithm" like decision trees or SVMs, it is a fundamental tool used within many algorithms for density estimation, clustering, anomaly detection, and more.
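As a small illustration of the first idea (density-based anomaly detection), one could flag the lowest-density values of X4. This is only a sketch: the bandwidth of 0.5 and the 1% quantile threshold are arbitrary choices, and KernelDensity is the sklearn class already imported above:
# Sketch: KDE-based anomaly flagging on X4 (bandwidth and 1% threshold are arbitrary choices)
kde_ad = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X4.reshape(-1, 1))
density_ad = np.exp(kde_ad.score_samples(X4.reshape(-1, 1)))
threshold = np.quantile(density_ad, 0.01)  # flag the lowest-density 1% of points
anomalies = X4[density_ad < threshold]
print(f'Flagged {len(anomalies)} low-density points out of {len(X4)}')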
Let's show it¶
# Define Gaussian kernel function
def gaussian_kernel(x, xi, bandwidth):
    return (1 / (bandwidth * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
# Function to plot the KDE of a variable together with some of the
# individual Gaussian "bumps" (kernels) that are summed to form it
def plot_kde_with_bumps(data, bandwidth=0.3):
    x_vals = np.linspace(min(data) - 1, max(data) + 1, 500)
    # Select 50 random points to draw individual bumps for
    np.random.seed(42)  # for reproducibility
    selected_points = np.random.choice(data, size=50, replace=False)
    bumps = np.array([gaussian_kernel(x_vals, xi, bandwidth) for xi in selected_points])
    # The KDE is the average of the kernels placed on every data point
    kde = np.array([gaussian_kernel(x_vals, xi, bandwidth) for xi in data]).sum(axis=0) / len(data)
    plt.figure(figsize=(12, 6))
    plt.plot(x_vals, kde, color='blue', lw=2, label='KDE')
    max_kde = kde.max()
    for bump in bumps:
        scaled_bump = bump * max_kde * 0.4  # scale bumps for visibility
        plt.fill_between(x_vals, 0, scaled_bump, color='gray', alpha=0.3)
    sns.rugplot(data, color='black')
    plt.title('KDE with 50 Gaussian Bumps Inside for Non-Parametric Variable X4')
    plt.xlabel('Value')
    plt.ylabel('Density')
    plt.legend()
    plt.show()
plot_kde_with_bumps(X4)
Using a Gaussian kernel in KDE does not mean you're assuming the data is Gaussian — you're just using a Gaussian-shaped function to help estimate the unknown density in a flexible, non-parametric way.
If a variable has a non-parametric distribution, and we estimate its density using KDE with a Gaussian kernel, how does that help us use it in linear regression as an independent variable? What’s the benefit?¶
- KDE is for estimating the distribution of a variable — not for transforming or modeling the variable directly in regression.
- So you don't need to apply KDE to an independent variable before using it in linear regression. But KDE can still be useful in two key ways:
1-) Understanding the Marginal Distribution of the Variable
- If you're unsure of your predictor’s distribution (e.g., skewed, multimodal), KDE helps visualize or test this.
- This can inform: whether to apply transformations (e.g., log, Box-Cox), and whether linear regression assumptions (like homoscedasticity and linearity) might be violated (a small screening sketch follows this list).
2-) Feature Engineering via Density Estimates (Advanced Use): one could use KDE as a feature generator. For example, if a variable X has a strange distribution, you could compute the estimated density at each point and use that as an additional input in your regression.
This can be useful if:
- You want to give the model access to how common or rare a value is.
- You're doing anomaly detection, probability modeling, or semi-parametric modeling.
3-) What About Linear Regression? Linear regression doesn’t care if your independent variable has a non-parametric distribution. It only assumes:
- The relationship between independent and dependent variables is linear
- The residuals (errors) are normally distributed (for valid inference)
So:
- A skewed or weirdly-shaped independent variable is fine.
- You don’t need to apply KDE to that variable before regression.
Instead, KDE helps you analyze, explore, or preprocess the variable before putting it in the model.
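For point 1 above, here is a minimal sketch of such a screening step. It uses scipy.stats.skew, which is an extra import beyond the ones at the top, and the |skew| > 1 cut-off is only a rough rule of thumb, not a hard rule:
# Sketch: quick skewness screen to decide which variables might need a transform
# (scipy.stats.skew is an additional import; |skew| > 1 is only a rough rule of thumb)
from scipy.stats import skew
for name, values in [('target_skewed', target_skewed), ('target_gaussian', target_gaussian),
                     ('X1', X1), ('X2', X2), ('X3', X3), ('X4', X4)]:
    s = skew(values)
    flag = 'consider a transform' if abs(s) > 1 else 'roughly symmetric'
    print(f'{name:>16}: skew = {s:+.2f} ({flag})')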
Let's Do the Modelling¶
Let's first check the correlation heatmap to assess linearity¶
# Combine all variables into a DataFrame
data = pd.DataFrame({
'target_skewed': target_skewed,
'target_gaussian': target_gaussian,
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4
})
# Calculate correlation matrix
corr_matrix = data.corr()
# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Targets and Features')
plt.show()
# Linear regression and residual analysis function
def linear_regression_residuals(X, y):
    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)
    residuals = y - y_pred
    # Calculate metrics
    r2 = r2_score(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    rmse = np.sqrt(mse)
    print(f'R² Score: {r2:.4f}')
    print(f'Mean Squared Error (MSE): {mse:.4f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
    # Plot residuals histogram
    plt.figure(figsize=(6, 4))
    sns.histplot(residuals, kde=True, color='red')
    plt.title('Residuals Distribution')
    plt.xlabel('Residuals')
    plt.ylabel('Frequency')
    plt.show()
    # Normality test (Shapiro-Wilk)
    stat, p_value = shapiro(residuals)
    print(f'Shapiro-Wilk Test: Statistic={stat:.4f}, p-value={p_value:.4f}')
    if p_value > 0.05:
        print('Residuals appear to be normally distributed (fail to reject H0)')
    else:
        print('Residuals do not appear to be normally distributed (reject H0)')
    return residuals
# Apply linear regression and residual analysis for right skewed target
print('Right Skewed Target Model Residuals Analysis:')
residuals_skewed = linear_regression_residuals(X, target_skewed)
# Apply linear regression and residual analysis for gaussian target
print('\nGaussian Target Model Residuals Analysis:')
residuals_gaussian = linear_regression_residuals(X, target_gaussian)
Right Skewed Target Model Residuals Analysis:
R² Score: 0.1562
Mean Squared Error (MSE): 24.7569
Root Mean Squared Error (RMSE): 4.9756
Shapiro-Wilk Test: Statistic=0.2556, p-value=0.0000
Residuals do not appear to be normally distributed (reject H0)

Gaussian Target Model Residuals Analysis:
R² Score: 0.6307
Mean Squared Error (MSE): 0.4278
Root Mean Squared Error (RMSE): 0.6540
Shapiro-Wilk Test: Statistic=0.9972, p-value=0.5608
Residuals appear to be normally distributed (fail to reject H0)
Comment: When you apply linear regression to the two target variables, you can see that one satisfies the assumptions of linear regression and the other (the first one above) does not. So let's play with the skewed target variable: its KDE plot tells me a transformation is needed, so I will apply a Box-Cox transformation (since it is highly right-skewed), plot it, and then check the results.
# Apply Box-Cox transformation (Box-Cox requires strictly positive data)
target_skewed_positive = target_skewed + 1e-6  # small shift as a safeguard (the exponentiated values are already positive)
transformed, lambda_bc = boxcox(target_skewed_positive)
# Plot distributions
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(target_skewed, kde=True, ax=axs[0], color='orange')
axs[0].set_title('Original Right Skewed Target Distribution')
sns.histplot(transformed, kde=True, ax=axs[1], color='blue')
axs[1].set_title(f'Box-Cox Transformed Target Distribution (lambda={lambda_bc:.3f})')
plt.tight_layout()
plt.show()
# Create DataFrame for correlation
corr_df = pd.DataFrame({
'Original_Skewed': target_skewed,
'BoxCox_Transformed': transformed,
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4
})
# Correlation matrix
corr_matrix = corr_df.corr()
# Plot heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap: Original vs Box-Cox Transformed')
plt.show()
# Apply linear regression and residual analysis for right skewed target
print('Right Skewed Target Model Residuals Analysis:')
residuals_skewed = linear_regression_residuals(X, target_skewed)
# Apply linear regression and residual analysis for the Box-Cox transformed target
print('\nBox-Cox Transformed Target Model Residuals Analysis:')
residuals_box_cox = linear_regression_residuals(X, transformed)
Right Skewed Target Model Residuals Analysis:
R² Score: 0.1562
Mean Squared Error (MSE): 24.7569
Root Mean Squared Error (RMSE): 4.9756
Shapiro-Wilk Test: Statistic=0.2556, p-value=0.0000
Residuals do not appear to be normally distributed (reject H0)

Box-Cox Transformed Target Model Residuals Analysis:
R² Score: 0.6372
Mean Squared Error (MSE): 0.4049
Root Mean Squared Error (RMSE): 0.6363
Shapiro-Wilk Test: Statistic=0.9978, p-value=0.7636
Residuals appear to be normally distributed (fail to reject H0)
Comment: As you can see, with the Box-Cox transformation we again obtain a linear regression model that satisfies the assumptions. In other words, we used KDE to understand what our data was telling us, how to deal with it, and how to use it for our purposes. Also compare carefully the regression results for the Gaussian target and the Box-Cox transformed target.
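One practical note: when the model is fitted on the Box-Cox scale, predictions usually have to be mapped back to the original target scale before reporting. A minimal sketch, using scipy.special.inv_boxcox (an extra import not used elsewhere in this notebook; bc_model is just an illustrative refit):
# Sketch: back-transform predictions from the Box-Cox scale to the original target scale
# (inv_boxcox is an additional import; bc_model is just an illustrative refit)
from scipy.special import inv_boxcox
bc_model = LinearRegression().fit(X, transformed)
pred_bc_scale = bc_model.predict(X)
pred_original_scale = inv_boxcox(pred_bc_scale, lambda_bc) - 1e-6  # undo the small shift added earlier
print('First 5 predictions on the original scale:', np.round(pred_original_scale[:5], 3))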
Let's also play with the non-parametric X4 variable and see how we could use it in our linear model:
- Estimate the density values at each data point:
Use KDE to compute the estimated density (probability density function value) at each X4 data point. This density value reflects how common or rare that value is in the distribution.
- Use the KDE density values as a new feature:
Add these density estimates as an additional feature in your model. Points in dense regions get higher values, and points in sparse regions get lower values, which can help the model understand the data distribution better.
- Optionally, evaluate KDE at new points:
If you want to predict on new data, you can evaluate the KDE at those new X4 values to get their density feature (a small example follows the code below).
# X4 is your 1D non-parametric variable, shape (n_samples,)
X4_reshaped = X4.reshape(-1, 1)
# Fit KDE model on X4
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X4_reshaped)
# Compute log density for each point
log_density = kde.score_samples(X4_reshaped)
# Convert log density to density
density_feature = np.exp(log_density)
# Plot distribution of KDE feature
plt.figure(figsize=(8, 5))
sns.histplot(density_feature, kde=True, color='teal')
plt.title('Distribution of KDE Feature from X4')
plt.xlabel('KDE Density Value')
plt.ylabel('Frequency')
plt.show()
# Now density_feature can be added as a new feature to your dataset
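As mentioned in the third bullet above, the same fitted kde object can also score previously unseen X4 values, so the density feature can be computed for new data at prediction time. The sample values below are arbitrary illustrative numbers:
# Sketch: evaluate the fitted KDE at new, unseen X4 values (the values are arbitrary examples)
new_x4_values = np.array([-2.0, 0.0, 1.5, 4.0]).reshape(-1, 1)
new_density_feature = np.exp(kde.score_samples(new_x4_values))
print('Density feature for new X4 values:', np.round(new_density_feature, 4))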
# Create DataFrame for correlation
corr_df = pd.DataFrame({
'Target_Skewed': target_skewed,
'Target_Gaussian': target_gaussian,
'Target_BoxCox': transformed,
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4,
'X4_KDE_Feature': density_feature
})
# Correlation matrix
corr_matrix = corr_df.corr()
# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap: Targets and Predictors (including KDE feature)')
plt.show()
Comment: Let's fit the linear regression model again, discarding X4 and adding the X4_KDE_Feature variable instead.
# Combine features into a matrix
X_kde = np.vstack([X1, X2, X3, density_feature]).T
# Apply linear regression and residual analysis for right skewed target
print('Right Skewed Target Model Residuals Analysis:')
residuals_skewed = linear_regression_residuals(X_kde, target_skewed)
# Apply linear regression and residual analysis for the Box-Cox transformed target
print('\nBox-Cox Transformed Target Model Residuals Analysis:')
residuals_box_cox = linear_regression_residuals(X_kde, transformed)
Right Skewed Target Model Residuals Analysis:
R² Score: 0.1598
Mean Squared Error (MSE): 24.6512
Root Mean Squared Error (RMSE): 4.9650
Shapiro-Wilk Test: Statistic=0.2593, p-value=0.0000
Residuals do not appear to be normally distributed (reject H0)

Box-Cox Transformed Target Model Residuals Analysis:
R² Score: 0.6372
Mean Squared Error (MSE): 0.4050
Root Mean Squared Error (RMSE): 0.6364
Shapiro-Wilk Test: Statistic=0.9978, p-value=0.7640
Residuals appear to be normally distributed (fail to reject H0)
Comment: The results are the same, or very slightly improved. Let's apply XGBoost to finalize the study and check the feature importances.
XGBoost Regression Assumptions
XGBoost is a powerful gradient boosting algorithm with the following key assumptions and characteristics:
- Additive model: builds an ensemble of weak learners (usually decision trees) sequentially, each correcting the errors of the previous one.
- No strict distribution assumptions: unlike linear regression, XGBoost does not assume linearity or normality of residuals.
- Feature independence: while not strictly required, uncorrelated features often help model performance.
- Sufficient data: works best with enough data to learn complex patterns.
- Hyperparameter tuning: performance depends on tuning parameters like learning rate, tree depth, and regularization (a minimal tuning sketch follows this list).
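For the last point, a minimal tuning sketch could look like the following before settling on the fixed settings used below. GridSearchCV is an extra sklearn import, and the parameter grid values are arbitrary illustrative choices, not recommendations:
# Sketch: small hyperparameter grid search for XGBRegressor
# (GridSearchCV is an additional import; the grid values are arbitrary examples)
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
}
grid_search = GridSearchCV(XGBRegressor(random_state=42), param_grid, cv=3, scoring='r2')
grid_search.fit(np.column_stack([X1, X2, X3, density_feature]), target_gaussian)
print('Best parameters:', grid_search.best_params_)
print('Best CV R² score:', round(grid_search.best_score_, 4))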
# Prepare feature matrix and targets
X = np.column_stack([X1, X2, X3, density_feature])
targets = {
'Gaussian': target_gaussian,
'Skewed': target_skewed,
'BoxCox': transformed
}
# Function to train, predict and evaluate XGBoost model
def train_evaluate_xgb(X, y, target_name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = XGBRegressor(random_state=42, n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print(f'--- {target_name} Target ---')
    print(f'R² Score: {r2:.4f}')
    print(f'Mean Squared Error (MSE): {mse:.4f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.4f}\n')
    # Feature importance plot
    plt.figure(figsize=(8, 4))
    sns.barplot(x=model.feature_importances_, y=['X1', 'X2', 'X3', 'X4_KDE'])
    plt.title(f'Feature Importances for {target_name} Target')
    plt.xlabel('Importance')
    plt.show()
# Run for each target
for name, y in targets.items():
    train_evaluate_xgb(X, y, name)
--- Gaussian Target ---
R² Score: 0.5490
Mean Squared Error (MSE): 0.5951
Root Mean Squared Error (RMSE): 0.7714

--- Skewed Target ---
R² Score: 0.0723
Mean Squared Error (MSE): 110.3246
Root Mean Squared Error (RMSE): 10.5036

--- BoxCox Target ---
R² Score: 0.3718
Mean Squared Error (MSE): 0.6480
Root Mean Squared Error (RMSE): 0.8050
Comment: It is instructive to see that the XGBoost model underperforms the linear regression model. Even though XGBoost is a powerful algorithm, our data told us that the relationships are linear or near-linear, so applying a non-linear algorithm is unnecessary in this case. On the other hand, look at the X4_KDE variable from the non-linear perspective: its importance increased. Still, for this dataset it is not needed.
Final Result: Always look at, check and play with your data; do not just trust the algorithms themselves. After this, you will not find it strange when plotting a KDE plot with seaborn or another library, and do not forget that it shows us the underlying distribution of our variable. Thanks