Preliminary Analysis¶
- Notebook Review:
  - Open the Preliminary_analysis notebook to identify key features.
  - Visualize the data transformations and analyze the trend of the target variable.
  - All decisions in feature engineering are based on this analysis.
import importlib
import custom_utils
importlib.reload(custom_utils)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.preprocessing import PowerTransformer
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.transformation import YeoJohnsonTransformer
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import RareLabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.model_selection import KFold, cross_validate
from sklearn.metrics import make_scorer, r2_score, mean_squared_error
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
Project Overview¶
Research Objective¶
The primary goal of this project is to predict house prices by applying a comprehensive machine learning workflow. The study focuses on identifying the key factors influencing house prices and developing a robust model that generalizes well to unseen data. This is achieved through detailed exploratory analysis, rigorous feature engineering, feature selection via cross-validation, and prediction using linear regression.
Different Aspects of the Study¶
- Data Acquisition and Cleaning:
  The dataset was downloaded and loaded into a DataFrame, where initial data inconsistencies (e.g., duplicate labels in categorical variables) were addressed. Exploratory Data Analysis (EDA) laid the foundation for the subsequent preprocessing tasks.
- Feature Engineering:
  The workflow includes both numerical and categorical transformations:
  - Numerical features undergo transformations such as logarithmic and square-root scaling, imputation, and binning to normalize distributions and enhance model performance.
  - Categorical variables are treated with custom encoding, rare-label grouping, and one-hot encoding to address sparsity and noisy features.
- Feature Selection:
  Recursive Feature Elimination with Cross-Validation (RFECV) was employed to automatically select the most informative features, ensuring that the final model is not overburdened with irrelevant or redundant predictors.
- Model Training and Evaluation:
  The final model integrates the preprocessing pipelines and uses a linear regression estimator. Model performance is validated through cross-validation and test-set evaluation, with metrics including R² and MSE used to assess prediction accuracy.
- Post Processing and Interpretation:
  The project includes visualizations for model interpretation, where predicted values are compared against actual house prices. Feature importance is assessed from the model coefficients; because the selected features are standardized before the regression step, coefficient magnitudes are roughly comparable as a measure of influence.
Conclusion¶
This project demonstrates an end-to-end approach to building a predictive model for house prices. The blend of advanced feature engineering, effective feature selection, and robust evaluation provides a clear methodology for identifying key drivers of house price variance. The outcome not only highlights strong predictive performance but also offers insights into the relative importance of various features, contributing to informed decision-making in a real estate context.
Dataset Preparation¶
- Data Acquisition:
  - Download the dataset directly from Google Drive and move it into the designated dataset folder (a possible sketch of the download helper is shown below).
- Data Import and Cleaning:
  - Import the dataset into a DataFrame.
  - During exploratory data analysis (EDA), it was observed that the SaleCondition categorical column contains two labels, 'normal' and 'Normal', which refer to the same category. Correct this inconsistency.
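The download helper lives in custom_utils and its implementation is not shown in this notebook. The following is a minimal sketch of what custom_utils.download_dataset could look like, assuming the gdown package is available and the dataset/dataset.csv layout used below; the helper's actual name, signature, and behaviour may differ.

import os

import gdown  # assumption: gdown is used for Google Drive downloads

def download_dataset_sketch(drive_link, output_dir="dataset", filename="dataset.csv"):
    """Download a file shared via a Google Drive link into output_dir (illustrative only)."""
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, filename)
    # fuzzy=True lets gdown extract the file id from a full sharing URL
    gdown.download(drive_link, output_path, quiet=False, fuzzy=True)
    return output_path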
google_drive_link = "https://drive.google.com/file/d/1LqK2BvE6eGKIdbLHXaxN3aLx6dekx6B7/view?usp=drive_link"
custom_utils.download_dataset(google_drive_link)
house_data = pd.read_csv("dataset/dataset.csv")
house_data['SaleCondition'] = house_data['SaleCondition'].replace('normal', 'Normal')
X_train, X_test, y_train, y_test = train_test_split(
house_data.drop('SalePrice', axis=1), # predictive variables
house_data['SalePrice'], # target
test_size=0.1, # portion of dataset to allocate to test set
random_state=0, # we are setting the seed here
)
X_train.shape, X_test.shape
((1314, 21), (146, 21))
FEATURE ENGINEERING¶
- Target Transformation:
  - Apply a logarithmic transformation to the SalePrice variable to reduce skewness in its distribution (a small sanity check of the transform and its inverse is sketched below).
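Because the model is trained on log(SalePrice), predictions and error metrics later in the notebook are mapped back to the original price scale with np.exp. A quick illustrative check (the price value is made up for the example):

import numpy as np

price = 200_000                               # illustrative sale price
log_price = np.log(price)                     # ~12.21, the scale the model is trained on
assert np.isclose(np.exp(log_price), price)   # np.exp inverts the log transform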
Numerical Pipeline¶
- Column Removal and Transformation:
  - Remove HalfBath and LotType, as they do not show significant variance with respect to the target variable based on EDA (refer to Preliminary_analysis for visualization).
  - For the Alley column, NaN values were replaced with 0 and all other values with 1; however, this variable did not significantly impact performance, so it was removed as well.
- House Age Processing:
  - Compute HouseAge as the difference between YearBuilt and the reference YearSold, given its clear trend with SalePrice, and drop the now-redundant year column (see the transformer sketch after this list).
  - Apply a square root transformation to HouseAge to better distribute the data (refer to Preliminary_analysis for visualization).
- Garage Area Imputation:
  - Impute missing values in GarageArea using the mean, keeping the approach simple due to its distribution.
- Additional Transformations:
  - Apply the Yeo-Johnson transformation to LotArea and GrLivArea to bring their distributions closer to Gaussian.
- Variable Binning:
  - Use equal-frequency binning for the continuous numerical variables listed in var_binning to reduce noise in their relationship with SalePrice.
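The temporal and square-root steps rely on custom transformers from custom_utils that are not shown in this notebook. Below is a minimal sketch of how such transformers might be written; the class names mirror the pipeline steps, but the constructor arguments and behaviour are assumptions rather than the project's actual code.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TemporalVariableTransformerSketch(BaseEstimator, TransformerMixin):
    """Replace each year column with its distance from a reference year column."""

    def __init__(self, variables, reference_variable):
        self.variables = variables                     # e.g. ["YearBuilt"]
        self.reference_variable = reference_variable   # e.g. "YearSold"

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the training data
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            # e.g. YearSold - YearBuilt -> age of the house at the time of sale
            X[var] = X[self.reference_variable] - X[var]
        return X


class SqrtTransformerSketch(BaseEstimator, TransformerMixin):
    """Apply a square-root transformation to spread out right-skewed variables."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.variables] = np.sqrt(X[self.variables])
        return X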
Categorical Pipeline¶
- Handling Rare Categories:
  - Replace infrequent categories in the columns listed in categorical_columns with the label 'Rare'. This grouping aids model generalization and prevents evaluation errors caused by previously unseen labels.
- Foundation Column Processing:
  - In the Foundation column, two labels (PConc and CBlock) account for 82% of the data.
  - A label such as 'Do Not use this Field in the Model' is grouped with the other infrequent labels, as these are considered sensitive or outliers.
  - Apply an encoder that keeps at most three categories (the two dominant labels plus 'Rare').
- Garage Type Imputation:
  - Impute missing values in the GarageType column using a custom probabilistic imputer that follows the training data distribution, since this feature influences SalePrice significantly (a sketch of such an imputer follows this list).
- Label Mapping:
  - Map labels based on a predefined encoding dictionary. The ordinal values are chosen according to the visual trend observed with SalePrice (higher values indicate a higher sale price); a sketch of the mapper appears after the encoding_dict definition below.
- One-Hot Encoding:
  - Apply one-hot encoding to low-cardinality features that do not exhibit an ordinal relationship with SalePrice.
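The probabilistic imputer referenced above is another helper from custom_utils that is not shown in this notebook. Below is a minimal sketch of how it might work, assuming it fills NaN values by sampling labels with the fixed probabilities passed in; the actual implementation may differ (for example, it could estimate the probabilities from the training data during fit).

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ProbabilisticCategoricalImputerSketch(BaseEstimator, TransformerMixin):
    """Fill NaN in categorical columns by sampling labels with fixed probabilities."""

    def __init__(self, variables, probabilities, random_state=0):
        # Accept a single column name or a list of names (assumption)
        self.variables = [variables] if isinstance(variables, str) else variables
        self.probabilities = probabilities  # e.g. {'Attchd': 0.60, 'Detchd': 0.25, ...}
        self.random_state = random_state

    def fit(self, X, y=None):
        # Nothing is learned here: the sampling distribution is supplied up front
        return self

    def transform(self, X):
        X = X.copy()
        rng = np.random.default_rng(self.random_state)
        labels = list(self.probabilities.keys())
        probs = list(self.probabilities.values())
        for col in self.variables:
            mask = X[col].isna()
            # Draw replacement labels according to the supplied distribution
            X.loc[mask, col] = rng.choice(labels, size=int(mask.sum()), p=probs)
        return X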
# Target variable transformation:
y_train = np.log(y_train)
y_test = np.log(y_test)
var_remove = ['HalfBath', 'Alley', 'LotType']
var_binning = ['LotArea', 'GrLivArea', 'TotalBsmtSF', 'GarageArea']
categorical_columns = [
'SaleType',
'HouseStyle',
'SaleCondition',
'Foundation',
'GarageType',
'BldgType',
'Street',
'CentralAir'
]
var_onehot_encode = ['SaleCondition', 'CentralAir', 'Foundation']
encoding_dict = {
'SaleType': {'Rare': 0, 'WD': 1, 'New': 2},
'HouseStyle': {'Rare': 0, '1.5Fin': 1, '1Story': 2, '2Story': 3},
'GarageType': {'Rare': 0, 'Detchd': 0, 'Attchd': 1, 'BuiltIn': 2},
'BldgType': {'Rare': 0, 'TwnhsE': 1, '1Fam': 2},
'Street': {'Pave': 0, 'Rare': 1}
}
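The mapping above is applied by custom_utils.LabelMapper inside the categorical pipeline. Its implementation is not shown here; the following is a minimal sketch of how such a mapper could work, and the real class may handle unseen labels differently.

from sklearn.base import BaseEstimator, TransformerMixin

class LabelMapperSketch(BaseEstimator, TransformerMixin):
    """Replace categorical labels with ordinal codes from a per-column mapping."""

    def __init__(self, mapping):
        self.mapping = mapping  # e.g. {'SaleType': {'Rare': 0, 'WD': 1, 'New': 2}, ...}

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col, col_map in self.mapping.items():
            # Unmapped labels become NaN, which makes unexpected categories easy to spot
            X[col] = X[col].map(col_map)
        return X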
numerical_pipeline = Pipeline([
# Step 1: Remove specified columns
('column_remover', custom_utils.ColumnRemover(variables_to_remove=var_remove)),
# Step 2: Compute difference between YearBuilt and YearSold
('year_transformer', custom_utils.TemporalVariableTransformer(variables=["YearBuilt"], reference_variable="YearSold")),
# Renaming and dropping step
('rename_drop', custom_utils.RenameDropTransformer(rename_dict={"YearBuilt": "HouseAge"}, drop_cols=["YearSold"])),
# Step 3: Apply sqrt transformation to HouseAge
('sqrt_transformer', custom_utils.SqrtTransformer(variables=["HouseAge"])),
# Step 4: Impute missing GarageArea values with the mean
('median_imputer', MeanMedianImputer(imputation_method='mean', variables=['GarageArea'])),
# Step 5: Apply Yeo-Johnson transformation
('yeo_johnson', YeoJohnsonTransformer(variables=['LotArea', 'GrLivArea'])),
# Step 6: Apply equal frequency discretization
('discretiser', EqualFrequencyDiscretiser(variables=var_binning, q=200))
])
# Create the categorical pipeline
categorical_pipeline = Pipeline([
# Step 7: Rare-label encoding for categorical variables
('rare_label_encoder', RareLabelEncoder(n_categories=1, replace_with='Rare', missing_values='ignore',
variables=categorical_columns)),
# Step 8: Special treatment for Foundation column
('foundation_rare_encoder', RareLabelEncoder(n_categories=1, max_n_categories=2,
replace_with='Rare', missing_values='ignore',
variables=['Foundation'])),
# Step 9: Impute NaN values in GarageType with labels drawn according to the training distribution
('probabilistic_imputation', custom_utils.ProbabilisticCategoricalImputer(
variables='GarageType',
probabilities={'Attchd': 0.60, 'Detchd': 0.25, 'BuiltIn': 0.05, 'Rare': 0.10}
)),
# Step 10: Apply custom label mapping
('label_mapper', custom_utils.LabelMapper(mapping=encoding_dict)),
# Step 11: Apply one hot encoding
('onehot_encoder', custom_utils.OneHotEncodingTransformer(columns=var_onehot_encode))
])
preprocessing_pipeline = Pipeline([
('numerical_transformer', numerical_pipeline),
('categorical_transformer', categorical_pipeline)
])
preprocessed_X_train = preprocessing_pipeline.fit_transform(X_train)
FEATURE SELECTION¶
- Methodology:
  - Employed RFECV from scikit-learn to perform automatic feature selection based on cross-validation scores.
  - Utilized 5-fold cross-validation to evaluate model performance.
  - Visualized the number of selected features against the mean cross-validated MSE.
- Integration:
  - Incorporated this feature selection step into the final pipeline for automated feature selection.
preprocessed_X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1314 entries, 930 to 684
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   LotArea                1314 non-null   int64
 1   GrLivArea              1314 non-null   int64
 2   Street                 1314 non-null   int64
 3   BldgType               1314 non-null   int64
 4   HouseStyle             1314 non-null   int64
 5   OverallQuality         1314 non-null   int64
 6   OverallCondition       1314 non-null   int64
 7   HouseAge               1314 non-null   float64
 8   TotalBsmtSF            1314 non-null   int64
 9   FullBath               1314 non-null   int64
 10  GarageType             1314 non-null   int64
 11  GarageCars             1314 non-null   int64
 12  GarageArea             1314 non-null   int64
 13  SaleType               1314 non-null   int64
 14  SaleCondition_Abnorml  1314 non-null   float64
 15  SaleCondition_Normal   1314 non-null   float64
 16  SaleCondition_Partial  1314 non-null   float64
 17  SaleCondition_Rare     1314 non-null   float64
 18  CentralAir_N           1314 non-null   float64
 19  CentralAir_Y           1314 non-null   float64
 20  Foundation_CBlock      1314 non-null   float64
 21  Foundation_PConc       1314 non-null   float64
 22  Foundation_Rare        1314 non-null   float64
dtypes: float64(10), int64(13)
memory usage: 246.4 KB
# Set up RFECV for regression
min_features_to_select = 1
estimator = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Use neg_mean_squared_error as the scoring metric for regression
rfecv = RFECV(
estimator=estimator,
step=1,
cv=cv,
scoring="neg_mean_squared_error",
min_features_to_select=min_features_to_select,
n_jobs=2
)
# Fit RFECV
rfecv.fit(preprocessed_X_train, y_train)
# Print optimal number of features
print(f"Optimal number of features: {rfecv.n_features_}")
# Plot number of features VS cross-validation scores
cv_results = pd.DataFrame(rfecv.cv_results_)
plt.figure(figsize=(10, 6))
plt.xlabel("Number of features selected")
plt.ylabel("Mean cross-validated MSE")
plt.plot(
cv_results["n_features"],
-cv_results["mean_test_score"], # Convert negative MSE back to positive for plotting
)
plt.fill_between(
cv_results["n_features"],
-cv_results["mean_test_score"] - cv_results["std_test_score"],
-cv_results["mean_test_score"] + cv_results["std_test_score"],
alpha=0.3
)
plt.title("Recursive Feature Elimination with Cross-Validation")
plt.grid(True)
plt.show()
Optimal number of features: 22
MODEL TRAINING¶
- Pipeline Composition:
  - Integrated feature engineering, feature selection, normalization, and a linear regression model into a unified pipeline.
  - Trained the complete pipeline end-to-end to streamline the modeling process.
feature_selector = RFECV(
estimator=LinearRegression(),
step=1, # Remove one feature at a time
cv=5, # 5-fold cross-validation
scoring='r2', # Use R² score as the evaluation metric
min_features_to_select=5 # Optionally set a minimum number of features to keep
)
ML_pipeline = Pipeline([
('feature_engineer', preprocessing_pipeline),
('feature_selector', feature_selector),
('normalizer', StandardScaler()),
('linear_regression', LinearRegression())
])
# Fit the complete pipeline on the training data
ML_pipeline.fit(X_train, y_train)
Pipeline(steps=[('feature_engineer', Pipeline(steps=[('numerical_transformer', Pipeline(steps=[('column_remover', ColumnRemover(variables_to_remove=['HalfBath', 'Alley', 'LotType'])), ('year_transformer', TemporalVariableTransformer(reference_variable='YearSold', variables=['YearBuilt'])), ('rename_drop', RenameDropTransformer(drop_cols=['YearSold'], rename_d... 'Rare': 0}, 'SaleType': {'New': 2, 'Rare': 0, 'WD': 1}, 'Street': {'Pave': 0, 'Rare': 1}})), ('onehot_encoder', OneHotEncodingTransformer(columns=['SaleCondition', 'CentralAir', 'Foundation']))]))])), ('feature_selector', RFECV(cv=5, estimator=LinearRegression(), min_features_to_select=5, scoring='r2')), ('normalizer', StandardScaler()), ('linear_regression', LinearRegression())])
MODEL EVALUATION¶
- Cross-Validation:
  - Conducted cross-validation to assess various performance metrics on both the training and validation sets.
  - Achieved a high R² score of 0.83 (±0.008) during training.
- Test Set Performance:
  - Evaluated the test dataset, obtaining a high R² score of 0.80 (a sketch of the metrics helper used below follows this list).
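Test-set metrics below come from custom_utils.calculate_metrics, whose implementation is not shown in this notebook. The following is a minimal sketch of what it might compute, based on the keys used later ('MSE', 'RMSE', 'MAE', 'R²', 'Residuals'); the real helper may differ.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def calculate_metrics_sketch(y_true, y_pred):
    """Return a dictionary of regression metrics on the original price scale."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R²': r2_score(y_true, y_pred),
        # Residuals are kept alongside the scalar metrics but skipped when printing below
        'Residuals': np.asarray(y_true) - np.asarray(y_pred),
    }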
def custom_r2(y_true, y_pred):
return r2_score(np.exp(y_true), np.exp(y_pred))
def custom_mse(y_true, y_pred):
return mean_squared_error(np.exp(y_true), np.exp(y_pred))
# Define scoring metrics
scoring = {
'R2': make_scorer(custom_r2),
'MSE': make_scorer(custom_mse, greater_is_better=False)
}
# Set up a KFold cross-validator
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Run cross-validation on the training set
cv_results = cross_validate(ML_pipeline, X_train, y_train, cv=kfold, scoring=scoring, return_train_score=True)
test_pred = ML_pipeline.predict(X_test)
train_pred = ML_pipeline.predict(X_train)
test_metrics = custom_utils.calculate_metrics(np.exp(y_test), np.exp(test_pred))
# Display metrics
# Display the cross-validation results
print("\nCross-Validation Performance Metrics:")
print("=" * 50)
# Format and display metrics in a structured table
print(f"{'Metric':<10} {'Dataset':<10} {'Mean':<12} {'Std':<12}")
print("-" * 50)
# R2 scores
train_r2_scores = cv_results['train_R2']
print(f"{'R2':<10} {'Train':<10} {np.mean(train_r2_scores):.4f} {np.std(train_r2_scores):.4f}")
valid_r2_scores = cv_results['test_R2']
print(f"{'R2':<10} {'Valid':<10} {np.mean(valid_r2_scores):.4f} {np.std(valid_r2_scores):.4f}")
# MSE scores
train_mse_scores = cv_results['train_MSE']
print(f"{'MSE':<10} {'Train':<10} {np.mean(train_mse_scores):.4f} {np.std(train_mse_scores):.4f}")
valid_mse_scores = cv_results['test_MSE']
print(f"{'MSE':<10} {'Valid':<10} {np.mean(valid_mse_scores):.4f} {np.std(valid_mse_scores):.4f}")
# Display the test set performance in a nicely formatted way
print("\nTest Set Prediction Performance:")
print("=" * 50)
print(f"{'Metric':<15} {'Value':<10}")
print("-" * 50)
for metric, value in test_metrics.items():
if metric != 'Residuals':
print(f"{metric:<15} {value:.4f}")
Cross-Validation Performance Metrics:
==================================================
Metric     Dataset    Mean         Std
--------------------------------------------------
R2         Train      0.8331 0.0079
R2         Valid      0.8279 0.0325
MSE        Train      -1017217834.3478 55232596.0199
MSE        Valid      -1056092951.6024 249124040.5176

Test Set Prediction Performance:
==================================================
Metric          Value
--------------------------------------------------
MSE             1586941812.5927
RMSE            39836.4383
MAE             19739.6539
R²              0.8026
POST PROCESSING¶
plt.figure(figsize=(16, 12))
# Plot 1: Predicted vs Actual (Training set)
plt.subplot(2, 2, 1)
plt.scatter(np.exp(y_train), np.exp(train_pred), alpha=0.5)
plt.plot([np.exp(y_train).min(), np.exp(y_train).max()],
[np.exp(y_train).min(), np.exp(y_train).max()],
'r--', lw=2)
plt.xlabel('Actual House Price')
plt.ylabel('Predicted House Price')
plt.title('Training Set: Predicted vs Actual Prices')
plt.grid(True)
# Plot 2: Predicted vs Actual (Test set)
plt.subplot(2, 2, 2)
plt.scatter(np.exp(y_test), np.exp(test_pred), alpha=0.5, color='green')
plt.plot([np.exp(y_test).min(), np.exp(y_test).max()],
[np.exp(y_test).min(), np.exp(y_test).max()],
'r--', lw=2)
plt.xlabel('Actual House Price')
plt.ylabel('Predicted House Price')
plt.title('Test Set: Predicted vs Actual Prices')
plt.grid(True)
try:
# First transform X_train with the preprocessing pipeline
X_train_preprocessed = ML_pipeline.named_steps['feature_engineer'].transform(X_train)
# Get the feature selection mask from the fitted feature selector
selection_mask = ML_pipeline.named_steps['feature_selector'].support_
# Get the number of features
n_features_selected = np.sum(selection_mask)
# Use the actual column names from the transformed DataFrame
# (the preprocessing pipeline returns a pandas DataFrame, as shown by preprocessed_X_train.info() above)
preprocessed_feature_names = list(X_train_preprocessed.columns)
# Get the selected feature names
selected_feature_names = np.array(preprocessed_feature_names)[selection_mask]
print(f"Number of features after preprocessing: {len(preprocessed_feature_names)}")
print(f"Number of features selected: {n_features_selected}")
except Exception as e:
print(f"Method 1 failed: {e}")
# Fallback option
selected_feature_names = [f"Feature_{i}" for i in range(len(ML_pipeline.named_steps['linear_regression'].coef_))]
# Get coefficients from the linear regression model
coefficients = ML_pipeline.named_steps['linear_regression'].coef_
print(f"Number of coefficients: {len(coefficients)}")
# Create a DataFrame with selected feature names and coefficients
coef_df = pd.DataFrame({
'Feature': selected_feature_names,
'Coefficient': coefficients
})
# Sort by absolute coefficient value to determine importance
coef_df['abs_coef'] = coef_df['Coefficient'].abs()
top_features = coef_df.sort_values(by='abs_coef', ascending=False).head(15)
# Create visualization
plt.figure(figsize=(12, 8))
colors = ['#3498db' if coef >= 0 else '#e74c3c' for coef in top_features['Coefficient']]
bars = plt.barh(top_features['Feature'], top_features['Coefficient'], color=colors)
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.title('Top 15 Features by Importance in Linear Regression Model')
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.axvline(x=0, color='black', linestyle='-', alpha=0.5) # Add a vertical line at x=0
plt.gca().invert_yaxis() # Show highest importance at the top
plt.tight_layout()
plt.show()
Number of features after preprocessing: 23
Number of features selected: 22
Number of coefficients: 22