Mercedes-Benz Greener Manufacturing

PARTH SALUNKE
15 min read · Feb 4, 2022

Kaggle problem statement: Mercedes-Benz Greener Manufacturing

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium car makers. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines.

In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing.

Table of Contents

  1. Business Problem
  2. Business Constraints
  3. Performance Metrics
  4. ML Formulation
  5. Data Analysis
  6. EDA
  7. Feature Engineering
  8. Modelling / Approaches Tried
  9. Model Comparison
  10. Model Pipeline
  11. Future Extension
  12. References

1. Business Problem

Any automobile is launched in the market only after successfully completing various tests that ensure it meets established safety standards, such as the WHO road-safety measures. Only when a vehicle fulfils all the standards can the manufacturer release it to the market. Mercedes-Benz is a premium car maker that also builds custom cars to each customer's specification, so every custom configuration first goes through several tests; only if it passes the guidelines and standards is the car delivered to the customer. Mercedes-Benz has a huge number of features to test, so testing takes a lot of time, and testing is both time-consuming and costly. Companies like Mercedes-Benz are therefore shifting toward automated testing, which is more efficient than manual testing, and as they do so, errors due to human intervention are dropping rapidly.

So, for the given problem statement, we have to build a model that accurately predicts the time a car spends on the testing system, using data on thousands of cars with various features. The model will help build an efficient automated testing system, and as the testing system becomes more efficient, cars spend less time on the test bench.

2. Business Constraints

1. Predict as accurately as possible the time a car spends on the testing system.

2. There is no strict latency restriction; a prediction may take 1 to 2 minutes.

3. Performance Metrics

  1. The evaluation metric for the competition is the R² measure, known as the coefficient of determination. R² gives the fraction of the variance of Y (the target variable) that is explained by the regression. It is defined as R² = 1 - (residual sum of squares / total sum of squares), where the residual sum of squares measures the squared distances between the data points and the predictions, and the total sum of squares measures the squared distances between the data points and the mean of the target. R² has an upper bound of 1, which makes it easier to interpret than errors such as MAE, which lie between 0 and infinity. R² is sensitive to outliers when we use linear models; with ensemble methods such as boosting or other tree-based approaches it gives more robust results.

  2. We can also use MAE (mean absolute error). It is the average absolute difference between the target values and the predicted values; it is robust to outliers because it does not penalize large errors as heavily as squared-error metrics do.
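Both metrics are easy to verify by hand. Below is a minimal sketch (with made-up numbers) computing each from its definition; the helper names `r2_score_manual` and `mae` are illustrative, not from the post:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (coefficient of determination)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

# toy testing times (seconds) and predictions
y_true = [80.0, 95.0, 110.0, 130.0]
y_pred = [82.0, 93.0, 112.0, 128.0]
print(r2_score_manual(y_true, y_pred))  # close to 1.0 for a good fit
print(mae(y_true, y_pred))              # 2.0
```

These match `sklearn.metrics.r2_score` and `sklearn.metrics.mean_absolute_error`, which are what the pipeline actually uses.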

4. ML Formulation

We have 376 variables describing car features, and we need to predict the time a car spends on the testing system, so we can treat this as a regression problem.

The inputs are the independent variables (the features of the car), and we need to predict a real-valued dependent variable (the testing time) as output.

5. Data Analysis

  • We have a total of 376 variables; each variable is a unique car feature. Mercedes-Benz has A-, B-, C- and S-Class cars, with different unique features for each class too.
  • The train dataset contains 4209 data points and 378 columns.
  • The train and test data contain 8 categorical features.
  • The train and test data contain 369 binary features.
  • The target values lie between 72 and 265.

6. EDA

First, I load the train and test data from CSV files.

train & test data

1. The dataset has 4209 data points and 378 features.

2. There are no duplicates in the data.

3. We have a total of 8 categorical features and 369 binary features (all independent variables).

4. The categorical features have data type object(8) and the binary features have data type int64.

5. The dependent target variable has data type float64.

Let’s start with Exploratory Data Analysis

EDA on the dependent target variable (y): variance and skewness

variance and skewness
  • The target variable has a variance of 160.77, so the data is fairly spread out.
  • The target variable has a skewness score of 1.21, which means it is strongly skewed to the right.

Distribution of Dependent target variable (y)

  • From the given histogram I understand that the target variable has high variance in the data.
  • From the scatter plot I see that there are very few points above 150 (I think those data points belong to premium car models, because premium models have more features than normal cars, so they also take more time on the testing system).

As we know, a log transformation can sometimes make a distribution normal, so let's apply a log transformation to the dependent variable y.

After the log transformation I don't see any change in the target variable; it still does not look Gaussian, so I will skip the log-transformation-based approach.
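A quick way to check whether a log transform helps is to compare skewness before and after. This is a sketch on synthetic right-skewed data standing in for the real target (both the `skewness` helper and the generated `y` are illustrative, not the competition data):

```python
import numpy as np

def skewness(x):
    """Fisher-Pearson coefficient of skewness."""
    x = np.asarray(x, float)
    return np.mean(((x - x.mean()) / x.std()) ** 3)

rng = np.random.default_rng(0)
# synthetic right-skewed stand-in for the target, centred near ~100 like y
y = rng.lognormal(mean=4.6, sigma=0.15, size=4209)

print(skewness(y))            # positive => skewed to the right
print(skewness(np.log1p(y)))  # much closer to 0 when the transform helps
```

On the real target the post observes no such improvement, which is why the log-based approach is dropped.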

The dependent variable contains some outliers, as we saw in the scatter plot, so let's check the percentile values, which will give us a clear understanding of which threshold to set for outliers.

From the 0th to the 90th percentile we have good variance in the data, but between the 90th and 100th percentiles there are some extreme values (outliers). I found outliers at the 100th percentile, so let's zoom in between the 90th and 100th percentiles and check whether we have any major outliers.

We have good variance from the 90th to the 98th percentile, but between the 99th and 100th percentiles there are some extreme values, so I will now zoom in between 99 and 100.

We still have good variance in the 99.0 to 99.9 range, so we will set the threshold at 160: all points above 160 are treated as outliers.
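The percentile sweep described above can be done with `np.percentile`; here is a sketch on synthetic data with a few injected extremes (the values are illustrative, not the real target):

```python
import numpy as np

rng = np.random.default_rng(42)
# bulk of values near the target's typical range, plus a few extreme points
y = np.concatenate([rng.normal(100, 12, 4200), [180.0, 200.0, 265.0]])

# sweep the percentiles, zooming in near the top as in the analysis above
for p in (0, 25, 50, 75, 90, 99, 99.9, 100):
    print(f"{p:5}th percentile: {np.percentile(y, p):7.1f}")

threshold = 160                  # chosen from the percentile sweep
outliers = y[y > threshold]
print("points above threshold:", outliers.size)
```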

Independent categorical variables

We have 8 categorical features, so we will draw a boxplot for each categorical feature.

Categorical features analysis -

  • Going through all the categorical features, I see that features X0, X1, X2, X3, X5, X6, X7 and X8 have good variance in the data, which will help build a robust model.
  • Feature X4 has no variance and carries no information, so I am going to remove X4 from the data.
  • I also notice that some categories occur only when the target variable is greater than 160, so we can say those categories are special to cars that spend a long time on the testing system; we have very few data points with target values above 160.
  • Very few custom-made cars have unique categories that belong only to them, so while building the model we can simply remove the categories that have low variance.
  • I also tried removing the outliers/extreme points from the target variable, but it was not helpful because it affects some categories in the features.

Independent binary variables

As we have a lot of binary features, we can't visualize them all, so first I will select the top features that contribute to predicting the target variable.

I am using f_classif (the ANOVA F-value) to extract the top 250 features.
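A sketch of that selection step with scikit-learn's `SelectKBest` on synthetic binary features. Since the target here is continuous, this sketch uses `f_regression` as the F-test scorer (`f_classif`, mentioned above, is its classification counterpart); the data is made up so that the informative columns are known in advance:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 40)).astype(float)   # toy binary features
# target depends mostly on the first three columns
y = 5 * X[:, 0] + 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 0.5, 300)

selector = SelectKBest(score_func=f_regression, k=10)  # keep top-k by F-value
X_top = selector.fit_transform(X, y)
top_idx = selector.get_support(indices=True)
print("selected feature indices:", top_idx)
print("reduced shape:", X_top.shape)
```

In the post, `k=250` is used on the 369 binary columns instead of `k=10` on 40 toy ones.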

Now I can properly visualize the given box plots.

  1. Features X260, X270 and X205 contain only "0" for all data points, which means they are constant. I am going to remove such features because they are not useful for predicting the target variable.
  2. Features X252 and X260 have the same variance in the data; both have the same median, min and max values, so we will remove these duplicates from the data as well.
  3. Some binary features are constant (zero variance), so I am going to remove those as well.

Now I am going to remove the features that have the same variance.

If I remove those features, the model should perform better because we will have fewer features to work with.

With the help of a scatter plot I see that a lot of features have the same variance, so I am going to remove them; most of those features are close to zero variance.

Now the scatter plot looks cleaner; I removed all the duplicate-variance features.

I cleaned the binary features by removing those with duplicate variance and those with constant values.
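The constant-column and duplicate-column clean-up can be sketched with pandas; the column names below (`F_const`, `A_dup`) are made up to mimic columns like X260 or X252:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 2, size=(100, 5)), columns=list("ABCDE"))
df["F_const"] = 0        # all-zero column, like X260/X270/X205
df["A_dup"] = df["A"]    # exact duplicate of another column

# 1) drop zero-variance (constant) columns
constant = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=constant)

# 2) drop exact duplicate columns, keeping the first occurrence
df = df.T.drop_duplicates().T

print("dropped constants:", constant)
print("remaining columns:", list(df.columns))
```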

Now I am going to visualize the top 5 binary features.

After cleaning all those binary features, the new top 10 binary features gave me good results.

All the binary features have good variance, each feature has a different variance, and none of them contains only one class.

Visualizing the top-20 correlation heatmap and removing highly correlated features

There are some feature pairs (X272 & X162, X279 & X162, X328 & X279, X162 & X76) that are highly correlated with each other, with correlation values of 0.95 to 0.97. I am going to drop the highly correlated features from the data.

Now I will check the correlation of one feature versus all the others.

Heatmaps: X384, X328, X167, X136

Feature X384 has a linear relationship with all of the top 20 features, so I can use X384 for modelling.

Feature X328 is correlated with 10+ features, so I am going to set a threshold of 0.96 and remove the features whose correlation is above 0.96.

Feature X167 has a linear relationship with all of the top 20 features, so I can use X167 for modelling.

X136 is negatively correlated with most of the features.

I can also use negatively correlated features, but I will set a threshold of 0.95: I will keep the features whose correlation lies between -0.95 and 0.95 and remove those outside this range. First I will use all the top features, then only the features within the threshold range, and then check whether the model improves.

I removed a total of 35 highly correlated features.
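A sketch of the drop-by-correlation step: scan the upper triangle of the absolute correlation matrix and drop one column from every pair above the 0.95 threshold. The data is synthetic; `X162` is constructed to be nearly identical to `X76`, echoing one of the pairs mentioned above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=500)
df = pd.DataFrame({
    "X76": base,
    "X162": base + rng.normal(scale=0.05, size=500),  # ~0.99 correlation with X76
    "X328": rng.normal(size=500),
    "X384": rng.normal(size=500),
})

corr = df.corr().abs()
# keep only the upper triangle so each pair is checked exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```

Keeping the first column of each correlated pair and dropping the later one is an arbitrary but common convention.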

I have built three functions: the first returns the top k binary features, the second returns the final selected categorical features, and the third encodes the categorical features.

7. Feature Engineering

We have a lot of binary features, so I am going to use PCA (principal component analysis) and truncated SVD (truncated singular value decomposition).

1. PCA

Principal component analysis (PCA) is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one in which most of the information is preserved. It uses the eigenvectors of the covariance matrix to create a linear projection of the high-dimensional data into a lower-dimensional space. In PCA we first calculate the covariance matrix of the features, then compute its eigenvectors and eigenvalues, then sort the eigenvectors by eigenvalue and pick the top ones to represent the data in the lower-dimensional space:

  1. Standardize the data.
  2. Compute the covariance matrix.
  3. Compute the eigenvectors and eigenvalues of the covariance matrix.
  4. Keep the top 10 eigenvectors as the new components.
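In practice scikit-learn handles steps 2-4 internally; only the standardisation is explicit. A sketch on synthetic binary features (the real pipeline would pass the cleaned training matrix instead):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, 50)).astype(float)  # stand-in for the binary features

X_std = StandardScaler().fit_transform(X)  # step 1: standardise
pca = PCA(n_components=10)                 # steps 2-4: covariance, eigendecomposition, top 10
X_pca = pca.fit_transform(X_std)

print("reduced shape:", X_pca.shape)
print("variance explained by 10 components:", pca.explained_variance_ratio_.sum())
```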

2. Truncated SVD

Singular value decomposition (SVD) is a factorization of a matrix into a product of simpler matrices.

We use SVD for feature extraction and dimensionality reduction.

SVD decomposes any matrix X into the product of three matrices, U S Vᵀ:

The columns of U are the left singular vectors of X, i.e. the eigenvectors of XXᵀ.

The singular values on the diagonal of S are the square roots of the eigenvalues of XᵀX; each singular value is associated with the corresponding column of U.

The columns of V are the eigenvectors of XᵀX.

Truncated SVD keeps only the largest k singular values (and the corresponding columns of U and V), giving a rank-k approximation of X.
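scikit-learn's `TruncatedSVD` computes the decomposition and keeps only the top components; a sketch mirroring the PCA one above, again on synthetic binary data:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(200, 50)).astype(float)  # stand-in for the binary features

svd = TruncatedSVD(n_components=10, random_state=0)   # keep the 10 largest singular values
X_svd = svd.fit_transform(X)                          # rows projected onto the top components

print("reduced shape:", X_svd.shape)
print("largest singular values:", svd.singular_values_[:3])
```

Unlike PCA, TruncatedSVD does not centre the data first, which makes it convenient for sparse 0/1 matrices.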

8. Modelling / Approaches Tried

I have built a train-model function that takes the train data, the test data and a model. Inside the function we fit the model to the train data using RandomizedSearchCV with 5-fold cross-validation; once I get the best model, we train it on X and y and predict on the test data.
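A minimal sketch of such a helper (the function name, parameter grid and the synthetic data are illustrative, not the post's exact code); it wraps any regressor in a 5-fold `RandomizedSearchCV` and returns the refit best estimator:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

def train_model(X, y, model, param_dist, n_iter=10):
    """Tune `model` with 5-fold RandomizedSearchCV and return the best estimator."""
    search = RandomizedSearchCV(model, param_dist, n_iter=n_iter,
                                cv=5, scoring="r2", random_state=0)
    search.fit(X, y)
    return search.best_estimator_

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
y = 3 * X[:, 0] + rng.normal(scale=0.3, size=300)

best = train_model(X, y, DecisionTreeRegressor(random_state=0),
                   {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]})
print(type(best).__name__, best.get_params()["max_depth"])
```

The same helper can take a `Lasso`, `RandomForestRegressor` or `XGBRegressor` instance, which is how the models below are compared.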

1. Lasso regression

As we have a regression problem, I will first train a simple regression model on the dataset and, depending on the results, move on to more complex models.

  • top 100 binary and OHE categorical features

Test private R2 score =0.37208

Test public R2 score =0.37208

The lasso linear model gave me a decent score on the top 100 binary and OHE categorical features, so now I will try it on the remaining two datasets.

  • Original_features_&_PCA,SVD_component10

Test private R2 score =0.4108

Test public R2 score =0.41634

As we know, PCA and SVD represent higher-dimensional data in a lower-dimensional space, so I got a small improvement in my score.

  • Original_binary_&_label_encoded_cat_feat dataset

Test private R2 score =0.37284

Test public R2 score =0.3855

The simple lasso regression model does not give a good result on the original binary and label-encoded categorical features.

It only gives a slightly better result on the PCA & SVD dataset.

Now I am going to use a more complex model, a decision tree.

2. Decision Tree

  • top 100 binary and OHE categorical features

Test private R2 score =0.5438

Test public R2 score =0.5462

The decision tree regressor gives me a good result on the top 100 binary and OHE categorical features.

  • Original_features_&_PCA,SVD_component10

Test private R2 score =0.5438

Test public R2 score =0.5462

The engineered features give me a good score on the train data and on the private test set.

  • Original_binary_&_label_encoded_cat_feat dataset

Train R2 score =0.6172

Test private R2 score =0.51651

Test public R2 score =0.53424

The decision tree gives me some good results, so now I am going to use a more complex model and check whether it improves.

Let's try random forests.

3. Random Forest regressor

  • top 100 binary and OHE categorical features

Test private R2 score =0.54616

Test public R2 score =0.5505

The model is overfitting and there is not much improvement; let's try it on the engineered features.

  • Original_features_&_PCA,SVD_component10

Test private R2 score =0.54119

Test public R2 score =0.55309

  • Original_binary_&_label_encoded_cat_feat dataset

Test private R2 score =0.54363

Test public R2 score =0.5540

The score is not improving; the model is overfitting like the previous one.

Now I am going to use the XGB regressor.

4. XGB regressor

  • top 100 binary and OHE categorical features

Test private R2 score =0.52547

Test public R2 score =0.55113

The model is overfitting more than the decision tree and the random forest.

  • Original_features_&_PCA,SVD_component10

Test private R2 score = 0.31139

Test public R2 score = 0.3412

XGB is not working well on this dataset.

  • Original_binary_&_label_encoded_cat_feat dataset

Test private R2 score = 0.42864

Test public R2 score = 0.32162

On the original features the model gives me a 0.99 score on the train data, which means the model is totally overfitting.

Complex models like XGBoost and random forests stop improving after a certain point, so I am going to use a stacking technique, which learns information from different models and datasets and stacks it together, so that different models can learn different information from the data.

Stacking Method

We split the train data into train X and test X with test size = 0.20, then split train X into D1 and D2 with a 50% split.

We make K datasets from D1.

We train a separate model on each of the K datasets.

We stack the K predicted target vectors (the base models' predictions on D2), train a meta-model on this stacked matrix against the true target, and use the meta-model to produce the final prediction.

I have built a function that takes the data as input, performs the stacking, and returns the results.
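The splitting-and-stacking recipe above can be sketched as follows. The data is synthetic, and bootstrap sampling of D1 plus the choice of base and meta models are my assumptions, since the post does not pin them down:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 6))
y = X @ np.array([3.0, 2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.3, size=1000)

# 80/20 split, then split the train part into D1 and D2 at 50%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=0)

K = 5
base_models, meta_cols = [], []
for k in range(K):
    idx = rng.integers(0, len(X_d1), len(X_d1))  # k-th bootstrap sample of D1
    model = DecisionTreeRegressor(max_depth=6, random_state=k)
    model.fit(X_d1[idx], y_d1[idx])
    base_models.append(model)
    meta_cols.append(model.predict(X_d2))        # predictions on D2 become meta-features

meta_X = np.column_stack(meta_cols)
meta_model = LinearRegression().fit(meta_X, y_d2)  # meta-model on stacked predictions

# final prediction on the held-out 20%
test_meta = np.column_stack([m.predict(X_te) for m in base_models])
print("held-out R2:", meta_model.score(test_meta, y_te))
```

Fitting the base models on D1 but the meta-model on their D2 predictions keeps the meta-model from seeing data its inputs were trained on.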

Original_binary_&_label_encoded_cat_feat dataset

Test private R2 score =0.53662

Test public R2 score =0.54199

We are getting a good result, but it is about the same as the random forest and decision tree.

Now I am going to try the custom model on the feature-engineered data.

  • Original_features_&_PCA,SVD_component10

Test private R2 score =0.5299

Test public R2 score =0.53663

The R² score is still not improving, so we need to try another approach.

Let's go for average modelling.

Average modelling

  • Original_binary_&_label_encoded_cat_feat dataset

Test private R2 score = 0.55201

Test public R2 score =0.55446

The average model gives me a good result on the test data. I am going to try it on the feature-engineered data to check whether the model improves.

  • Original_features_&_PCA,SVD_component10

Test private R2 score =0.54226

Test public R2 score =0.55908

I am not getting a better result on the feature-engineered data, so I am going with the previous model.
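Average modelling here simply means averaging the predictions of several fitted regressors; a minimal sketch on synthetic data (the two base models are my choice for illustration, not necessarily the post's):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=500)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

models = [DecisionTreeRegressor(max_depth=5, random_state=0), Ridge(alpha=1.0)]
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in models])
avg_pred = preds.mean(axis=1)  # unweighted average of the model outputs
print("averaged prediction shape:", avg_pred.shape)
```

Averaging reduces variance when the base models make partly uncorrelated errors, which is why it can edge out any single model here.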

9. Model Comparison

The table below compares all the models.

Kaggle submission
Kaggle leaderboard

10. Model Pipeline

11. Future Extension

a. We can use deep-learning techniques to solve the given problem.

b. We can work with more featurization techniques, skipping common features and working only with the features that depend strongly on the output variable.
