Predicting used car prices with Knime’s Visual Programming
Last summer, as I spent hours browsing Carvana to find a good car within my budget, I kept thinking about coding a model to predict used car prices based on various conditions such as age, mileage, model type, etc. There are several such models available on Kaggle and scores of approaches laid out in various articles on Medium. Being a big fan of Knime, I modeled a simple linear regression for used car price prediction with Knime. In this article I will take you through the various steps, from data cleaning to prediction, using Knime's low-code approach.
Kaggle is a treasure trove of data. I looked up 'Used Car Price' on Kaggle and came upon scores of datasets. I have chosen the dataset given in this link. The author has already divided the data into a train and a test set. However, for my model I am going with just the train set, and I will partition it further with Knime.
Modeling with Knime:
To model this prediction with Knime, we go through the following steps: Data Reading → Data Cleaning → Data Categorising → Filtering → Data Partitioning → Linear Regression Learner → Linear Regression Predictor → Scorer. Let's go through each step now.
The 'CSV Reader' node is used for data input. The preview at the bottom of the configuration window shows the data in its raw form. In the 'Transformation' tab, you can choose which columns to include; this acts as an initial filter for removing redundant columns. For our model, I have removed the 'New Price' column and the unnamed blank row-ID column. A preview of the output of the CSV Reader node with the read table is shown below.
As seen in the screenshot above, the output of the CSV Reader has some rows with missing values. The 'Mileage', 'Engine' and 'Power' columns are in string format, with the units for mileage, engine capacity and power rating included in the values. The price is also listed as a fraction (in 10,000s). These will be cleaned and readied in the next stages.
String Manipulation — Data Cleaning:
The String Manipulation metanode consists of four 'String Manipulation' nodes, each of which is used for a specific task. Taking a look at the data from the CSV Reader, we see that the Mileage column has units of 'kmpl' and 'km/kg', the Engine column has units of 'CC', and the Power column has units of 'bhp'. These need to be removed for the data to be treated as numbers rather than strings.
You might wonder why we need four 'String Manipulation' nodes rather than one. If you look at the configuration window to the left, you see that the output of a 'String Manipulation' node either replaces a column or appends a new one. The node cannot remove different characters from different columns in one go and then write each result back to its corresponding column (or append each as a new column). There is a 'String Manipulation (Multi Column)' node available in Knime, but it is not suitable for our need either, because it applies the same manipulation to multiple columns, whereas we need a different manipulation for each column. Hence we will use four String Manipulation nodes and, to avoid a cluttered look, collapse them into a metanode.
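To make the cleaning step concrete, here is a minimal Python sketch of what each String Manipulation node does: strip a known unit suffix from one column's values so the remainder parses as a number. The sample values and unit strings mirror the dataset, but this is an illustration, not KNIME's internal code.

```python
def strip_unit(value: str, unit: str) -> float:
    """Remove a trailing unit such as ' kmpl', ' CC' or ' bhp' and parse the rest."""
    return float(value.replace(unit, "").strip())

# One call per column, mirroring the four String Manipulation nodes
mileage = strip_unit("19.67 kmpl", " kmpl")   # 19.67
engine = strip_unit("1582 CC", " CC")         # 1582.0
power = strip_unit("126.2 bhp", " bhp")       # 126.2
```

Note that the Mileage column actually carries two units ('kmpl' and 'km/kg'), so in practice that node needs to handle both, e.g. by chaining two replacements.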
Math Update — Data Cleaning:
Math Update is a metanode with two 'Math Formula' nodes. We are doing two tasks here. The first is to update the price column so that the real price is displayed rather than a fractional price (in 10,000s). The second is to calculate the 'age' of the car, i.e. the difference between the current year and the car's model year. This replaces a categorical variable for the car's year with a numeric one.
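The two Math Formula expressions can be sketched as plain Python functions (the function names here are illustrative, not from the workflow):

```python
from datetime import date

def update_price(fractional_price: float) -> float:
    """First Math Formula node: scale the listed price (in 10,000s) to the real price."""
    return fractional_price * 10_000

def car_age(model_year: int, current_year: int = date.today().year) -> int:
    """Second Math Formula node: numeric age = current year minus model year."""
    return current_year - model_year
```

For example, a listed price of 1.75 becomes 17,500, and a 2015 car evaluated in 2021 gets an age of 6.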
Categorical Variables — Data Manipulation:
The data that we read from the CSV has 4 different categorical variables — Location (Chennai, Delhi, Coimbatore etc), Transmission (Manual and Automatic), Fuel Type (CNG, Diesel, Petrol, Electric and LPG) and Owner Type (First, Second, Third and Fourth & above).
The 'Category to Number' node takes categorical data and assigns a number to each distinct category. Additionally, it creates new columns titled after the original categorical columns, where each row holds the number associated with that row's category. For example, suppose Diesel fuel is assigned 1, Automatic transmission is assigned 0, and 'First' owner is assigned 1. A row with the values 1, 0 and 1 in these three columns then tells the model that the record refers to a diesel-fuelled car with automatic transmission and one owner. A sample output of this node is given below.
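As a rough Python equivalent, here is what the node does per column — assign each distinct category an integer code. This is a simplified stand-in for illustration, not KNIME's actual implementation:

```python
def category_to_number(values):
    """Assign an integer to each distinct category, in order of first appearance."""
    mapping = {}
    codes = [mapping.setdefault(v, len(mapping)) for v in values]
    return codes, mapping

codes, mapping = category_to_number(["Diesel", "Petrol", "Diesel", "CNG"])
# codes == [0, 1, 0, 2]; mapping == {"Diesel": 0, "Petrol": 1, "CNG": 2}
```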
The output of this node is passed through a 'Column Filter' where I filter out the now-redundant columns: Location, Year, Fuel Type, Transmission, Owner Type and Price. All of these removed columns have already been cleaned and replaced using the steps described above.
Since the Mileage, Engine and Power columns had units associated with them in the original data, they were loaded as 'String' columns in Knime by default. The 'String to Number' node converts these to a numeric data type. Since they have decimal values, the type chosen here is 'Double'. The output table will show the updated data type as 'Double' rather than 'String'.
Since the data from Kaggle has missing or null values in certain fields, we don't want to include those rows in our model. The 'Rule-based Row Filter' lets the user write rules that match rows. As you can see, I have written rules matching 'Missing' values in each column and then removed the matches using the 'Exclude TRUE matches' option in the configuration window.
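In plain Python terms, the effect of this filter is to keep only rows where every required field is present. The sketch below (with made-up sample rows) mirrors that behaviour, not the node's rule syntax:

```python
def drop_missing_rows(rows, required_columns):
    """Keep only rows where every required column has a (non-None) value."""
    return [row for row in rows
            if all(row.get(col) is not None for col in required_columns)]

cars = [
    {"Mileage": 19.67, "Engine": 1582.0, "Power": 126.2},
    {"Mileage": None, "Engine": 1199.0, "Power": 88.7},   # missing mileage: excluded
]
clean = drop_missing_rows(cars, ["Mileage", "Engine", "Power"])
# Only the first row survives
```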
Now that the data has been collected, read and cleaned, the next step is to model the data using a simple regression learner and predictor node.
The first step towards doing this is to partition the data using the 'Partitioning' node. I have chosen the default 70%/30% split between the train and test data. The output of this node has two ports: one for the training data and another for the test data.
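The split can be sketched in a few lines of Python. KNIME's Partitioning node offers several sampling strategies; this mimics a simple random 70/30 draw with a fixed seed:

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split them into train and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = partition(list(range(100)))
# len(train) == 70, len(test) == 30
```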
One of the ports of the 'Partitioning' node is connected to the 'Linear Regression Learner' node. In the configuration window we choose the target variable (Updated_Price) and ensure that all the required columns are included. Note that if you did not use a Column Filter earlier, this is a good chance to remove redundant columns so that your model output is not affected. In the off-chance that there is still some missing data (say you didn't use the Rule-based Row Filter, or you missed writing the rule for a column), you can choose 'Ignore rows with missing values' in the configuration window.
The output of the 'Linear Regression Learner' node consists of a model port for the predictor and a port with the coefficients. We connect the model to the 'Regression Predictor' node, and we connect the second output port of the 'Partitioning' node (the test data) to the Regression Predictor as well. The output port of the predictor node is connected to the 'Numeric Scorer' so that we can evaluate how the model performed against the given data. The configuration for the 'Numeric Scorer' is shown here: the reference column is the column with the actual data, and the 'Predicted column' contains, well, the predicted data.
The output of the Linear Regression model is shown above. The R² (goodness-of-fit measure) is 0.715, which indicates that the model explains about 71.5% of the variance in the data. The image at the bottom shows the coefficients from the Linear Regression Learner; these show how much each input variable affects the final output of the regression model.
In this article we have gone through the workflow for a simple linear regression model in Knime for used car price prediction. Depending on your application, the R² may need further improvement, and this can be done by working on data cleaning, data manipulation, or even checking the data at its source. A good way to check this is to apply a similar workflow to a similar dataset collected from Kaggle or other data sources. A striking feature of this workflow is that it involved very minimal coding, and even where coding was needed, it was quite intuitive with the syntax provided by Knime. Python, R and SQL are holy grails for data scientists and analysts globally, but Knime opens the door to data science for everyone through this low- or no-code approach.
You can access the Knime Workflow at this link: Car_Price_Prediction — KNIME Hub