Data science | Data preprocessing using scikit-learn | Coffee Quality database

Janvi Ajudiya
5 min read · Aug 21, 2021


Data preprocessing is a data mining technique that transforms raw data into an understandable format. There are a number of data preprocessing methods, but our main focus here is on:

(1) Data Encoding

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Dataset Description

Here I have used the ‘Coffee Quality database’. This dataset contains information about quality measures like aroma, body, flavor, acidity, moisture, defect, balance, uniformity, etc., as well as bean metadata and farm metadata.

This dataset contains numeric as well as categorical data. It also has columns on different scales and contains missing values, which makes it a good fit for demonstrating preprocessing.

Dataset : Coffee Quality Database

Data Encoding

Data encoding is the transformation of categorical variables into binary or numerical counterparts: we assign a unique value to each category of a categorical attribute. An example is to encode male or female for gender as 1 or 0. There are two types of data encoding: (1) label encoding and (2) one-hot encoding.

(1) Label encoding

If a column has more than one category, we can use a label encoder to convert those categories into numerical features. LabelEncoder assigns a unique number to each category.

As you can see, the ‘Number.of.Bags’ column has 131 categories, which are simply its distinct values. After applying LabelEncoder, each value is replaced by its label. The value 1062 was converted to 131, and so on…

The classes_ attribute lets us map each numeric label back to its original category: the category at index 0 is encoded as 0, the category at index 1 as 1, and so on.
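The label-encoding step can be sketched roughly as follows. The category values here are hypothetical stand-ins, not rows from the Coffee Quality database. Note also that scikit-learn documents LabelEncoder as intended for target labels; OrdinalEncoder is its counterpart for feature columns.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values (assumed for illustration, not real data).
methods = ["Washed", "Natural", "Honey", "Washed"]

encoder = LabelEncoder()
labels = encoder.fit_transform(methods)

# classes_ holds the sorted unique categories; the category at index i
# is the one encoded as label i.
mapping = {category: index for index, category in enumerate(encoder.classes_)}

# inverse_transform recovers the original categories from the labels.
restored = encoder.inverse_transform(labels)

print(labels)
print(mapping)
```

Because classes_ is sorted, the labels follow alphabetical (or numeric) order of the categories, not the order in which they first appear in the column.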

(2) One-hot encoder

A one-hot encoder does the same job in a different way. A label encoder assigns a single number to each category, whereas a one-hot encoder creates a whole new binary column for each category. So if a column has 3 categories, the one-hot encoder will add 3 new columns to your dataset.

Which one to use depends on the dataset and its behavior. One-hot encoding increases the dimensionality, but it is useful most of the time: with label encoding, a model may treat the numeric labels as ordered quantities and compare them with each other, which leads to wrong assumptions. That is why one-hot encoding is used more often in the real world. Still, I advise you to experiment with both.
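A minimal sketch of one-hot encoding, again on hypothetical category values rather than the real dataset. The encoder returns a sparse matrix by default, so the example converts it to a dense array for inspection.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column with 3 distinct categories.
methods = np.array([["Washed"], ["Natural"], ["Honey"], ["Washed"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(methods).toarray()

# categories_[0] lists the categories in the order of the output columns.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its category, so no ordering between categories is implied.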

Normalization

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. Real-world data is rarely available on the same scale, and different columns usually have different scales. To bring all the columns onto one scale, we can use normalization methods.

MinMaxScaler type normalization

MinMaxScaler : For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum.
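As a small sketch of this formula, the example below scales two hypothetical columns (the numbers are made up for illustration) and checks the result against the manual computation (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical numeric columns on very different scales.
values = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

# Same computation by hand, column by column: (x - min) / (max - min).
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled)
```

After scaling, every column has a minimum of 0 and a maximum of 1, regardless of its original range.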

Standardization

Standardization is another scaling technique, in which the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resulting distribution has a standard deviation of 1.
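A minimal sketch of standardization with StandardScaler, using made-up numbers. The manual check assumes the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical numeric columns on different scales.
values = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [6.0, 900.0]])

scaler = StandardScaler()
standardized = scaler.fit_transform(values)

# Same computation by hand: z = (x - mean) / std, per column.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```

Unlike min-max normalization, the standardized values are not confined to a fixed range such as 0 to 1.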

Imputing Missing Values

Missing data are values that are not recorded in a dataset. They can be a single value missing in a single cell or an entire missing observation (row). Missing data can occur in both continuous variables (e.g. height of students) and categorical variables (e.g. gender of a population).

We can handle missing values in two ways: (1) remove the data (the whole row) that has missing values, or (2) fill in the values using some strategy, for example with an imputer.

We can see that attributes such as altitude_low_meters have 230 null/blank values.

Simple Imputer
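Both ways of handling missing values can be sketched on a tiny hypothetical altitude column (the numbers are invented, not taken from the dataset). The mean strategy is an assumption made for this example; SimpleImputer also supports "median", "most_frequent" and "constant".

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical altitude column with two missing entries (np.nan).
altitude = np.array([[1200.0], [np.nan], [1800.0], [np.nan], [1500.0]])

# Count the missing values before doing anything.
missing_count = int(np.isnan(altitude).sum())

# Option 1: remove every row that contains a missing value.
complete_rows = altitude[~np.isnan(altitude).any(axis=1)]

# Option 2: fill missing values with the column mean (assumed strategy).
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(altitude)

print(missing_count)
print(filled.ravel())
```

Dropping rows shrinks the dataset, while imputing keeps every row at the cost of introducing estimated values.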

Discretization

Data discretization is the process of converting continuous data into discrete buckets by grouping it. By doing this we can limit the number of possible states. Basically, we convert numerical features into categorical columns.

There are 3 types of discretization available in scikit-learn: (1) Quantile Discretization Transform, (2) Uniform Discretization Transform, and (3) KMeans Discretization Transform.

Before discretization

Quantile Discretization Transform

Uniform Discretization Transform

KMeans Discretization Transform
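All three discretization strategies can be tried with KBinsDiscretizer, as in the rough sketch below. The input is randomly generated skewed data standing in for a continuous column, and the choice of 4 bins is an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic skewed data standing in for a continuous column (not real data).
rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=(200, 1))

binned = {}
for strategy in ("quantile", "uniform", "kmeans"):
    # encode="ordinal" returns the bin index (0..n_bins-1) for each value.
    discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    binned[strategy] = discretizer.fit_transform(values).ravel().astype(int)
    # Count how many samples fall into each bin.
    counts = np.bincount(binned[strategy], minlength=4)
    print(strategy, counts)
```

The quantile strategy puts roughly the same number of samples in each bin, the uniform strategy uses bins of equal width, and the kmeans strategy places bin edges based on one-dimensional k-means clusters.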

Source : GitHub
