This blog is about data preprocessing with the Orange tool. Visit the profile for previous blogs. In this blog I will discuss how you can use the Orange library in Python to perform data preprocessing tasks such as Discretization, Continuization, Randomization, and Normalization with the help of various Orange functions.
Data preprocessing is a data mining technique that transforms raw data into an understandable format. Preprocessing is crucial for achieving better-quality analysis results. In this blog we will preprocess our data using Python.
We will use the Orange tool in this demo. If it is not installed, install it first and then launch Orange.
On the Orange canvas, take the Python Script widget from the left panel and double-click it.
Data discretization converts a large number of data values into a smaller number of groups so that the data becomes easier to evaluate and manage. In other words, it converts the values of a continuous attribute into a finite set of intervals with minimal loss of information. This example uses iris, a built-in dataset provided by Orange that classifies flowers based on their characteristics. Discretization is performed with the Discretize preprocessor; here it uses the EqualFreq method with n=3, which splits each attribute into three intervals holding roughly the same number of instances.
#Python script for Discretization
import Orange

iris = Orange.data.Table("iris.tab")
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_iris = disc(iris)

print("Original dataset:\n")
for e in iris[:3]:
    print(e)

print("Discretized dataset:")
for e in d_iris[:3]:
    print(e)
Original data set and Discretized dataset
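To see what equal-frequency discretization does conceptually, here is a minimal plain-Python sketch. It is not Orange's implementation: the helpers equal_freq_cut_points and discretize and the sample petal lengths are hypothetical, and the choice of midpoints as thresholds is an assumption made for illustration.

```python
def equal_freq_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds that split values into bins of roughly equal size."""
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        idx = k * len(ordered) // n_bins
        # Place the threshold halfway between two neighbouring observed values.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def discretize(value, cuts):
    """Map a number to the index of the interval it falls into."""
    return sum(value >= c for c in cuts)


# Hypothetical petal lengths, not the real iris data.
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]
cuts = equal_freq_cut_points(petal_lengths, n_bins=3)
bins = [discretize(v, cuts) for v in petal_lengths]

print("Cut points:", cuts)
print("Interval index per value:", bins)
```

Each of the three intervals ends up with three values, which is the property the n=3 setting asks for.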
Continuization does the opposite of discretization. Given a data table, it returns a new table in which the discrete attributes are replaced with continuous ones or removed:
- binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending upon the argument
- multinomial variables are treated according to the argument
- discrete attributes with only one possible value are removed.
With indicator variables, the original variable is replaced by one new variable per value. For a given row, only the new variable that matches the row's original value is one, and all the others are zero. This is the default behaviour.
For example, as shown in the code snippet below, the dataset “titanic” has the feature “status” with values “crew”, “first”, “second” and “third”, in that order. Its value for the row at index 15 is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”, so for that row “status=first” is one and the other three are zero.
#Python script for Continuization
import Orange

titanic = Orange.data.Table("titanic")
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)

print("Before Continuization : ", titanic.domain)
print("After Continuization : ", titanic1.domain)

#Data of the row at index 15 before and after continuization
print("Row 15 before : ", titanic[15])
print("Row 15 after : ", titanic1[15])
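The indicator-variable idea can also be sketched in plain Python, without Orange. The helper indicator_encode is hypothetical and only illustrates the 0.0/1.0 encoding described above; the category list is the “status” example from the text.

```python
def indicator_encode(value, categories):
    """Return one 0.0/1.0 indicator per category; only the matching one is 1.0."""
    return [1.0 if value == category else 0.0 for category in categories]


# Values of the "status" feature, in the order given in the text.
status_values = ["crew", "first", "second", "third"]

# New variable names follow the "feature=value" pattern.
names = ["status=" + v for v in status_values]
encoded = indicator_encode("first", status_values)

for name, flag in zip(names, encoded):
    print(name, flag)
```

One discrete column with four values becomes four numeric columns, and exactly one of them is set in each row.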
Normalization scales the values of an attribute so that they fall in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when attributes are on different scales; otherwise, an equally important attribute on a smaller scale can lose influence because another attribute has values on a larger scale. We use the Normalize preprocessor to perform normalization; the script below uses Normalize.NormalizeBySpan, which rescales each attribute by its span (maximum minus minimum).
#Python script for Normalization
from Orange.data import Table
from Orange.preprocess import Normalize

data = Table("iris")
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(data)

print("Before Normalization : ", data)
print("After Normalization : ", normalized_data)
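The arithmetic behind span-based scaling is short enough to write out by hand. The sketch below is an illustration under assumptions, not Orange's code: normalize_by_span is a hypothetical helper, the sample values are made up, and the target range is passed explicitly.

```python
def normalize_by_span(column, lower=0.0, upper=1.0):
    """Linearly rescale a list of numbers so min maps to lower and max to upper."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # A constant column carries no spread; map every value to the lower bound.
        return [lower for _ in column]
    return [lower + (x - lo) / span * (upper - lower) for x in column]


# Hypothetical measurements on an arbitrary scale.
lengths = [4.0, 5.0, 6.0, 8.0]

zero_to_one = normalize_by_span(lengths)
minus_one_to_one = normalize_by_span(lengths, lower=-1.0, upper=1.0)

print("Scaled to 0..1 :", zero_to_one)
print("Scaled to -1..1:", minus_one_to_one)
```

The relative spacing of the values is preserved; only the scale changes, which is why attributes become comparable afterwards.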
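With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. We use the Randomize preprocessor from the Orange library for this. The script below passes Randomize.RandomizeClasses, so the class values are shuffled among the rows while the feature values stay where they are.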
#Python script for Randomization
from Orange.data import Table
from Orange.preprocess import Randomize

data = Table("iris")
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(data)

print("Before Randomization : ", data)
print("After Randomization : ", randomized_data)
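As a rough picture of what shuffling class values means, here is a plain-Python sketch that uses only the standard library. The helper shuffle_classes, the seed parameter, and the sample rows are hypothetical; Orange's own shuffling may differ in its details.

```python
import random


def shuffle_classes(rows, seed=None):
    """Return new rows with the class labels permuted and the features left in place."""
    rng = random.Random(seed)
    labels = [label for _, label in rows]
    rng.shuffle(labels)
    return [(features, label) for (features, _), label in zip(rows, labels)]


# Hypothetical (features, class) pairs, not the real iris data.
rows = [
    ((5.1, 3.5), "setosa"),
    ((7.0, 3.2), "versicolor"),
    ((6.3, 3.3), "virginica"),
    ((4.9, 3.0), "setosa"),
]

shuffled = shuffle_classes(rows, seed=42)

for before, after in zip(rows, shuffled):
    print(before, "->", after)
```

The set of labels is unchanged, but the link between features and labels is broken, which is useful as a baseline: a model trained on such data should not do better than chance.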
That is all for this blog. We used several preprocessing functions from the Orange library to discretize, continuize, normalize, and randomize data. I hope you found what you were looking for.