Most machine learning algorithms cannot work with strings directly. Hence, we need to convert string input data into numeric form before passing it to the algorithms for training.

For classification, you can usually skip the numeric conversion of a string target variable, as many implementations (for example, the classifiers in scikit-learn) encode the class labels internally.

The predictor variables can be of two types:

  • Ordinal Variable: Categorical strings that have a natural ordering. For example, the Size column can be ordered naturally as S<M<L, and the priority of support tickets as P1>P2>P3. Hence, while converting them to numeric, we must assign values that preserve this natural ordering. For instance, S<M<L can be represented by 1<2<3.
  • Nominal Variable: Categorical strings that do NOT have any natural ordering, for example, Gender, Colors, etc. If a nominal variable has exactly two unique values, we call it a Binary variable and simply replace its values with 0 and 1. When there are more than two unique values, we create a dummy variable for each unique value.
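Whether a variable is ordinal or nominal has to be decided by you, but telling binary from multiclass columns only requires counting unique values. A minimal sketch, assuming a small made-up DataFrame (the column names and values below are illustrative, not from the article's data):

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Color": ["Red", "Green", "Blue", "Red"],
    "Age": [25, 32, 41, 29],
})

# Treat every non-numeric column as a categorical string column.
string_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Count the unique values in each string column.
unique_counts = df[string_cols].nunique()

# Two unique values -> binary; more than two -> multiclass.
binary_cols = unique_counts[unique_counts == 2].index.tolist()
multiclass_cols = unique_counts[unique_counts > 2].index.tolist()

print("Binary:", binary_cols)
print("Multiclass:", multiclass_cols)
```

This only counts values; it cannot tell you whether a column carries a natural order.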

Convert ordinal categorical to numeric

Consider the data below. It contains three categorical string variables: Gender, Department, and Rating. Of these, Rating is ordinal and the other two are nominal.

You can convert the ordinal variable to numeric by providing a mapping for each unique value. For example, here we know that Rating-A is better than Rating-B, and Rating-B is better than Rating-C.

Hence, 3>2>1 can represent the order A>B>C.

You must know this order before converting any ordinal categorical data. Sometimes it will not be obvious; in that case, use your business domain knowledge or consult a business analyst to confirm it.

The mapping can be done using the replace() method of a pandas Series.
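A minimal sketch of this mapping, assuming a small made-up dataset with the Gender, Department, and Rating columns (the row values are illustrative). The trailing astype(int) is an addition here to make sure the column ends up with an integer dtype regardless of the pandas version:

```python
import pandas as pd

# Hypothetical sample data; the row values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Department": ["HR", "IT", "Finance", "IT"],
    "Rating": ["A", "B", "C", "A"],
})

# A > B > C is represented by 3 > 2 > 1.
rating_order = {"A": 3, "B": 2, "C": 1}

# Replace each rating with its number, then force an integer dtype.
df["Rating"] = df["Rating"].replace(rating_order).astype(int)

print(df)
```

Any increasing set of numbers would preserve the order; 1, 2, 3 is simply the most readable choice.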

Sample Output

[Image: Converting an ordinal variable to numeric in Python]

Convert nominal categorical to numeric

Nominal variables can be of two types: Binary (only two unique values) and Multiclass (more than two unique values).

When the variable is binary, we map its values to 0 and 1 in the same column using the replace() function. When it is multiclass, we create dummy variables using the get_dummies() function.
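A minimal sketch of both cases on single columns, assuming made-up Gender and Department values. Which gender becomes 1 is an arbitrary choice, and dtype=int is passed so the dummies come out as 0/1 integers (pandas 2.0 and later return booleans by default):

```python
import pandas as pd

# Hypothetical sample data; the row values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Department": ["HR", "IT", "Finance", "IT"],
})

# Binary nominal: map the two values to 0 and 1 in the same column.
df["Gender"] = df["Gender"].replace({"Male": 1, "Female": 0}).astype(int)

# Multiclass nominal: one dummy column per unique value.
department_dummies = pd.get_dummies(df["Department"], prefix="Department", dtype=int)

print(df["Gender"].tolist())
print(department_dummies)
```

Each row has exactly one 1 across the Department dummy columns, marking the department that row belongs to.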

Sample Output:

[Image: Converting binary and multiclass nominal variables to numeric in Python]

Notice the use of the get_dummies() function: we pass the full data to it. By default, get_dummies() leaves the numeric variables untouched, picks up all the string variables, converts them to dummies, and drops the original string columns. This saves us a lot of effort!
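To illustrate this behaviour, here is a small sketch that passes a whole DataFrame to get_dummies(). The Age and City columns are hypothetical additions for the demonstration, and dtype=int is used so the dummies are 0/1 integers rather than booleans:

```python
import pandas as pd

# Hypothetical data: one numeric column and two multiclass string columns.
df = pd.DataFrame({
    "Age": [25, 32, 41, 29],
    "Department": ["HR", "IT", "Finance", "IT"],
    "City": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Pass the full data; only the string columns are converted.
encoded = pd.get_dummies(df, dtype=int)

print(encoded.columns.tolist())
```

The Age column passes through unchanged, while Department and City are replaced by columns named after the original column and each value, such as Department_HR and City_Pune.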

Since get_dummies() can convert all the multiclass nominal variables at once, this operation is carried out at the end. Hence, the order of conversion is:

  1. First, ordinal variables are converted using replace() one variable at a time
  2. Binary nominal variables are converted using replace() one variable at a time
  3. All multiclass nominal variables are converted using get_dummies() at once
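The three steps can be sketched as one small helper. The function name encode_categoricals and its two mapping arguments are hypothetical, introduced here only for illustration, and the sample rows are made up:

```python
import pandas as pd

def encode_categoricals(df, ordinal_maps, binary_maps):
    """Hypothetical helper that applies the three steps in order."""
    out = df.copy()
    # Step 1: ordinal variables, one column at a time.
    for col, mapping in ordinal_maps.items():
        out[col] = out[col].replace(mapping).astype(int)
    # Step 2: binary nominal variables, one column at a time.
    for col, mapping in binary_maps.items():
        out[col] = out[col].replace(mapping).astype(int)
    # Step 3: every remaining string column becomes dummy columns.
    return pd.get_dummies(out, dtype=int)

# Hypothetical sample data; the row values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Department": ["HR", "IT", "Finance", "IT"],
    "Rating": ["A", "B", "C", "A"],
})

ready = encode_categoricals(
    df,
    ordinal_maps={"Rating": {"A": 3, "B": 2, "C": 1}},
    binary_maps={"Gender": {"Male": 1, "Female": 0}},
)

print(ready)
```

Because the ordinal and binary columns are already numeric by the time get_dummies() runs, it only touches the multiclass Department column.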

After this, the data is ready for machine learning!

Lead Data Scientist

Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion to teach inspired him to create this website!