It’s well known that many machine learning models can’t process categorical features natively. While there are some exceptions, it’s usually up to the practitioner to decide on a numeric representation of each categorical feature. There are many ways to accomplish this, but one technique that is seldom recommended is label encoding.
Label encoding replaces each categorical value with an arbitrary number. For example, if we have a feature containing letters of the alphabet, label encoding might assign the letter “A” a value of 0, the letter “B” a value of 1, and continue this pattern until “Z”, which is assigned 25. After this process, technically speaking, any algorithm should be able to handle the encoded feature.
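As a quick sketch, this is what that mapping looks like with scikit-learn’s `LabelEncoder` (the alphabet feature here is an illustrative stand-in, not a real dataset):

```python
import string

from sklearn.preprocessing import LabelEncoder

# A categorical feature containing the 26 letters of the alphabet
letters = list(string.ascii_uppercase)

# LabelEncoder assigns each distinct value an integer, in sorted order
encoder = LabelEncoder()
encoded = encoder.fit_transform(letters)

print(encoded[0], encoded[1], encoded[-1])  # "A" -> 0, "B" -> 1, "Z" -> 25
```

Note that the integers carry no meaning beyond being distinct labels: the encoder simply sorts the unique values and numbers them, which is exactly the arbitrariness discussed below.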
But what’s the problem with this? Shouldn’t sophisticated machine learning models be able to handle this kind of encoding? Why do libraries like CatBoost and other encoding strategies exist to deal with high-cardinality categorical features?
This article will explore two examples demonstrating why label encoding can be problematic for machine learning models. These examples will help us appreciate why there are so many alternatives to label encoding, and they will deepen our understanding of the relationship between data complexity and model performance.
One of the best ways to gain intuition for a machine learning concept is to understand how it works in a low-dimensional space and then try to extrapolate the result to higher dimensions. This mental extrapolation doesn’t always align with reality, but for our purposes, a single feature is all we need to see why better categorical encoding strategies are necessary.
A Feature With 25 Categories
Let’s start by looking at a basic toy dataset with one feature and a continuous target. Here are the dependencies we need:
import numpy as np
import polars as pl
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split