Encoding high-cardinality categorical features | Towards Data Science
Learn to use target encoding, count encoding, feature hashing and embedding using scikit-learn and TensorFlow
In this article, we will go through four popular methods to encode categorical variables with high cardinality: (1) target encoding, (2) count encoding, (3) feature hashing and (4) embedding.
We'll explain how each method works, discuss its pros and cons, and observe its impact on the performance of a classification task.
Table of contents
— Introducing categorical features
(1) Why do we need to encode categorical features?
(2) Why is one-hot encoding not suited to high cardinality?
— Application on an AdTech dataset
— Overview of each encoding method
(1) Target encoding
(2) Count encoding
(3) Feature hashing
(4) Embedding
— Benchmarking the performance to predict CTR
— Conclusion
— To go further
Categorical features are a type of variable that describes categories or groups (e.g. gender, color, country), as opposed to numerical features that measure a quantity (e.g. age, height, temperature).
There are two types of categorical data: ordinal features, whose categories can be ranked and sorted (e.g. T-shirt sizes or restaurant ratings from 1 to 5 stars), and nominal features, whose categories don't imply any meaningful order (e.g. the name of a person or of a city).
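As a small illustration of the distinction, pandas lets you declare a categorical column as ordered (ordinal) or unordered (nominal); the example values below are made up for the sketch:

```python
import pandas as pd

# Ordinal feature: T-shirt sizes have a meaningful order S < M < L
sizes = pd.Series(["S", "L", "M", "S"]).astype(
    pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
)
print(sizes.cat.codes.tolist())  # [0, 2, 1, 0] — codes follow the declared order

# Nominal feature: city names carry no order; codes are arbitrary labels
cities = pd.Series(["Paris", "Tokyo", "Paris"]).astype("category")
print(cities.cat.codes.tolist())
```

With `ordered=True`, comparisons such as sorting respect the declared ranking, whereas the nominal codes are just identifiers.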
Why do we need to encode categorical features?
Encoding a categorical variable means finding a mapping that converts a category to a numerical value.
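The simplest such mapping assigns each category an integer label. A minimal sketch with scikit-learn's `OrdinalEncoder` (the country names here are invented for the example):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One categorical column; each country name is mapped to an integer code
X = np.array([["France"], ["Japan"], ["France"], ["Brazil"]])

encoder = OrdinalEncoder()
codes = encoder.fit_transform(X)

# Categories are assigned codes in alphabetical order: Brazil=0, France=1, Japan=2
print(codes.ravel().tolist())  # [1.0, 2.0, 1.0, 0.0]
```

Note that these integer codes impose an arbitrary order on a nominal feature, which is one motivation for the alternative encodings covered in this article.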
While some algorithms can work with categorical data directly (like decision trees), most machine learning models cannot handle categorical features and were designed to operate with…