Encoding high-cardinality categorical features | Towards Data Science

Learn to apply target encoding, count encoding, feature hashing and embedding using scikit-learn and TensorFlow

“Click”, photo by Cleo Vermij on Unsplash

In this article, we will go through 4 popular methods to encode categorical variables with high cardinality: (1) target encoding, (2) count encoding, (3) feature hashing and (4) embedding.

We’ll explain how each method works, discuss its pros and cons and observe its impact on the performance of a classification task.

Table of contents

Introducing categorical features
(1) Why do we need to encode categorical features?
(2) Why one-hot encoding is not suited to high cardinality?
Application on an AdTech dataset
Overview of each encoding method
(1) Target encoding
(2) Count encoding
(3) Feature hashing
(4) Embedding
Benchmarking the performance to predict CTR
To go further

Categorical features are a type of variable that describes categories or groups (e.g. gender, color, country), as opposed to numerical features that measure a quantity (e.g. age, height, temperature).

There are two types of categorical data: ordinal features, whose categories can be ranked and sorted (e.g. T-shirt sizes or restaurant ratings from 1 to 5 stars), and nominal features, whose categories do not imply any meaningful order (e.g. the name of a person or of a city).
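To make the distinction concrete, here is a minimal sketch in plain Python. The size list, the city names and the helper names (`encode_ordinal`, `encode_nominal`) are illustrative assumptions, not taken from the article's dataset.

```python
# Ordinal feature: the categories have a natural order, so an explicit
# rank mapping preserves it. The size list below is an assumption.
SIZE_ORDER = ["S", "M", "L", "XL"]
size_to_rank = {size: rank for rank, size in enumerate(SIZE_ORDER)}


def encode_ordinal(values):
    """Replace each size by its rank in SIZE_ORDER."""
    return [size_to_rank[v] for v in values]


def encode_nominal(values):
    """Assign an arbitrary integer id to each distinct category.

    The ids follow the order of first appearance and carry no meaning:
    a city with id 1 is not "greater" than a city with id 0.
    """
    ids = {}
    codes = [ids.setdefault(v, len(ids)) for v in values]
    return codes, ids


sizes = encode_ordinal(["M", "S", "XL"])
cities, city_ids = encode_nominal(["Paris", "Lyon", "Paris"])
```

With an ordinal feature, comparing the encoded values is meaningful (rank 3 really is larger than rank 1). With a nominal feature the integers are only labels, which is one reason other encodings are usually preferred for them.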

Why do we need to encode categorical features?

Encoding a categorical variable means finding a mapping that converts a category to a numerical value.
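As a rough illustration of such a mapping, the sketch below learns a category-to-number dictionary from training data and applies it to new values. It uses occurrence counts as the numerical value, which is the general idea behind count encoding; the function names, the `site_*` values and the choice of `0` for unseen categories are assumptions made for this example.

```python
from collections import Counter


def fit_count_mapping(train_values):
    """Learn a category -> count mapping from the training data."""
    return dict(Counter(train_values))


def apply_mapping(values, mapping, default=0):
    """Convert categories to numbers; unseen categories get `default`."""
    return [mapping.get(v, default) for v in values]


# Hypothetical training column with three distinct categories.
train = ["site_a", "site_b", "site_a", "site_c", "site_a"]
mapping = fit_count_mapping(train)

# "site_z" never appeared in training, so it falls back to the default.
encoded = apply_mapping(["site_a", "site_c", "site_z"], mapping)
```

Learning the mapping on training data only, and deciding up front how unseen categories are handled, matters more as cardinality grows, since new categories are then more likely to appear at prediction time.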

While some algorithms can work with categorical data directly (like decision trees), most machine learning models cannot handle categorical features and were designed to operate with…
