# Dynamic Pricing with Multi-Armed Bandit: Learning by Doing | by Massimiliano Costacurta | Aug, 2023

## Applying Reinforcement Learning techniques to real-world use cases, especially dynamic pricing, can reveal many surprises

In the vast world of decision-making problems, one dilemma belongs squarely to Reinforcement Learning techniques: exploration versus exploitation. Imagine walking into a casino with rows of slot machines (also known as "one-armed bandits"), where each machine pays out a different, unknown reward. Do you explore and play each machine to discover which one has the highest payout, or do you stick with one machine, hoping it's the jackpot? This metaphorical scenario underpins the concept of the Multi-armed Bandit (MAB) problem. The objective is to find a strategy that maximizes the rewards over a sequence of plays. While exploration provides new insights, exploitation leverages the information you already possess.

Now, transpose this principle to dynamic pricing in a retail scenario. Suppose you're an e-commerce store owner with a new product. You aren't sure about its optimal selling price. How do you set a price that maximizes your revenue? Should you explore different prices to understand customer willingness to pay, or should you exploit a price that has historically performed well? Dynamic pricing is essentially a MAB problem in disguise. At each time step, every candidate price point can be seen as an "arm" of a slot machine, and the revenue generated at that price is its "reward." Another way to see this is that the objective of dynamic pricing is to swiftly and accurately measure how a customer base's demand reacts to varying price points. In simpler terms, the aim is to pinpoint the demand curve that best mirrors customer behavior.

In this article, we'll explore four Multi-armed Bandit algorithms to evaluate their efficacy against a well-defined (though not straightforward) demand curve. We'll then dissect the primary strengths and limitations of each algorithm and delve into the key metrics that are instrumental in gauging their performance.

Traditionally, demand curves in economics describe the relationship between the price of a product and the quantity of that product consumers are willing to buy. They generally slope downwards, reflecting the common observation that as price rises, demand typically falls, and vice versa. Think of popular products such as smartphones or concert tickets. If prices are lowered, more people tend to buy, but if prices skyrocket, even ardent fans might think twice.

Yet in our context, we'll model the demand curve slightly differently: we're plotting price against probability. Why? Because in dynamic pricing scenarios, especially for digital goods or services, it's often more meaningful to think in terms of the likelihood of a sale at a given price than to speculate on exact quantities. In such environments, each pricing attempt can be seen as an exploration of the probability of success (or purchase), which can easily be modeled as a Bernoulli random variable with a probability *p* depending on a given test price.

Right here’s the place it will get notably fascinating: whereas intuitively one may assume the duty of our Multi-armed Bandit algorithms is to unearth that very best value the place the likelihood of buy is highest, it’s not fairly so easy. Actually, our final purpose is to maximise the income (or the margin). This implies we’re not looking for the value that will get the most individuals to click on ‘purchase’ — we’re looking for the value that, when multiplied by its related buy likelihood, offers the very best anticipated return. Think about setting a excessive value that fewer folks purchase, however every sale generates vital income. On the flip aspect, a really low value may appeal to extra consumers, however the complete income may nonetheless be decrease than the excessive value state of affairs. So, in our context, speaking in regards to the ‘demand curve’ is considerably unconventional, as our goal curve will primarily characterize the likelihood of buy somewhat than the demand immediately.

Now, getting to the math, let's start by noting that consumer behavior, especially where price sensitivity is concerned, isn't always linear. A linear model might suggest that for every incremental increase in price there's a constant decrement in demand. In reality, this relationship is often more complex and nonlinear. One way to model this behavior is by using logistic functions, which can capture the nuance more effectively. Our chosen model for the demand curve is then:

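The formula appeared as an image in the original post and isn't reproduced here; a scaled-logistic parameterization consistent with the surrounding description (an assumed form, not necessarily the article's exact equation) would be:

```latex
p(x) = \frac{2a}{1 + e^{b x}}
```

where $x$ is the price. This gives $p(0) = a$ and $p(x) \to 0$ as $x$ grows, with larger $b$ steepening the decay.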
Here, *a* denotes the maximum achievable probability of purchase, while *b* modulates the sensitivity of the demand curve to price changes. A higher value of *b* means a steeper curve, approaching low purchase probabilities more rapidly as the price increases.

For any given price point, we can then obtain an associated purchase probability, *p*. We can feed *p* into a Bernoulli random variable generator to simulate a customer's response to a particular price proposal. In other words, given a price, we can easily emulate our reward function.
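As a concrete sketch of that reward function (assuming a scaled-logistic demand curve, since the article's exact formula was shown as an image; the function names and default parameter values here are illustrative):

```python
import math
import random

def purchase_probability(price, a=0.8, b=0.05):
    """Assumed scaled-logistic demand curve: equals a at price 0, decays toward 0."""
    return 2 * a / (1 + math.exp(b * price))

def reward(price, a=0.8, b=0.05, rng=random):
    """Simulate one customer: a Bernoulli draw with success probability p(price).

    The realized revenue is `price` on a sale and 0 otherwise.
    """
    return price if rng.random() < purchase_probability(price, a, b) else 0.0
```

Averaged over many simulated customers, the reward at a given price converges to price times purchase probability, which is exactly the expected revenue the bandits will try to maximize.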

Next, we can multiply this function by the price to get the expected revenue for a given price point:

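In symbols, with $x$ the test price and $p(x)$ the purchase probability, the reward $x \cdot \mathrm{Bernoulli}(p(x))$ has expectation:

```latex
\mathbb{E}[\text{revenue}](x) = x \cdot p(x)
```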
Unsurprisingly, this function doesn't reach its maximum where the probability is highest. Also, the price associated with the maximum doesn't depend on the value of the parameter *a*, whereas the maximum expected return does.

With some recollection from calculus, we can also derive the formula for the derivative (you'll need to use a combination of the product and chain rules). It's not exactly a relaxing exercise, but it's nothing too complicated. Here is the analytical expression of the derivative of the expected revenue:

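By the product rule, the derivative of $r(x) = x\,p(x)$ is $r'(x) = p(x) + x\,p'(x)$, with the chain rule handling the logistic term. Under an assumed scaled-logistic curve $p(x) = 2a/(1 + e^{bx})$ (the article's exact formula was an image), this works out to:

```latex
r'(x) = \frac{2a\left(1 + e^{bx} - b x\, e^{bx}\right)}{\left(1 + e^{bx}\right)^{2}}
```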
This derivative allows us to find the exact price that maximizes our expected revenue curve. In other words, by using this formula in tandem with a numerical algorithm, we can easily determine the price that sets it to 0. This, in turn, is the price that maximizes the expected revenue.

And this is exactly what we need, since by fixing the values of *a* and *b*, we'll immediately know the target price that our bandits have to find. Coding this in Python is a matter of a few lines of code:
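A minimal sketch of that computation (again assuming a scaled-logistic demand curve with illustrative parameter values; here the optimum is located by ternary search on the unimodal revenue curve, a simple alternative to root-finding on the derivative):

```python
import math

def purchase_probability(price, a=0.8, b=0.05):
    # Assumed scaled-logistic demand curve: equals a at price 0, decays toward 0.
    return 2 * a / (1 + math.exp(b * price))

def expected_revenue(price, a=0.8, b=0.05):
    # Expected revenue of a Bernoulli sale: price times purchase probability.
    return price * purchase_probability(price, a, b)

def optimal_price(a=0.8, b=0.05, lo=0.0, hi=500.0, tol=1e-7):
    # Ternary search: the expected-revenue curve is unimodal in price,
    # so we can repeatedly shrink the bracket around its maximum.
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if expected_revenue(m1, a, b) < expected_revenue(m2, a, b):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

With b = 0.05 the optimum lands near a price of 25.6 under this form, and changing a leaves the optimal price untouched while scaling the achievable revenue, matching the observation above.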