Enterprise Analytics | Fixing Enterprise Case Examine Assignments


Transport and logistics, meals and purchasing, funds, day by day wants, enterprise, information and leisure, Gojek, an Indonesian agency does all of it by a cell app contributing greater than $7 billion to the economic system. It has 900K registered retailers, greater than 190M app downloads, and greater than 2M drivers delivering greater than 180K orders inside a span of 120 minutes. It’s a behemoth!  Utilizing enterprise analytics, we can be fixing a case research. Under is the final of the 20+ companies it gives:
  • Transport and Logistics
    • Go-ride – Your two-wheeler taxi, the indigenous Ojek
    • Go-car – Consolation on wheels. Sit again. Sleep. Snore.
    • Go-send – Ship or get packages delivered inside hours.
    • Go-box – Transferring out? We’ll do the weights.
    • Go-bluebird – Experience unique with the Bluebird.
    • Go-transit -Your commute assistant, with or with out Gojek
  • Meals & Purchasing
    • Go-mall – Store from an internet market
    • Go-mart – House supply from close by shops
    • Go-med – Purchase medicines, nutritional vitamins, and so on from licensed pharmacies.
  • Funds
    • Go-pay – Drop the pockets and go cashless
    • Go-bills – Pay payments, fast and easily
    • Paylater – Order now pay later.
    • Go-pulsa – Knowledge or speak time, top-up on the go.
    • Go-sure – Insure belongings you worth.
    • Go-give – Donate for what issues, contact lives.
    • Go-investasi – Make investments good, save higher.
  • Each day wants
    • GoFitness permits customers to entry workouts akin to yoga, pilates, pound match, barre, muay thai and Zumba.
  • Enterprise
    • Go-biz – A service provider #SuperApp to run and develop enterprise.
  • Information & Leisure
    • Go-tix – E book your present, Skip the queue.
    • Go-play – App for films and collection.
    • Go-games – Gaming suggestions developments and so on
    • Go-news – High information from high aggregators.

Knowledge generated by these companies is big and GO workforce has engineering options to deal with with day after day knowledge engineering points. Central Analytics and Science Crew(CAST) allows a number of merchandise throughout the Gojek ecosystem to effectively use the abundance of information concerned within the working of the app. The workforce has analysts, knowledge scientists, knowledge engineers, enterprise analysts, and choice scientists engaged on creating in-house deep analytics options and different ML methods.

GoJek central analysis of data science team | Business Analytics | Business case | Case study | Business

The analysts’ position is focused on fixing day-to-day enterprise issues, having good enterprise information, creating impression, deriving insights, RCA’s(root trigger evaluation), and maintaining high administration knowledgeable on micro in addition to macro metrics, and product selections to handle enterprise issues.

Studying Targets

  • RCA on development drivers and headwinds confronted by the organizations.
    • Utilizing Pandas for EDA, slicing, and dicing.
  • Advertising price range optimization
    • Income because the north star metric(L0 metric)
    • Utilizing Pulp solver to unravel LP.
    • Writing LP issues utilizing Pulps with clear and crisp directions.
  • Linear regression and cross-validation
    • Easy regression train utilizing the steps offered within the questionnaire.

This text was revealed as part of the Data Science Blogathon.

Drawback Assertion

Half I

GOJEK administrators have requested BI analysts to have a look at the information to grasp what occurred throughout Q1 2016 and what they need to do to maximise the income for Q2 2016.

  • Given the information in Drawback A, what are the principle issues that we have to deal with?
  • Given the information in Desk B, how will you maximize the revenue if we solely have a price range of IDR 40,000,000,000?
  • Current your findings and concrete options for a administration assembly.

Half II

  • Drawback Utilizing a number of linear regression, predict the total_cbv.
  • Create 1 mannequin for every service.
  • Forecast interval = 2016-03-30, 2016-03-31, and 2016-04-01
  • Practice interval = the remaining Checklist of predictors to make use of:
    • Day of month
    • Month
    • Day of week
    • Weekend/weekday flag (weekend = Saturday & Sunday)
  • Pre-processing (do it on this order):
    • Take away GO-TIX
    • Preserve solely `Cancelled` order_status
    • Guarantee the entire combos (cartesian product) of date and repair are current
    • Impute lacking values with 0
    • Create is_weekend flag predictor (1 if Saturday/Sunday, 0 if different days)
    • One-hot encode month and day of week predictors
    • Standardize all predictors into z-scores utilizing the imply and normal deviation from train-period knowledge solely
  • Analysis metric: MAPE Validation: 3-fold scheme. Every validation fold has the identical size because the forecast interval.
  • Query 1 – After all of the pre-processing steps, what’s the worth of all of the predictors for service = GO-FOOD, date = 2016-02-28?
  • Query 2 – Present the primary 6 rows of one-hot encoded variables (month and day of the week)
  • Query 3 – Print the primary 6 rows of the information after pre-processing for service = GO-KILAT. Type ascendingly by date
  • Query 4 – Compute the forecast-period MAPE for every service. Show in ascending order primarily based on the MAPE
  • Query 5 – Create graphs to indicate the efficiency of every validation fold. One graph one service. x = date, y = total_cbv. Colour: black = precise total_cbv, different colours = the fold predictions (there must be 3 different colours). Solely present the validation interval. For instance, if rows 11, 12, and 13 have been used for validations, then don’t present the opposite rows within the graphs. Clearly present the month and date on the x-axis

Half III

Our GO-FOOD service in Surabaya carried out very properly final month – they’d 20% extra accomplished orders final month than the month earlier than. The supervisor of GO-FOOD in Surabaya must see what is going on as a way to consistently keep this success for the subsequent month onwards.

  • What quantitative strategies would you utilize to guage the sudden development? How would you consider the purchasers’ habits?


The Resolution to Half One

Earlier than starting to unravel, begin researching blogs and whitepapers which are current on the corporate web site(hyperlinks are added beneath). Firm archives present helpful assets that act as guides and assist perceive what the corporate stands for or what the corporate is anticipating out of this position. Questions one and three might be thought-about open-ended issues. Query two is an easy train on regression, not essentially specializing in the very best mannequin, however the focus is on the processes concerned in constructing a mannequin.

RCA on Development Drivers and Headwinds Confronted by the Organizations

Import knowledge:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
#import csv

sales_df =pd.read_csv('')

print("Form of the df")


print("NULL CHECK")

print("NULL CHECK")

print("df INFO")

Business Analytics | Business case | Case study | Business

Create pandas datetime from object format. Pandas datetimes is a straightforward format to work with and manipulate dates. Derive the month column from datetime. Filter out month 4(April) as properly. Rename months as Jan, Feb, March.

## convert up to now time 
# convert order_status to strinf

time_to_pandas_time = ["date"]

for cols in time_to_pandas_time:
  sales_df[cols] = pd.to_datetime(sales_df[cols])


sales_df['Month'] = sales_df['date'].dt.month 


sales_df[sales_df['Month'] !=4]

Q1_2016_df = sales_df[sales_df['Month'] !=4]

Q1_2016_df['Month'] = np.the place(Q1_2016_df['Month'] == 1,"Jan",np.the place(Q1_2016_df['Month'] == 2,"Feb",np.the place(Q1_2016_df['Month'] == 3,"Mar","Apr")))



#import csv

On the group stage, general income has grown by 14%. This can be a optimistic consequence. Let’s break this down by numerous companies and determine companies which are performing properly.

revenue_total.sort_values(["Jan"], ascending=[False],inplace=True)


revenue_total['cummul1'] = revenue_total["Jan"].cumsum()
revenue_total['cummul2'] = revenue_total["Feb"].cumsum()
revenue_total['cummul3'] = revenue_total["Mar"].cumsum()

top_95_revenue = revenue_total[revenue_total["cummul3"]<=95 ] 

ninety_five_perc_gmv = listing(top_95_revenue.service.distinctive())

top_95_revenue_plot = top_95_revenue[["Jan", "Feb", "Mar"]]
top_95_revenue_plot.index = top_95_revenue.service

## share of income is modified however has the general income modified for these high 4 companies#import csv
revenue change | Business Analytics | Business case | Case study | Business
  • For all three months, Experience, Meals, Store, and Ship contribute to greater than 90% internet income share.(In Jan Experience contributed to 51% of internet income.)
  • Therefore following the 80:20 rule for the latest month, we are able to prohibit this evaluation to the highest 3 companies, specifically – Experience, Meals, Ship.
  • Out of the 11 accessible companies, solely 3 contribute to greater than 90% of income. This can be a explanation for concern and there may be immense alternative for the remainder of the companies to develop.

Accomplished Rides

## NET - accomplished rides
Q1_2016_df_pivot_cbv_4 = Q1_2016_df[Q1_2016_df["order_status"] == "Accomplished"]
Q1_2016_df_pivot_cbv_4 = Q1_2016_df_pivot_cbv_4[Q1_2016_df_pivot_cbv_4.service.isin(ninety_five_perc_gmv)]

Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv_4.pivot_table(index='service', columns=['Month' ], values="total_cbv", aggfunc="sum")
# show(Q1_2016_df_pivot_cbv.head())
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv[["Jan",	"Feb",		"Mar"]]

for cols in Q1_2016_df_pivot_cbv.columns:



## We see that go store as lowered its income however others the income is fixed. 

Q1_2016_df_pivot_cbv_4 = Q1_2016_df_pivot_cbv
Q1_2016_df_pivot_cbv_4.reset_index(inplace = True)

Q1_2016_df_pivot_cbv_4["Feb_jan_growth"] = (Q1_2016_df_pivot_cbv_4.Feb / Q1_2016_df_pivot_cbv_4.Jan -1)*100
Q1_2016_df_pivot_cbv_4["Mar_Feb_growth"] = (Q1_2016_df_pivot_cbv_4.Mar / Q1_2016_df_pivot_cbv_4.Feb -1)*100

show(Q1_2016_df_pivot_cbv_4)#import csv
Gojek completed rides | Business Analytics | Business case | Case study | Business
Gojek services | Business Analytics | Business case | Case study | Business
  • Experience – which is the revenue-driving engine has grown by 19%(Jan to March) in comparison with Ship which has grown by 25%.
  • Meals has degrown by 7%, given meals supply as a enterprise is rising across the globe, and it is a main explanation for concern.

Canceled Rides(Misplaced Alternative)

Q1_2016_df_pivot_cbv = Q1_2016_df[Q1_2016_df["order_status"] != "Accomplished"]
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv.pivot_table(index='service', columns=['Month' ], values="total_cbv", aggfunc="sum")
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv[["Jan",	"Feb",		"Mar"]]

revenue_total = pd.DataFrame()

for cols in Q1_2016_df_pivot_cbv.columns:

revenue_total.reset_index(inplace = True)

overall_cbv =     Q1_2016_df_pivot_cbv.sum()

overall_cbv =     Q1_2016_df_pivot_cbv.sum()
overall_cbv_df = pd.DataFrame(knowledge = overall_cbv).T

overall_cbv_df["Feb_jan_growth"] = (overall_cbv_df.Feb / overall_cbv_df.Jan -1)*100
overall_cbv_df["Mar_Feb_growth"] = (overall_cbv_df.Mar / overall_cbv_df.Feb -1)*100


revenue_total.sort_values(["Jan"], ascending=[False],inplace=True)


revenue_total['cummul1'] = revenue_total["Jan"].cumsum()
revenue_total['cummul2'] = revenue_total["Feb"].cumsum()
revenue_total['cummul3'] = revenue_total["Mar"].cumsum()

top_95_revenue = revenue_total[revenue_total["cummul3"]<=95 ] 

ninety_five_perc_gmv = listing(top_95_revenue.service.distinctive())
Cancelled rides | Business Analytics | Business case | Case study | Business
  • Misplaced income has grown by 6%.
  • Administrators can improve their efforts to scale back this to lower than 5%.

Evaluation of Orders

Q1_2016_df_can_com = Q1_2016_df[Q1_2016_df.order_status.isin(["Cancelled", "Completed"])]
Q1_2016_df_can_com = Q1_2016_df_can_com[Q1_2016_df_can_com.service.isin(ninety_five_perc_gmv)]

Q1_2016_df_pivot = Q1_2016_df_can_com.pivot_table(index='service', columns=['order_status','Month' ], values="num_orders", aggfunc="sum")
Q1_2016_df_pivot.fillna(0, inplace = True)

multi_tuples =[
                ('Cancelled', 'Jan'),
               ('Cancelled', 'Feb'),
            ('Cancelled', 'Mar'),
            ('Completed', 'Jan'),
            ('Completed', 'Feb'),
            ('Completed', 'Mar')]

multi_cols = pd.MultiIndex.from_tuples(multi_tuples, names=['Experiment', 'Lead Time'])

Q1_2016_df_pivot = pd.DataFrame(Q1_2016_df_pivot, columns=multi_cols)


Q1_2016_df_pivot.columns = ['_'.join(col) for col in Q1_2016_df_pivot.columns.values]

#import csv

Q1_2016_df_pivot["jan_total"] = Q1_2016_df_pivot.Cancelled_Jan  + Q1_2016_df_pivot.Completed_Jan
Q1_2016_df_pivot["feb_total"] = Q1_2016_df_pivot.Cancelled_Feb  + Q1_2016_df_pivot.Completed_Feb
Q1_2016_df_pivot["mar_total"] = Q1_2016_df_pivot.Cancelled_Mar  + Q1_2016_df_pivot.Completed_Mar

Q1_2016_df_pivot[ "Cancelled_Jan_ratio"	]    =Q1_2016_df_pivot.Cancelled_Jan/Q1_2016_df_pivot.jan_total
Q1_2016_df_pivot[	"Cancelled_Feb_ratio"	]=Q1_2016_df_pivot.Cancelled_Feb/Q1_2016_df_pivot.feb_total
Q1_2016_df_pivot[	"Cancelled_Mar_ratio"	]=Q1_2016_df_pivot.Cancelled_Mar/Q1_2016_df_pivot.mar_total
Q1_2016_df_pivot[	"Completed_Jan_ratio"	]=Q1_2016_df_pivot.Completed_Jan/Q1_2016_df_pivot.jan_total
Q1_2016_df_pivot[	"Completed_Feb_ratio"	]=Q1_2016_df_pivot.Completed_Feb/Q1_2016_df_pivot.feb_total
Q1_2016_df_pivot[	"Completed_Mar_ratio" ]  =Q1_2016_df_pivot.Completed_Mar/Q1_2016_df_pivot.mar_total

Q1_2016_df_pivot_1 = Q1_2016_df_pivot[["Cancelled_Jan_ratio"

Analysis of order | Business Analytics | Business case | Case study | Business
  • In March, Meals, Experience, Ship had 17%,15%, and 13% of whole orders canceled respectively.
  • Meals has elevated its order completion charge, from 69% in January to 83% in March. This can be a vital enchancment.
## column smart cancellation verify if elevated
perc_of_cols_orders = pd.DataFrame()

for cols in Q1_2016_df_pivot.columns:

perc_of_cols_cbv.T.plot(sort='bar', stacked=True)
perc_of_cols_orders.T.plot(sort='bar', stacked=True)
  • In March, of all of the rides canceled, Experience has 72% share of orders, adopted by Meals(17%) and ship(6%).

Abstract of Findings and Suggestions for Enterprise Analytics

  • Experience –
    • The highest contributor to income.
    • Cancellation(GMV) in March has grown by 42%
    • Scale back cancelations by product intervention and new product options.
  • Meals –
    • Canceled orders have elevated, however as a result of value optimization, GMV loss has been efficiently arrested.
    • Improve internet income by decreasing prices and cancellations.
    • Drive larger buyer acquisition.
  • Ship –
    • Canceled GMV and orders, each have taken a success and are a serious explanation for concern.
    • Good trip completion expertise, thus, growing retention and powering income development by retention.

Maximize Income By Optimizing Finances Spends

The Enterprise workforce has a price range of 40 Billon for Q2 and it has set development targets for every service. For every service, the price of incremental 100 rides and the utmost development goal in Q2 is given beneath.  For Go-Field, to get 100 extra bookings, it prices 40M, and the utmost development goal in Q2 is 7%.

Maximise the profits | Business Analytics | Business case | Case study | Business

Import price range knowledge and use gross sales knowledge from the above evaluation.

budget_df =pd.read_csv('')

print("Form of the df")


print("NULL CHECK")

print("NULL CHECK")

print("df INFO")


## convert up to now time 
# convert order_status to string

time_to_pandas_time = ["date"]

for cols in time_to_pandas_time:
  sales_df[cols] = pd.to_datetime(sales_df[cols])


sales_df['Month'] = sales_df['date'].dt.month 


sales_df_q1 = sales_df[sales_df['Month'] !=4]
## Assumptions
sales_df_q1 = sales_df_q1[sales_df_q1["order_status"] == "Accomplished"]

# Q1_2016_df_pivot = Q1_2016_df.pivot_table(index='service', columns=['order_status','Month' ], values="num_orders", aggfunc="sum")

sales_df_q1_pivot = sales_df_q1.pivot_table(index='service', columns=['order_status'], values="total_cbv", aggfunc="sum")
sales_df_q1_pivot_orders = sales_df_q1.pivot_table(index='service', columns=['order_status'], values="num_orders", aggfunc="sum")

sales_df_q1_pivot.reset_index(inplace = True)
sales_df_q1_pivot.columns = ["Service","Q1_revenue_completed"]

sales_df_q1_pivot_orders.reset_index(inplace = True)
sales_df_q1_pivot_orders.columns = ["Service","Q1_order_completed"]

optimization_Df = pd.merge(


optimization_Df = pd.merge(


optimization_Df.columns = ["Service",	"Q1_revenue_completed",	"Cost_per_100_inc_booking",	"max_q2_growth_rate","Q1_order_completed"]
#import csv
Business Analytics | Business case | Case study | Business
  • For Field, Q1 income is 23B, the associated fee for incremental 100 rides is 40M, its most anticipated development charge is 7% and 63K whole rides have been accomplished @ 370K per order.

Is it attainable to attain the utmost development charge for all of the companies with an accessible price range of 40B?

## If all service max development is to be achived what's the price range wanted? and whats the deficiet?
optimization_Df["max_q2_growth_rate_upd"]   = optimization_Df['max_q2_growth_rate'].str.extract('(d+)').astype(int)                 ## extract int from string
optimization_Df["max_growth_q2_cbv"]        = (optimization_Df.Q1_order_completed *(1+ optimization_Df.max_q2_growth_rate_upd/100))  ## Q2 max orders primarily based on Q1 orders
optimization_Df["abs_inc_orders"]           = optimization_Df.max_growth_q2_cbv-optimization_Df.Q1_order_completed                   ## Whole improve in orders 
optimization_Df["cost_of_max_inc_q2_order"] = optimization_Df.abs_inc_orders * optimization_Df.Cost_per_100_inc_booking /100         ## Whole Value to get most development for every serivce


show(budget_df[budget_df["Service"] == "Finances:"].reset_index())
budget_max = budget_df[budget_df["Service"] == "Finances:"].reset_index()
budget_max = budget_max.iloc[:,2:3].values[0][0]
print("Finances distinction by")
show(budget_max-optimization_Df.cost_of_max_inc_q2_order.sum() )

## Due to this fact max of the the whole lot can't be achieved#import csv

The reply is No. 247B(247,244,617,204) extra price range is required to attain development targets for all companies.

Is it attainable to attain at the very least 10% of the utmost development charge for all of the companies with an accessible price range of 40B?

## Then what's the price range wanted and what's going to the additional price range at hand??
optimization_Df["min_10_max_growth_q2_cbv"]        = (optimization_Df.Q1_order_completed *(1+ optimization_Df.max_q2_growth_rate_upd/1000))  ## atleast 10% of max if achieved, that is orders 
optimization_Df["min_10_abs_inc_orders"]           = optimization_Df.min_10_max_growth_q2_cbv-optimization_Df.Q1_order_completed             ## what's the improve in orders wanted to attain 10% orders development
optimization_Df["min_10_cost_of_max_inc_q2_order"] = optimization_Df.min_10_abs_inc_orders * optimization_Df.Cost_per_100_inc_booking /100   ## Value associatedfor 10% improve in orders 

show(budget_max-optimization_Df.min_10_cost_of_max_inc_q2_order.sum() )  ## Whole price range remaining

show((budget_max-optimization_Df.min_10_cost_of_max_inc_q2_order.sum())/budget_max)  ## Finances utilization proportion 

optimization_Df["perc_min_10_max_growth_q2_cbv"] =( ( optimization_Df.max_q2_growth_rate_upd/1000))  ## atleast 10% of max if achieved, 7 to % divide by 100, 10% of this quantity. divide by 10, so 1000
optimization_Df["perc_max_growth_q2_cbv"]        =( ( optimization_Df.max_q2_growth_rate_upd/100))   ## Max development to be achieved

optimization_Df["q1_aov"] = optimization_Df.Q1_revenue_completed/optimization_Df.Q1_order_completed  ## Q1 common order worth
optimization_Df["order_profitability"] = 0.1  ## that is assumption that 10% can be revenue

optimization_Df["a_orders_Q2"]        = (optimization_Df.Q1_order_completed *(1+ optimization_Df.perc_min_10_max_growth_q2_cbv))  ## primarily based on 10% development, whole new orders for qc

optimization_Df["a_abs_inc_orders"]   = optimization_Df.a_orders_Q2-optimization_Df.Q1_order_completed

optimization_Df["a_Q2_costs"] = optimization_Df.Cost_per_100_inc_booking* optimization_Df.a_abs_inc_orders/100

##There's scope for enchancment right here, so This may be adjusted primarily based on income or rating from Q1
show(budget_max - optimization_Df.a_Q2_costs.sum())

optimization_Df#import csv

The reply is Sure. With solely 28% of the accessible 40B price range, this may be achieved. Underutilization of the accessible price range is rarely an choice, and no enterprise chief would use solely 28% of the accessible price range.

So, the utmost development throughout all companies can’t be achieved, and attaining 10% of the utmost development charge will result in an underutilized price range. Therefore the necessity right here is to optimize spending such that:

  • The general money burn doesn’t cross 40B.
  • The general development charge in Q2 throughout companies is the same as or beneath the utmost development charge.
  • There are known as constraints in Linear optimization.
  • The target is to Maximize income.

Assumptions used right here:

  • Each service has a revenue of 10%.
  • AOV(income/orders) will stay the identical as in Q1.

Pre-optimization knowledge pipeline:

## Knowledge prep for pulp optimization
perc_all_df = pd.DataFrame(knowledge = listing(vary(1,optimization_Df.max_q2_growth_rate_upd.max()+1)), columns = ["growth_perc"])  
## create a listing of all proportion development, from 1 to max to development anticipated, that is to create simulation for optimization

optimization_Df_2 = optimization_Df.merge(perc_all_df, how = "cross")  ## cross be part of with opti DF

## Filter and maintaining all percentgaes upto most for every service
## Minimal proportion saved is 1
optimization_Df_2["filter_flag"] = np.the place(optimization_Df_2.max_q2_growth_rate_upd >= (optimization_Df_2.growth_perc),1,0)
optimization_Df_2["abs_profit"] =  (optimization_Df_2.q1_aov)*(optimization_Df_2.order_profitability)
optimization_Df_3 = optimization_Df_2[optimization_Df_2["filter_flag"] == 1]


## Filter columns wanted
optimization_Df_4 = optimization_Df_3[[
    'Service',                       ## services offered
    'Cost_per_100_inc_booking',      ## cost of additional 100 orders
    'Q1_order_completed',            ## to calculate q2 growth based on q1 orders
    'perc_min_10_max_growth_q2_cbv', ## minimum growth percent need 
    'perc_max_growth_q2_cbv',        ## max growth percent allowed
    'abs_profit',                    ## profit per order 
    'growth_perc'                    ## to simulative growth percet across


optimization_Df_4["orders_Q2"]        = (optimization_Df_4.Q1_order_completed *(1+ optimization_Df_4.growth_perc/100))  ## primarily based on development, whole new orders for qc
optimization_Df_4["abs_inc_orders"]   = optimization_Df_4.orders_Q2-optimization_Df_4.Q1_order_completed
optimization_Df_4["profit_Q2_cbv"]    = optimization_Df_4.orders_Q2 * optimization_Df_4.abs_profit
optimization_Df_4["growth_perc"]      = optimization_Df_4.growth_perc/100
optimization_Df_4["Q2_costs"]         = optimization_Df_4.Cost_per_100_inc_booking* optimization_Df_4.abs_inc_orders/100


optimization_Df_5 = optimization_Df_4[[
    'Service',                       ## services offered
    'Q2_costs',                      ## cost total for the growth expected
    'perc_min_10_max_growth_q2_cbv', ## minimum growth percent need 
    'perc_max_growth_q2_cbv',        ## max growth percent allowed
    'profit_Q2_cbv',                 ## total profit at the  assumed order_profitability rate
    'growth_perc'                    ## to simulative growth percet across



Understanding the Optimization Dataset

  • Service – Go product.
  • 10% of max development, is the minimal development that every service ought to obtain. So Field ought to at the very least obtain 0.7% development.
  • Max development determined by enterprise leaders for Field is 7%.
  • For Field, 1% to 7% is the vary of development.1% is greater than 0.7% and seven% is the utmost. The optimizer will select the very best development charge primarily based on constraints.
    • This can be a choice variable. The algorithm will decide one amongst 7.
  • For 1% development(Incremental), the money burn is 255M.
  • If incremental development is 1%, then general revenue(natural + inorganic) is 2.4B.
## Greatest optimization for our case case. That is good. 

prob = LpProblem("growth_maximize", LpMaximize)    ## Initialize optimization drawback - Maximization drawback

optimization_Df_5.reset_index(inplace = True, drop = True)   
markdowns = listing(optimization_Df_5['growth_perc'].distinctive())   ## Checklist of all development percentages
cost_v    = listing(optimization_Df_5['Q2_costs'])               ## Checklist of all incremental value to attain the expansion % wanted

perc_min_10_max_growth_q2_cbv = listing(optimization_Df_5['perc_min_10_max_growth_q2_cbv'])
growth_perc                   = listing(optimization_Df_5['growth_perc'])

## lp variables
low     = LpVariable.dicts("l_", perc_min_10_max_growth_q2_cbv, lowBound = 0, cat = "Steady")
development  = LpVariable.dicts("g_", growth_perc, lowBound = 0, cat = "Steady")
delta   = LpVariable.dicts ("d", markdowns, 0, 1, LpBinary)
x       = LpVariable.dicts ("x", vary(0, len(optimization_Df_5)), 0, 1, LpBinary)

## goal perform - Maximise revenue, column identify - profit_Q2_cbv
## Assign worth for every of the rows -
## For all rows within the desk every row can be assidned x_0, x_1, x_2 and so on and so on
## That is later used to filter the optimum development % 
prob += lpSum(x[i] * optimization_Df_5.loc[i, 'profit_Q2_cbv'] for i in vary(0, len(optimization_Df_5)))

## one distinctive development percentahe for every service
## Constraint one 
for i in optimization_Df_5['Service'].distinctive():
                prob += lpSum([x[idx] for idx in optimization_Df_5[(optimization_Df_5['Service'] == i) ].index]) == 1

## Don't cross whole price range
## Constraint two
prob += (lpSum(x[i] * optimization_Df_5.loc[i, 'Q2_costs'] for i in vary(0, len(optimization_Df_5))) - budget_max) <= 0 

## constraint to say minimal must be achived
for i in vary(0, len(optimization_Df_5)):
  prob +=  lpSum(x[i] * optimization_Df_5.loc[i, 'growth_perc'] ) >=  lpSum(x[i] * optimization_Df_5.loc[i, 'perc_min_10_max_growth_q2_cbv'] )

prob.writeLP('markdown_problem')     ## Write Drawback identify
prob.clear up()                         ## Clear up Drawback
show(LpStatus[prob.status])       ## Drawback standing - Optimum, if drawback solved efficiently
show(worth(prob.goal))       ## Goal, on this case what's the maximized revenue with availble price range  - 98731060158.842 @ 10% revenue per order #import csv

Understanding The best way to Write An LP Drawback is Key to Fixing it

  • Initialize the issue
    • prob = LpProblem(“growth_maximize”, LpMaximize)
    • growth_maximize is the identify of the issue.
    • LpMaximize is letting the solver know that it’s a maximization drawback.
  • Create a variable of the choice perform
    • development = LpVariable.dicts(“g_”, growth_perc, lowBound = 0, cat = “Steady”)
    • For Pulp, pulp dicts must be created
    • g_ is the prefix for the variable.
    • growth_perc is the identify of the listing
    • low sure is the minimal development %, it may well begin from 0.
    • The variable is steady.
    • There are 60 distinctive development percentages from 1%(minimal) to 60%(most). (Meals has a 60% most development charge).
    • Variables – 0 <= x_0 <= 1 Integer for row 0 to 0 <= x_279 <= 1 Integer for row 279.
  • Add goal perform to the issue
    • prob += lpSum(x[i] * optimization_Df_5.loc[i, ‘profit_Q2_cbv’] for i in vary(0, len(optimization_Df_5)))
    • An equation is created by pulp -> 2423147615.954*x_0 + 2447139176.5080004*x_1 + 225916468.96*x_3+ …. + 8576395.965000002*x_279. There are 280 rows within the dataset, so for every revenue worth, a variable is created.
  • Add constraint:
    • One – One development proportion for every service
      • for i in optimization_Df_5[‘Service’].distinctive(): prob += lpSum([x[idx] for idx in optimization_Df_5[(optimization_Df_5[‘Service’] == i) ].index]) == 1
      • For every service, solely choose one development %.
      • For Field out of 1 to 7 choose just one.
      • The equation for field – _C1: x_0 + x_1 + x_2 + x_3 + x_4 + x_5 + x_6 = 1
      • The equation for GLAM – _C2: x_10 + x_11 + x_12 + x_13 + x_14 + x_15 + x_16 + x_7 + x_8 + x_9 = 1
      • As there are 11 companies so 11 constraints are created, one for every service.
    • Two – Don’t cross the entire price range of 40B
      • prob += (lpSum(x[i] * optimization_Df_5.loc[i, ‘Q2_costs’] for i inrange(0, len(optimization_Df_5))) – budget_max) <= 0
      • The sum of all prices minus the entire price range must be lower than or equal to zero.
      • Equation _C12: 255040000 x_0 + 510080000 x_1 + …. + 16604 x_279 <= 0
      • _C12: is the one constraint right here as a result of, there may be one whole price range of 40B, and there’s no constraint on how a lot every service can spend.
    • Three – constraint to say minimal must be achieved
      • for i in vary(0, len(optimization_Df_5)): prob += lpSum(x[i] * optimization_Df_5.loc[i, ‘growth_perc’] ) >= lpSum(x[i] * optimization_Df_5.loc[i, ‘perc_min_10_max_growth_q2_cbv’] )
      • For every row, the minimal development % constraint equation is created. There are 279 rows, so 279 constraints are created.
      • _C13: 0.003 x_0 >= 0 from row 0 to _C292: 0.315 x_279 >= 0 to row 279.
    • Optimum‘” is the specified output.
      • show(LpStatus[prob.status])
    • 98731060158.842 is the maximized revenue.
      • show(worth(prob.goal))
var_name = []
var_values = []
for variable in prob.variables():
    if 'x' in variable.identify:

outcomes = pd.DataFrame()

outcomes['variable_name'] = var_name
outcomes['variable_values'] = var_values
outcomes['variable_name_1'] = outcomes['variable_name'].apply(lambda x: x.cut up('_')[0])
outcomes['variable_name_2'] = outcomes['variable_name'].apply(lambda x: x.cut up('_')[1])
outcomes['variable_name_2'] = outcomes['variable_name_2'].astype(int)
outcomes.sort_values(by='variable_name_2', inplace=True)
outcomes.drop(columns=['variable_name_1', 'variable_name_2'], inplace=True)
outcomes.drop(columns="index", axis=1, inplace=True)

# outcomes.head()

optimization_Df_5['variable_name'] = outcomes['variable_name'].copy()
optimization_Df_5['variable_values'] = outcomes['variable_values'].copy()
optimization_Df_5['variable_values'] = optimization_Df_5['variable_values'].astype(int)# optimization_Df_6.head()

#import csv## with no price range contraint
optimization_Df_10 = optimization_Df_5[optimization_Df_5['variable_values'] == 1].reset_index()

optimization_Df_10["flag"] = np.the place(optimization_Df_10.growth_perc >= optimization_Df_10.perc_min_10_max_growth_q2_cbv,1,0)


show(budget_max - optimization_Df_10.Q2_costs.sum())
show( optimization_Df_10.Q2_costs.sum())
  • The utmost development charge for respective companies is within the chart above. For Field it’s 1%, for Clear it’s 1%, for Meals it’s 17%, and so on.
  • The overall money burn is – 39999532404.0
  • Underutilized price range – 467596.0
  • Maximized revenue – 98731060158.0

The Resolution to Half Two

sales_df =pd.read_csv('')

time_to_pandas_time = ["date"]

for cols in time_to_pandas_time:
  sales_df[cols] = pd.to_datetime(sales_df[cols])

sales_df['Month'] = sales_df['date'].dt.month 

Q1_2016_df = sales_df[sales_df['Month'] !=900]

Q1_2016_df['Month'] = np.the place(Q1_2016_df['Month'] == 1,"Jan",np.the place(Q1_2016_df['Month'] == 2,"Feb",np.the place(Q1_2016_df['Month'] == 3,"Mar","Apr")))

Q1_2016_df['test_control'] = np.the place(Q1_2016_df['date'] <= "2016-03-30","prepare", "check")




#import csv
  • Import dataset
  • Convert date to pandas datetime
  • Derive month columns
  • Derive prepare and check columns

Q1_2016_df_2 = Q1_2016_df[Q1_2016_df["date"] <= "2016-04-01"]

Q1_2016_df_2 = Q1_2016_df_2[Q1_2016_df["order_status"]  == "Cancelled"] 

Q1_2016_df_date_unique = Q1_2016_df_2[["date"]].drop_duplicates()
Q1_2016_df_date_service = Q1_2016_df_2[["service"]].drop_duplicates()

Q1_2016_df_CJ = Q1_2016_df_date_unique.merge(Q1_2016_df_date_service, how = "cross")  ## cross be part of with opti DF


Q1_2016_df_3 = Q1_2016_df_CJ.merge(Q1_2016_df_2, on=['date','service'], how='left', suffixes=('_x', '_y'))


Q1_2016_df_3["total_cbv"].fillna(0, inplace = True)
print("Null verify ",Q1_2016_df_3.isnull().values.any())

nan_rows = Q1_2016_df_3[Q1_2016_df_3['total_cbv'].isnull()]


Q1_2016_df_3["dayofweek"] = Q1_2016_df_3["date"].dt.dayofweek
Q1_2016_df_3["dayofmonth"] = Q1_2016_df_3["date"]

Q1_2016_df_3["Is_Weekend"] = Q1_2016_df_3["date"].dt.day_name().isin(['Saturday', 'Sunday'])

  • Filter for under canceled orders.
  • For all companies, cross be part of with dates from Jan 01 to Apr 01, in order that predictions for all days can be found.
  • Change NULL with 0.
  • Derive day of the month
  • Derive day of the week.
  • Create binary weekend/weekday column
Q1_2016_df_4 = Q1_2016_df_3[Q1_2016_df_3["service"] != "GO-TIX"]

Q1_2016_df_5 = pd.get_dummies(Q1_2016_df_4, columns=["Month","dayofweek"])


import numpy as np
import pandas as pd
# from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import imply
from numpy import std
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_predict


all_columns = ['date', 'service', 'num_orders', 'order_status', 'total_cbv',
                'test_control', 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
                'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
                'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']

model_variables = [ 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
                  'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
                  'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']

target_Variable = ["total_cbv"]

all_columns = ['service', 
                'test_control', 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
                'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
                'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']
  • Filter out GO-TIX
  • One sizzling encode – Month and day of the week
  • Import all the mandatory libraries
  • Create a listing of columns, prepare, predictor, and so on.

model_1 = Q1_2016_df_5[Q1_2016_df_5["service"] =="GO-FOOD"]

check  = model_1[model_1["test_control"]!="prepare"]
prepare = model_1[model_1["test_control"]=="prepare"]

X = prepare[model_variables]
y = prepare[target_Variable]

train_predict = model_1[model_1["test_control"]=="prepare"]
x_ = X[model_variables]

sc = StandardScaler()
X_train = sc.fit_transform(X)
X_test = sc.rework(x_)
  • Filter knowledge for one service – GO-FOOD
  • Create prepare and check dataframes
  • Create X – with prepare columns, and y with predictor column.
  • Use Standardscalar for z-score transformation.
#outline customized perform which returns single output as metric rating
def NMAPE(y_true, y_pred): 
    return 1 - np.imply(np.abs((y_true - y_pred) / y_true)) * 100

#make scorer from custome perform
nmape_scorer = make_scorer(NMAPE)

# put together the cross-validation process
cv = KFold(n_splits=3, random_state=1, shuffle=True)
# create mannequin
mannequin = LinearRegression()
# consider mannequin
scores = cross_val_score(mannequin, X, y, scoring=nmape_scorer, cv=cv, n_jobs=-1)
# report efficiency
print('Accuracy: %.3f (%.3f)' % (imply(scores), std(scores)))

y_pred = cross_val_predict(mannequin, X, y,  cv=cv)
  • cross_val_score doesn’t have MAPE as an in-built scorer, so outline MAPE.
  • Create CV occasion
  • Create LR occasion
  • Use  cross_val_score to get the typical MAPE scores throughout CV Folds for GO-Meals.
  • For every service, this code might be lopped, create a perform to create

def go_model(Q1_2016_df_5, go_service,model_variables,target_Variable):
  model_1 = Q1_2016_df_5[Q1_2016_df_5["service"] ==go_service]

  check  = model_1[model_1["test_control"]!="prepare"]
  prepare = model_1[model_1["test_control"]=="prepare"]

  X = prepare[model_variables]
  y = prepare[target_Variable]

  train_predict = model_1[model_1["test_control"]=="prepare"]
  x_ = X[model_variables]

  X_train = sc.fit_transform(X)
  X_test = sc.rework(x_)

  # put together the cross-validation process
  cv = KFold(n_splits=3, random_state=1, shuffle=True)
  # create mannequin
  mannequin = LinearRegression()
  # consider mannequin
  scores = cross_val_score(mannequin, X, y, scoring=nmape_scorer, cv=cv, n_jobs=-1)
  # report efficiency
  print('Accuracy: %.3f (%.3f)' % (imply(scores), std(scores)))

  y_pred = cross_val_predict(mannequin, X, y,  cv=cv)

  return y_pred,imply(scores), std(scores)

a,b,c = go_model(Q1_2016_df_5, "GO-FOOD",model_variables,target_Variable)

  • Modeling steps transformed to a perform:
    • Q1_2016_df_5 – Base knowledge
    • go_service – go-tix, go-send and so on
    • model_variables – variables used to coach the mannequin
    • target_Variable – predictor variable(total_cbv).
  • For every service, the strategy might be run to get the typical forecast MAPE throughout all 11 companies.

The Resolution to Half Three

Query 3 is an open-ended query and readers are inspired to unravel it on their very own. Among the hypotheses are:

  • As that is particular to 1 particle space and geography, it’s secure to imagine that the APP roughly remained the identical, and product interventions may have performed solely a minor position. And if there was product intervention, it was simply particular to this specific space.
  • Good high quality/well-known eating places and meals chains have been onboarded, and customers now have lot of excellent choice to order from or order from acquainted eating places.
  • The supply velocity was considerably improved by onboarding a better variety of supply brokers.
  • Re-trained supply brokers successfully to scale back cancelations.
  • Labored with restaurant companions, to deal with peak-time chaos in a greater approach.

Helpful Sources and References

  • Working In The ‘Central Analytics and Science Crew’
  • How We Estimate Meals Debarkation Time With ‘Tensoba’
  • Enterprise Case Examine Assignments For Entry Degree Knowledge Analysts
  • Fixing Enterprise Case Examine Assignments For Knowledge Scientists
  • Using Knowledge To Respect Our Clients
  • Under the Hood of Gojek’s Automated Forecasting Software
  • Experimentation at Gojek
  • GO-JEK’s Influence for Indonesia
  • GO-FAST: The Knowledge Behind Ramadan
  • Pulp optimization.
  • Linear programming utilizing pulp.
  • Marketing marketing campaign optimization.
  • Simple methods to optimize one thing utilizing python.


Case research, when carried out proper, following the steps given above, can have a optimistic impression on the enterprise. Recruiters aren’t in search of solutions however an method to these solutions, the construction adopted, the reasoning used, and enterprise and sensible information utilizing enterprise analytics. This text gives an easy-to-follow framework for knowledge analysts utilizing an actual enterprise case research for instance.

Key Takeaways:

  • There are two approaches to reply this case research, bottom-up, top-down. Right here, bottom-up method has been thought-about, due to unfamiliarity with knowledge and unavailability of enterprise context.
  • Slicing and dicing the gross sales numbers throughout dimensions, figuring out developments and patterns throughout companies, is the very best method to determine the challenges for development.
  • Be crisp and to the purpose, whereas offering suggestions.
  • Let the information inform a narrative, as a substitute of simply proving knowledge factors – Eg: The highest three companies contribute in the direction of greater than 90% of income. Whereas on the group stage, development is on the optimistic aspect, at numerous companies, there are challenges with trip completion, driver cancellation, and so on. For Meals – Decreasing cancelations by Y% will drive larger revenues in Q2 by x% and so on.
  • Optimization utilizing pulp is intimidating when there are greater than 3 constraints. Writing down an LP drawback on a chunk of paper, then coding it out will certainly make the duty simpler.

Good luck! Right here’s my Linkedin profile if you wish to join with me or wish to assist enhance the article. Be happy to ping me on Topmate/Mentro; you possibly can drop me a message together with your question. I’ll be completely happy to be linked. Take a look at my different articles on knowledge science and analytics right here.

The media proven on this article shouldn’t be owned by Analytics Vidhya and is used on the Writer’s discretion.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button