
Seller-Side Experiments under Interference Induced by Feedback Loops in Two-Sided Platforms

Zhihua Zhu
Tencent
   Zheng Cai
Tencent
   Liang Zheng
Tencent
   Nian Si
Booth School of Business,
University of Chicago
Corresponding author: [email protected].
(March 3, 2025)
Abstract

Two-sided platforms are central to modern commerce and content sharing and often utilize A/B testing for developing new features. While user-side experiments are common, seller-side experiments become crucial for specific interventions and metrics. This paper investigates the effects of interference caused by feedback loops on seller-side experiments in two-sided platforms, with a particular focus on the counterfactual interleaving design, proposed in Ha-Thuc et al. (2020); Nandy et al. (2021). These feedback loops, often generated by pacing algorithms, cause outcomes from earlier sessions to influence subsequent ones. This paper contributes by creating a mathematical framework to analyze this interference, theoretically estimating its impact, and conducting empirical evaluations of the counterfactual interleaving design in real-world scenarios. Our research shows that feedback loops can result in misleading conclusions about the treatment effects.

1 Introduction

Two-sided platforms have increasingly integrated into our daily routines. We utilize shopping platforms like Amazon and Taobao to purchase and sell goods. Video-sharing platforms such as TikTok and Kuaishou allow us to view and upload content. Moreover, platforms like Booking.com and Airbnb have simplified the process of renting out or booking accommodations. A common workflow across these platforms involves users (typically buyers) initiating a request (session), which the platform then processes to match them with a ranked list of sellers or providers.

To ensure optimal user experience, these platforms routinely conduct experiments (A/B tests) before implementing new features. While user-side experiments are predominant, seller-side experiments, also known as supply-side, producer-side, ad-side, creator-side, or video-side experiments in various contexts, are also necessary when user-side experiments are either infeasible or inappropriate. For instance, certain interventions, such as updating a seller’s user interface, can only be applied to sellers. Furthermore, when the objective is to measure metrics like seller retention, seller-side experiments are more suitable.

In naive seller-side experiments, interference often presents more intensely. Interference means that the treatment assignment of some units could affect the outcomes of others (Imbens and Rubin, 2015). To illustrate this challenge, let us consider an advertisement recommendation system in a video-sharing platform. Imagine we are evaluating a new algorithm designed to boost new ads, a new “cold start” strategy. In a naive seller-side experiment, say we boost 50% of new ads, which constitute the treatment group, leaving the other half untouched. Due to this boost, ads within the treatment group naturally achieve a higher ranking. Yet, if we were to boost all new ads, the ones in our initial treatment group would actually descend in rank because of the increased volume of ads receiving the same boost. As a result, the data from the experiment could overstate the true impact.

To address this issue, Ha-Thuc et al. (2020) and Nandy et al. (2021) propose a counterfactual interleaving design and Wang and Ba (2023) enhance the design with a novel tie-breaking rule to guarantee consistency and monotonicity. In this approach, a subset of ads is randomly divided into control ads and treatment ads. At the same time, during the ranking phase, both the control strategy and treatment strategy are applied to rank all ads. These two ranking strategies can be referred to as Ranking C and Ranking T, respectively. The results of these two rankings are then merged to produce a final order, M. The merging process uses the order of control ads in Ranking C as their order in M, while the order of treatment ads in Ranking T determines their order in M. If there’s a position conflict, it’s resolved randomly: one ad retains its spot, while the other is shifted down a slot. Consequently, the placement of ads in the treatment group in the final ranking approximates their rank when all ads are sorted using the treatment strategy. Similarly, the placement of control group ads is nearly identical to their rank under the control strategy.

Following its introduction, this methodology was extensively implemented across major online platforms, such as Facebook, TikTok, and Kuaishou. Although the method appeared promising initially, our implementation and analysis on our platform revealed substantial interference, particularly in settings with feedback loops.

In contemporary recommendation systems, rankings in subsequent sessions can be influenced by the rankings of earlier sessions, primarily due to feedback loops. Take advertising recommendations as an example. Each ad campaign has a preset budget, which specifies an amount of money to be spent within a given time frame. To adhere to these budget limits, platforms often employ budget control mechanisms (Karlsson, 2020): they raise the bidding price when the remaining budget is high, and lower it when the budget is nearing depletion. Consequently, if an ad is shown more often in initial sessions, it may be recommended less in later sessions to balance spending, and vice versa. This feedback loop creates the interference.

Feedback loops are not exclusive to the context of advertising. In e-commerce, sales rates must be regulated to align with the inventory level, while on video-sharing platforms, a cold start algorithm for new videos must carefully manage the rate of exposure to balance between discovering new content and promoting popular content. To encompass these various situations, we refer to these feedback control algorithms as “pacing algorithms.” A typical workflow for platforms that incorporate a pacing algorithm is depicted in Figure 1.

[Figure 1 diagram: ML models compute scores, the pacing algorithm adjusts them, items are recommended to users based on the ranking, users interact with the recommended items, and the observed behaviors are recorded and fed back to the ML models.]
Figure 1: A standard pipeline in platforms with a pacing algorithm.
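
To make the feedback mechanism concrete, the following is a minimal Python sketch of a budget pacing rule of the kind described above; the function name, the linear feedback form, and the gain parameter are illustrative assumptions rather than any platform's actual controller.

```python
# A minimal sketch of a budget pacing rule: the score multiplier is raised when
# spend lags a uniform schedule and lowered when the budget is nearly depleted.
# The linear feedback form and the gain parameter are illustrative assumptions.

def pacing_multiplier(spent: float, budget: float, elapsed: float, horizon: float,
                      gain: float = 1.0) -> float:
    """Return a multiplicative adjustment applied to an ad's bid/ranking score."""
    target = budget * elapsed / horizon            # ideal spend under uniform pacing
    error = (target - spent) / max(budget, 1e-9)   # positive if under-delivering
    # Damping behavior: > 1 when under-delivering, < 1 when over-delivering.
    return max(0.0, 1.0 + gain * error)


if __name__ == "__main__":
    # An ad halfway through the day that has already spent 70% of its budget
    # gets its score damped below 1.
    print(pacing_multiplier(spent=70.0, budget=100.0, elapsed=12.0, horizon=24.0))
```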

Although pacing algorithms are essential to platform operations, they create significant interference that can affect the outcomes of seller-side experiments, including both the naive design and the counterfactual interleaving design. In this paper, we clarify these issues using mathematical models and support our explanations with empirical evidence from the real world. Here, we outline our main contributions:

  1. We develop a framework to represent seller-side experiments that are influenced by interference from feedback loops. We specifically consider the naive design and the counterfactual interleaving design when evaluating different types of features.

  2. Using this framework, we illustrate the presence of interference and provide a theoretical estimate of how it biases results.

  3. We evaluate the counterfactual interleaving design in a real-world setting affected by feedback loops. Our analysis reveals that feedback loops can lead to incorrect conclusions about the effect of a treatment. This understanding also helps us introduce a straightforward method for detecting such interference in practice.

The remainder of this paper is structured as follows: Section 2 reviews relevant research on A/B tests under interference, as well as the effects of feedback loops on platform operations. Section 3 provides an overview of the two types of seller-side experiments examined in this paper. In Section 4, we develop mathematical models to investigate the impact of feedback loops on these experiments. Section 5 strengthens our analysis with real-world A/B testing data. Finally, we conclude this paper with future work in Section 6. All the proofs are included in Appendix A.

2 Related Literature

2.1 Interference in Experiments

The presence of interference is a well-documented phenomenon in academic research. Blake and Coey (2014); Holtz et al. (2020); Fradkin (2015) showed empirical evidence that the bias introduced by interference can be as significant as the treatment effect itself.

Many researchers have analyzed interference and proposed new experimental designs in two-sided platforms. Johari et al. (2022) and Bajari et al. (2021) introduced two-sided randomizations, also referred to as multiple randomization designs. By blocking the treatment-control interactions, Ye et al. (2023) proposed a similar yet different two-sided split design. Additionally, Bright et al. (2022) modelled the two-sided marketplaces using a linear programming matching mechanism and developed debiased estimators through shadow prices. The concept of bipartite experiments, where treatments are assigned to one group of units and metrics are measured in another, was presented in works by Eckles et al. (2017); Pouget-Abadie et al. (2019); Harshaw et al. (2023). Furthermore, the application of cluster experiments in marketplaces was demonstrated in studies by Holtz et al. (2020); Holtz and Aral (2020).

For seller-side experiments, Ha-Thuc et al. (2020) and Nandy et al. (2021) introduced a counterfactual interleaving design widely implemented in the industry and Wang and Ba (2023) enhanced the design with a novel tie-breaking rule, as mentioned earlier.

Regarding advertising experiments, Liu et al. (2021) proposed a budget-split design and Si et al. (2022) employed a weighted local linear regression estimation to address imbalances in budget allocation between treatment and control groups.

For interference induced by feedback loops, Si (2023) studied the specific data training loop and proposed a weighted training approach, while Holtz et al. (2023) studied a similar problem, which they refer to as “Symbiosis Bias.” Additionally, Goli et al. (2023) developed a bias-correction technique using data from past A/B tests to tackle such interference. For evaluating bandit learning algorithms, Guo et al. (2023) suggested a two-stage experimental design to assess the treatment effects’ lower and upper bounds. Finally, in the context of search ranking systems, Musgrave et al. (2023) recommended query-randomized experiments to reduce feature spillover effects.

There are also other types of interference studied in the literature, including Markovian interference (Farias et al., 2022, 2023; Glynn et al., 2020; Hu and Wager, 2022; Li et al., 2023), temporal interference (Bojinov et al., 2023; Hu and Wager, 2022; Xiong et al., 2023b, a; Basse et al., 2023; Xiong et al., 2019; Ni et al., 2023), and network interference (Hudgens and Halloran, 2008; Gui et al., 2015; Li and Wager, 2022; Aronow and Samii, 2013; Candogan et al., 2023; Ugander et al., 2013; Ugander and Yin, 2023; Yu et al., 2022).

2.2 Feedback Loops in Recommendation Systems

The presence of feedback loops in recommendation systems has been well-documented in the literature for a long time. Pan et al. (2021) discussed the concept of user feedback loops and strategies for mitigating their biases. Jadidinejad et al. (2020) explored the impact of feedback loops on the underlying models. Additionally, Yang et al. (2023) and Khenissi (2022) identified fairness issues arising from these loops. Furthermore, Chaney et al. (2018), Mansoury et al. (2020), and Krauth et al. (2022) studied how feedback loops amplify homogeneity and popularity biases. In our work, we specifically focus on the impact of feedback loops on seller-side experiments.

3 Seller-Side Experiments

In this section, we briefly describe the platform pipeline and the implementation of seller-side experiments.

Naive seller-side experiments. In a typical naive seller-side experiment, sellers (such as advertisements, creators, or videos) are randomly assigned to either a treatment group or a control group. The treatment group sellers are equipped with a new feature, while the control group sellers continue using the existing production feature. This new feature could be a user interface modification for the seller or a different campaign budget control algorithm. When a user request is received, a score is calculated for each seller, which may be adjusted by the feedback control algorithms. If the treatment and control groups use different feedback control algorithms, the score adjustments will vary between them. Subsequently, the scores from both groups are combined, and the sellers with the highest scores are recommended to the user. The metrics for users or sellers are then recorded. At the end of the experiment, the treatment effect is calculated by comparing the average metrics of the treatment group with those of the control group.
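
The following is a minimal sketch of the naive design just described, assuming synthetic scores and an impression count as the metric; score_control, score_treatment, and k are hypothetical placeholders rather than any production component.

```python
# A minimal sketch of a naive seller-side experiment: random assignment,
# a single merged ranking per request, and a difference of group means.
# score_control / score_treatment are synthetic stand-ins for real models.
import numpy as np

rng = np.random.default_rng(0)
n_sellers, n_requests, k = 1000, 5000, 10      # k recommendation slots per request
group = rng.random(n_sellers) < 0.5            # True = treatment, False = control

def score_control(n):
    return rng.random(n)                       # stand-in for the production model

def score_treatment(n):
    return rng.random(n) + 0.05                # stand-in for the new feature

metric = np.zeros(n_sellers)                   # per-seller observed outcome
for _ in range(n_requests):
    scores = np.where(group, score_treatment(n_sellers), score_control(n_sellers))
    top = np.argsort(-scores)[:k]              # recommend the k highest-scoring sellers
    metric[top] += 1.0                         # e.g., count impressions as the metric

naive_estimate = metric[group].mean() - metric[~group].mean()
print(naive_estimate)
```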

Counterfactual interleaving design. In the counterfactual interleaving design (Ha-Thuc et al., 2020; Nandy et al., 2021), sellers are randomly distributed into three groups: treatment, control, and the other group. The inclusion of the other group is essential for minimizing conflicts between treatment and control rankings. This design is distinct from the naive seller-side experiments in terms of how it computes scores and ranks sellers. Both control and treatment strategies are used to calculate seller scores. These scores generate two separate ranking lists, termed Ranking C (control) and Ranking T (treatment). A combined ranking, Ranking M, is then created, positioning control group sellers according to Ranking C and treatment group sellers according to Ranking T. Ha-Thuc et al. (2020) indicates that conflicts in seller positions are rare when the other group is large, making any tie-breaking rule effective. In contrast, when treatment and control groups are larger, selecting an appropriate tie-breaking rule becomes crucial, as discussed in Nandy et al. (2021); Wang and Ba (2023). After assigning positions to treatment and control sellers, those in the other group fill the remaining slots. This process is illustrated in Figure 2.

Figure 2: Counterfactual interleaving design.
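
As a concrete illustration of the merging step, the following is a minimal Python sketch: it places each experimental seller at its position in its own ranking, resolves position conflicts randomly by shifting the loser down one slot, and fills the remaining slots with sellers from the other group. The data layout (seller ids, an assignment map) is a hypothetical simplification of the production system.

```python
# A minimal sketch of the counterfactual interleaving merge step.
# ranking_c / ranking_t are seller ids ordered best-first under the control and
# treatment strategies; assignment maps seller id -> "T", "C", or "other".
import random

def merge_rankings(ranking_c, ranking_t, assignment, seed=0):
    rng = random.Random(seed)
    n = len(ranking_c)

    # Desired slot of each experimental seller: its position in its own ranking.
    desired = {}
    for pos, s in enumerate(ranking_c):
        if assignment.get(s) == "C":
            desired.setdefault(pos, []).append(s)
    for pos, s in enumerate(ranking_t):
        if assignment.get(s) == "T":
            desired.setdefault(pos, []).append(s)

    merged, pending = [], []
    others = iter(s for s in ranking_c if assignment.get(s) == "other")
    for pos in range(n):
        candidates = desired.get(pos, []) + pending
        rng.shuffle(candidates)
        if candidates:
            merged.append(candidates[0])       # the winner keeps the slot
            pending = candidates[1:]           # losers are shifted down
        else:
            merged.append(next(others, None))  # fill empty slots with "other" sellers
    # Experimental sellers pushed past the end (rare) are appended.
    return [s for s in merged + pending if s is not None][:n]

# Example with six sellers (ids 0-5).
assignment = {0: "T", 1: "C", 2: "other", 3: "other", 4: "T", 5: "other"}
print(merge_rankings(ranking_c=[1, 0, 2, 3, 4, 5],
                     ranking_t=[0, 4, 1, 2, 3, 5],
                     assignment=assignment))
```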

4 A Model for Seller-Side Experiments under Interference Induced by Feedback Loops

When the treatment and control algorithms differ and produce different user-interaction dynamics, the seller-side experiments would be unreliable and significantly biased by the presence of feedback loops. In this section, we develop a foundational analytical model to analyze seller-side experiments under interference induced by feedback loops. In Section 4.1, we describe the recommendation system pipeline and the assumptions we make, excluding the experiment design. Following this, in Section 4.2, we examine seller-side experiments that do not include feedback loops. In contrast, in Section 4.3, we focus on seller-side experiments that incorporate feedback loops.

4.1 Recommendation System Pipeline

We focus on $N$ sellers, denoted by $i=1,2,\ldots,N$. The duration of our experiment is $H$. User requests are received continuously over the interval $[0,H]$. The ranking scores of seller $i$ at time $t$, both before and after adjustments made by feedback control algorithms, are represented by $\hat{e}_i(t)$ and $r_i(t)$, respectively. We use $I_i(t)$ as an indicator to signify whether seller $i$ is recommended for a request at time $t$. The true metric for seller $i$ at time $t$, assuming they are recommended, is denoted by $e_i(t)$. We define $O_i(t)=I_i(t)e_i(t)$ to represent the observed metric at time $t$ for seller $i$.

$S_i(t)$ denotes a one-dimensional state process, subject to the influence of feedback loops. In the advertising context, for example, $S_i(t)$ could indicate the budget consumption at a specific time $t$. In e-commerce, it might refer to the volume of sales at time $t$. Similarly, in a cold start scenario, $S_i(t)$ represents the current number of exposures a video has received. Then, based on the notation defined above, we are able to model the dynamics of a recommendation system’s pipeline:

$O_i(t) = I_i(t)\,e_i(t)$, (1)
$I_i(t) = \mathbb{I}\left\{f_i(\{r_1(t),\ldots,r_N(t)\},\underline{R}(t)) \geq 0\right\}$, (2)
$r_i(t) = \Psi(S_i(t),\hat{e}_i(t))$, (3)
$\frac{dS_i(t)}{dt} = \Gamma\left(S_i(t),e_i(t),r_i(t),I_i(t),t\right)$, (4)

where $\underline{R}(t)$ is some reserve utility for a request at time $t$. In environments where ads and non-ad content are ranked together, such as on video-sharing platforms, $\underline{R}(t)$ could signify the anticipated utility of recommending a non-advertisement video.

To avoid technical issues, the differential equation (4) should be understood as the following integral equation:

$S_i(t) = S_i(0) + \int_0^t \Gamma\left(S_i(s),e_i(s),r_i(s),I_i(s),s\right)ds.$
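
For intuition, the dynamics (1)-(4) can be simulated with a simple Euler discretization. The following minimal sketch assumes the illustrative choices $\Psi(s,\hat{e})=\hat{e}e^{-s}$, $\Gamma=O_i(t)$, constant estimated scores, and a single-slot selection rule that recommends the highest-scoring seller; none of these specific choices come from the paper beyond its general assumptions.

```python
# A minimal Euler-discretized sketch of the dynamics (1)-(4), with the
# illustrative choices Psi(s, e_hat) = e_hat * exp(-s), Gamma = O_i(t), and
# a selection rule that recommends the single highest-scoring seller.
import numpy as np

rng = np.random.default_rng(1)
N, H, dt = 50, 1.0, 0.001
steps = int(H / dt)

e_hat = rng.uniform(0.5, 1.5, N)               # estimated scores (held constant here)
e_true = e_hat * rng.uniform(0.8, 1.2, N)      # true metrics e_i(t)
S = np.zeros(N)                                # state processes S_i(t)
O_integral = np.zeros(N)                       # accumulates the observed metric over [0, H]

for _ in range(steps):
    r = e_hat * np.exp(-S)                     # eq. (3): pacing damps consumed sellers
    i_star = int(np.argmax(r))                 # eq. (2): recommend the top-scoring seller
    O = np.zeros(N)
    O[i_star] = e_true[i_star]                 # eq. (1): observed metric of that seller
    S += O * dt                                # eq. (4) with Gamma = O_i(t)
    O_integral += O * dt

print(O_integral.round(3))
```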

We make the following monotonicity assumptions based on real-world observations and behaviors.

Assumption 1.

We assume

  1. $f_i(\{r_1,\ldots,r_N\},\underline{R})$ is non-increasing with respect to $r_1,\ldots,r_{i-1},r_{i+1},\ldots,r_N$, and $\sum_{i\in\mathcal{I}} f_i(\{r_1,\ldots,r_N\},\underline{R})$ is non-decreasing with respect to $r_i$, provided that $i\in\mathcal{I}$.

  2. $\Psi(s,\hat{e})$ is non-increasing with respect to $s$.

  3. $\Gamma(\cdot,\cdot,\cdot,\cdot,\cdot) \geq 0$.

  4. $\Psi(s,\hat{e})$ is non-decreasing with respect to $\hat{e}$.

Assumption 1 is clarified as follows: Assumption 1.(1) corresponds to the fact that the recommendation system selects the highest scores. For example, if one item is recommended, then the system might use the formula

$f_i(\{r_1,\ldots,r_N\},\underline{R}) = r_i - \max(r_1,\ldots,r_{i-1},r_{i+1},\ldots,r_N,\underline{R}).$ (5)

This formula selects the single highest value among the scores. Assumptions 1.(2) and (3) are based on the fact that $S_i(\cdot)$ represents current consumption. Consequently, $S_i(\cdot)$ is inherently non-decreasing. Furthermore, pacing algorithms typically exhibit a damping effect; they reduce the ranking of an item when its current consumption is high, and conversely increase it when consumption is low. Finally, Assumption 1.(4) implies that a higher estimated score prior to adjustment leads to a higher ranking score.

We plot the causal graph (Pearl, 2000) in Figure 3 to illustrate the dependence in the feedback loops.

[Figure 3 diagram: $S_i(t)$ and $\hat{e}_i(t)$ determine $r_i(t)$; together with $r_1(t),\ldots,r_N(t)$, they determine $I_i(t)$, which yields $O_i(t)$ and updates the state $S_i(t+)$.]
Figure 3: Dependence of different objects in the feedback loops.

Our goal is to estimate the Global Treatment Effect (GTE), defined as the difference in metrics observed under global treatment and global control conditions. We describe this process as follows: within the global treatment condition, the procedure is outlined as

$O_i^{GT}(t) = I_i^{GT}(t)\,e_i^{GT}(t)$, (6)
$I_i^{GT}(t) = \mathbb{I}\left\{f_i(\{r_1^{GT}(t),\ldots,r_N^{GT}(t)\},\underline{R}(t)) \geq 0\right\}$, (7)
$r_i^{GT}(t) = \Psi^T(S_i^{GT}(t),\hat{e}_i^{GT}(t))$, (8)
$\frac{dS_i^{GT}(t)}{dt} = \Gamma\left(S_i^{GT}(t),e_i^{GT}(t),r_i^{GT}(t),I_i^{GT}(t),t\right)$. (9)

Similarly, within the global control regime, we have:

$O_i^{GC}(t) = I_i^{GC}(t)\,e_i^{GC}(t)$, (10)
$I_i^{GC}(t) = \mathbb{I}\left\{f_i(\{r_1^{GC}(t),\ldots,r_N^{GC}(t)\},\underline{R}(t)) \geq 0\right\}$, (11)
$r_i^{GC}(t) = \Psi^C(S_i^{GC}(t),\hat{e}_i^{GC}(t))$, (12)
$\frac{dS_i^{GC}(t)}{dt} = \Gamma\left(S_i^{GC}(t),e_i^{GC}(t),r_i^{GC}(t),I_i^{GC}(t),t\right)$. (13)

Here, we assume $S_i^{GC}(0)=S_i^{GT}(0)$ and $r_i^{GC}(0)=r_i^{GT}(0)$, for $i=1,2,\ldots,N$. The GTE is thereby defined as

$\mathrm{GTE} = \frac{1}{N}\sum_{i=1}^N \int_0^H \left(O_i^{GT}(t) - O_i^{GC}(t)\right)dt.$ (14)

4.2 Seller-Side Experiments without Feedback Loops

We consider an experiment that randomly assigns a seller to the treatment group with probability $p$ and to the control group with probability $1-p$. We use $\mathcal{T}$ and $\mathcal{C}$ to denote the treatment and control groups, respectively: $\mathcal{T}\cap\mathcal{C}=\emptyset$ and $\mathcal{T}\cup\mathcal{C}=\{1,2,\ldots,N\}$. We do not specifically consider the other group since it could be absorbed into the reserve utility $\underline{R}(\cdot)$, and we ignore position conflicts in our model. However, incorporating the other group would not change any results we shall present.

We will consider two types of treatment features: 1) improving item performance $e_i(t)$, for example, enhancing a seller’s content and product descriptions to increase customers’ purchase likelihood; and 2) testing the ranking algorithm $r_i(t)$, such as promoting certain sellers or making the estimated $\hat{e}_i(t)$ align more closely with the true metric $e_i(t)$.

4.2.1 Naive seller-side experiments

We first write down the dynamics for a naive seller-side experiment without interference from feedback loops:

$O_i^E(t) = I_i^E(t)\,e_i^E(t)$, (15)
$I_i^E(t) = \mathbb{I}\left\{f_i^{NE}(t) \geq 0\right\}$, (16)
$r_i^E(t) = \hat{e}_i^E(t)$, (17)

where $E\in\{C,T\}$ indicates the control or treatment assignment, with $E=T$ for members of the treatment group $i\in\mathcal{T}$ and $E=C$ for those in the control group $i\in\mathcal{C}$, and the threshold function is defined as

$f_i^{NE}(t) = f_i\left(\left\{r_i^T(t), i\in\mathcal{T}\right\} \cup \left\{r_i^C(t), i\in\mathcal{C}\right\}, \underline{R}(t)\right).$ (18)

The GTE estimator is defined as

$\widehat{\mathrm{GTE}} = \frac{1}{Np}\sum_{i\in\mathcal{T}}\int_0^H O_i^T(t)\,dt - \frac{1}{N(1-p)}\sum_{i\in\mathcal{C}}\int_0^H O_i^C(t)\,dt.$ (19)

In the dynamics (15)-(18), we simplify the process by excluding the state $S_i(\cdot)$ and directly equating the ranking score $r_i(t)$ with the estimated score $\hat{e}_i(t)$ from the ML models. Additionally, the threshold function $f_i^{NE}(\cdot)$ (where NE stands for Naive seller-side Experiments) simply aggregates the ranking scores from both the treatment and control groups.
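
For concreteness, the following is a minimal sketch of the estimator (19) on synthetic per-seller outcomes: the integrated outcomes are weighted by the inverse assignment probabilities $1/(Np)$ and $1/(N(1-p))$. The synthetic data generation is purely illustrative.

```python
# A minimal sketch of the estimator (19): integrated per-seller outcomes are
# inverse-probability weighted by the assignment probability p.
# The gamma-distributed outcomes are synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(2)
N, p = 1000, 0.5
treated = rng.random(N) < p                   # i in T with probability p
Y = rng.gamma(2.0, 1.0, N)                    # stands in for \int_0^H O_i^E(t) dt

gte_hat = (Y * treated).sum() / (N * p) - (Y * ~treated).sum() / (N * (1 - p))
print(gte_hat)
```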

Case 1: Testing item performance $e_i(t)$. In this case, we assume $\hat{e}_i^C(t)=\hat{e}_i^T(t)$ for $i=1,2,\ldots,N$ and $t\in[0,H]$. Proposition 1 below shows that the naive seller-side experiments are unbiased in this case without feedback loops.

Proposition 1.

Suppose that $\hat{e}_i^C(t)=\hat{e}_i^T(t)$ for $i=1,2,\ldots,N$ and $t\in[0,H]$. Then, the GTE estimator (19) is unbiased for the true GTE (14), i.e., $\mathbb{E}[\widehat{\mathrm{GTE}}]=\mathrm{GTE}$.

Proposition 1 suggests that if experimenters believe the treatment affects only the true item performance without impacting the rankings, and the effects from feedback loops are negligible, then using the naive design is sufficient.

Case 2: Testing the ranking algorithm $r_i(t)$, where we assume $\hat{e}_i^C(t)\neq\hat{e}_i^T(t)$ in general. In this case, the naive design is biased because $I_i^T(t)\neq I_i^{GT}(t)$ and $I_i^C(t)\neq I_i^{GC}(t)$ in general, which leads to different items being recommended in the experimental versus the global treatment/control regimes. For instance, consider boosting specific sellers, represented by $\hat{e}_i^T(t)>\hat{e}_i^C(t)$. Then, sellers under treatment would generally be ranked higher, leading to cannibalization effects. As a result, the estimator tends to overestimate the true GTE, that is,

$\mathbb{E}[\widehat{\mathrm{GTE}}] \geq \mathrm{GTE}.$

This phenomenon aligns with the “cold start” example discussed in the Introduction. However, we will see that the counterfactual interleaving design can address this issue, provided that the system is not influenced by feedback loops.

4.2.2 Counterfactual interleaving design.

The counterfactual interleaving design differs from the naive design solely in the dynamics of the recommendation indicator $I$, defined in equation (16). Here, $I$ follows

$I_i^T(t) = \mathbb{I}\left\{f_i^{CE,T}(t) \geq 0\right\}$, for $i\in\mathcal{T}$, and (20)
$I_i^C(t) = \mathbb{I}\left\{f_i^{CE,C}(t) \geq 0\right\}$, for $i\in\mathcal{C}$, (21)

where

$f_i^{CE,T}(t) = f_i\left(\left\{r_i^T(t), i=1,2,\ldots,N\right\}, \underline{R}(t)\right)$ and (22)
$f_i^{CE,C}(t) = f_i\left(\left\{r_i^C(t), i=1,2,\ldots,N\right\}, \underline{R}(t)\right).$ (23)

Here, CE stands for Counterfactual interleaving Experiments. To compute $f_i^{CE,T}(t)$ and $f_i^{CE,C}(t)$, it is necessary to maintain two versions of ranking scores for all sellers, one under the treatment strategy and one under the control strategy:

$r_i^T(t) = \hat{e}_i^T(t)$ and $r_i^C(t) = \hat{e}_i^C(t),$ (24)

for all $i=1,2,\ldots,N$ and $t\in[0,H]$.

Remark 1.

In our continuous model, we can assume there are no position conflicts almost everywhere. Therefore, in dynamics (20) and (21), it is safe to ignore the conflicts.

Proposition 2 below demonstrates that the counterfactual interleaving design is unbiased for both types of treatment features, i.e., updating $e_i(t)$ and updating $\hat{e}_i(t)$.

Table 1: Comparing the bias in naive seller-side experiments and the counterfactual interleaving design for different types of treatment features in the absence of feedback loops in the system.
                                Naive seller-side experiments   Counterfactual interleaving design
Testing item performance        unbiased                        unbiased
Testing the ranking algorithm   biased                          unbiased
Proposition 2.

Suppose that the experiments follow the dynamics (15), (20)-(24), where we allow $e_i^T(t)\neq e_i^C(t)$ and $\hat{e}_i^T(t)\neq\hat{e}_i^C(t)$. Then, the GTE estimator (19) is unbiased for the true GTE (14), i.e., $\mathbb{E}[\widehat{\mathrm{GTE}}]=\mathrm{GTE}$.

Table 1 summarizes all four cases discussed in this subsection, clearly indicating that interference is not severe in the absence of feedback loops in the system and that counterfactual interleaving designs can provide unbiased estimations.

4.3 Seller-Side Experiments with Feedback Loops

In this subsection, we incorporate the state process (4) into our model and evaluate the biases of naive seller-side experiments and the counterfactual interleaving design for testing different types of treatment features. For ease of exposition, we assume

$\Gamma\left(S_i(t),e_i(t),r_i(t),I_i(t),t\right) = O_i(t) = I_i(t)e_i(t).$ (25)

Note that the e-commerce and cold start examples fit this assumption well. In the e-commerce scenario, $e_i(t)$ could represent a customer’s decision to purchase an item from seller $i$ at time $t$, and $S_i(t)$ would denote the cumulative sales of seller $i$ up to time $t$. In the cold start example, $e_i(t)$ might model a single view of a new video $i$, with $S_i(t)$ indicating the total exposure of seller $i$ up to time $t$. Furthermore, in the advertising example, the state process dynamics slightly differ from the assumption (25), as $dS_i(t)/dt$ is usually the payment for that auction, while $e_i(t)$ involves metrics like conversions, installs, or clicks. Nevertheless, as we shall see in Section 5, similar results still hold. In the remainder of this section, we use the sales rate control in e-commerce as a running example.

4.3.1 Naive seller-side experiments

We rewrite the dynamics for a naive seller-side experiment with interference from feedback loops:

$O_i^E(t) = I_i^E(t)\,e_i^E(t)$, (26)
$I_i^E(t) = \mathbb{I}\left\{f_i^{NE}(t) \geq 0\right\}$, (27)
$r_i^E(t) = \Psi^E(S_i^E(t),\hat{e}_i^E(t))$, (28)
$\frac{dS_i^E(t)}{dt} = O_i^E(t)$, (29)

where $f_i^{NE}(t)$ is the same function defined in Equation (18).

Case 1: Testing item performance $e_i(t)$. As usual, we assume $\hat{e}_i^C(t)=\hat{e}_i^T(t)$ for $i=1,\ldots,N$, $t\in[0,H]$, and $\Psi^T(\cdot,\cdot)=\Psi^C(\cdot,\cdot)$. Furthermore, for simplicity, we assume a uniform treatment effect:

Assumption 2.

$e_i^T(t) \geq e_i^C(t)$, for $i=1,\ldots,N$, $t\in[0,H]$.

The following technical assumption is used in the proof.

Assumption 3.

$e_i(\cdot),\hat{e}_i(\cdot),f_i(\cdot),\Psi$ are continuous, and the events $f_i^{NE}(t)=0$, $f_i^{GT}(t)=0$, $f_i^{GC}(t)=0$ only occur finitely many times, for $i=1,\ldots,N$, i.e., the following set contains finitely many elements:

$\left\{t: f_i^{NE}(t)=0 \text{ or } f_i^{GT}(t)=0 \text{ or } f_i^{GC}(t)=0 \text{ for some } i\right\}.$

Then, Theorem 1 characterizes the bias direction in this case.

Theorem 1.

Suppose Assumptions 1, 2, and 3 hold, $\hat{e}_i^C(t)=\hat{e}_i^T(t)$ for $i=1,2,\ldots,N$ and $t\in[0,H]$, and $\Psi^T(\cdot,\cdot)=\Psi^C(\cdot,\cdot)$. We consider a naive seller-side experiment following the dynamics (26)-(29). Then, we have

$\int_0^H O_i^C(t)\,dt \geq \int_0^H O_i^{GC}(t)\,dt$ for $i\in\mathcal{C}$, (30)
$\int_0^H O_i^T(t)\,dt \leq \int_0^H O_i^{GT}(t)\,dt$ for $i\in\mathcal{T}$, (31)

which means the estimator underestimates the true effects, i.e.,

$\mathbb{E}[\widehat{\mathrm{GTE}}] \leq \mathrm{GTE}.$

The insight behind these results can be explained intuitively as follows. Due to the damping effect inherent in pacing algorithms, pacing slows down delivery when current consumption is already high. If a treatment feature enhances performance, resulting in higher consumption for treatment sellers, then the ranking scores for these sellers, after being adjusted by the pacing algorithm, will be lower. Consequently, this adjustment leads to an underestimation of the treatment effects.
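
The following minimal simulation illustrates this mechanism under assumed forms (a damping rule $\Psi(s,\hat{e})=\hat{e}e^{-s}$, top-$k$ selection, and a uniform multiplicative lift on $e_i$); it compares the naive experimental estimate with the GTE computed from separate global-treatment and global-control runs. Under this damping one would expect the printed naive estimate to fall below the GTE, consistent with Theorem 1, although the exact numbers depend on the assumed parameters.

```python
# A minimal simulation of Case 1 under feedback loops: the treatment lifts the
# realized metric e_i(t) only, while a damping pacing rule Psi(s, e_hat) =
# e_hat * exp(-s) adjusts the scores. All modeling choices are illustrative.
import numpy as np

rng = np.random.default_rng(3)
N, H, dt, k, lift, p = 200, 1.0, 0.002, 10, 1.3, 0.5
steps = int(H / dt)
e_hat = rng.uniform(0.5, 1.5, N)             # shared estimated scores (Case 1)
e_base = e_hat.copy()                        # control-arm realized metric e_i^C
treated = rng.random(N) < p                  # experimental assignment

def run(e_realized):
    """Integrate the dynamics (26)-(29) and return the per-seller integral of O_i over [0, H]."""
    S, acc = np.zeros(N), np.zeros(N)
    for _ in range(steps):
        r = e_hat * np.exp(-S)               # pacing damps heavily-consumed sellers
        top = np.argpartition(-r, k)[:k]     # recommend the k highest-scoring sellers
        O = np.zeros(N)
        O[top] = e_realized[top]
        S += O * dt
        acc += O * dt
    return acc

gt = run(e_base * lift)                                    # global treatment
gc = run(e_base)                                           # global control
exp_arm = run(np.where(treated, e_base * lift, e_base))    # naive experiment

gte = np.mean(gt - gc)
gte_hat = exp_arm[treated].sum() / (N * p) - exp_arm[~treated].sum() / (N * (1 - p))
print(f"GTE = {gte:.4f}, naive estimate = {gte_hat:.4f}")
```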

Table 2: Comparing the bias in naive seller-side experiments and counterfactual interleaving design for different types of treatment features in the presence of feedback loops in the system. Suppose GTE is positive and a pacing algorithm is used.
                                Naive seller-side experiments   Counterfactual interleaving design
Testing item performance        underestimate                   underestimate
Testing the ranking algorithm   unclear                         underestimate

Case 2: Testing the ranking algorithm $r_i(t)$. In this case, let us consider a scenario where $\Psi^T(s,\hat{e}_i^T(t)) \neq \Psi^C(s,\hat{e}_i^C(t))$. This could represent two possible situations:

  1. Altering the feedback loop dependence. For instance, if a platform wishes to experiment with a quicker sales speed for specific categories, it implies that sellers, at any level of cumulative sales, would receive a higher ranking score. Mathematically, this is depicted as

     $\Psi^T(s,\hat{e}) \geq \Psi^C(s,\hat{e})$ and $\hat{e}_i^T(t) = \hat{e}_i^C(t).$

  2. Modifying the estimation score. This could mean the platform aims to boost certain sellers to achieve specific campaign objectives. Mathematically, this is written as

     $\Psi^T(s,\hat{e}) = \Psi^C(s,\hat{e})$ and $\hat{e}_i^T(t) \geq \hat{e}_i^C(t).$

     Furthermore, the platform might develop new machine learning models to ensure that the estimated $\hat{e}_i(t)$ aligns more accurately with the actual metric $e_i(t)$. In this case,

     $\Psi^T(s,\hat{e}) = \Psi^C(s,\hat{e})$, $\hat{e}_i^T(t) \neq \hat{e}_i^C(t)$, and $\sum_{i=1}^N O_i^{GT}(t) \geq \sum_{i=1}^N O_i^{GC}(t).$

In this scenario, the direction of bias is unclear. Let us consider

$\Psi^T(s,\hat{e}_i^T(t)) \geq \Psi^C(s,\hat{e}_i^C(t)).$

On the one hand, as discussed in Section 4.2, due to higher initial ranking scores before adjustment, sellers under treatment would generally receive higher rankings, leading to cannibalization effects. On the other hand, the damping effect present in pacing algorithms results in lower ranking scores, which conversely causes the ranking of treatment sellers to drop.

4.3.2 Counterfactual interleaving design.

The dynamics of the counterfactual interleaving design in the presence of feedback loops are described below.

$O_i^E(t) = I_i^E(t)\,e_i^E(t)$, (32)
$I_i^T(t) = \mathbb{I}\left\{f_i^{CE,T}(t) \geq 0\right\}$, for $i\in\mathcal{T}$, and (33)
$I_i^C(t) = \mathbb{I}\left\{f_i^{CE,C}(t) \geq 0\right\}$, for $i\in\mathcal{C}$, (34)
$r_i^T(t) = \Psi^T(S_i^E(t),\hat{e}_i^T(t))$ and $r_i^C(t) = \Psi^C(S_i^E(t),\hat{e}_i^C(t))$, (35)
$\frac{dS_i^E(t)}{dt} = O_i^E(t)$, (36)

where $f_i^{CE,T}(t)$ and $f_i^{CE,C}(t)$ are defined in (22) and (23).

Case 1: Testing item performance $e_i(t)$. As above, we assume $\hat{e}_i^C(t)=\hat{e}_i^T(t)$ for $i=1,2,\ldots,N$ and $t\in[0,H]$.

Theorem 2.

Suppose Assumptions 1, 2, and 3 (with $f^{NE}(\cdot)$ replaced by $f^{CE,T}(\cdot)$ and $f^{CE,C}(\cdot)$) hold, $\hat{e}_i^C(t)=\hat{e}_i^T(t)$ for $i=1,2,\ldots,N$ and $t\in[0,H]$, and $\Psi^T(\cdot,\cdot)=\Psi^C(\cdot,\cdot)$. We consider a counterfactual interleaving design following the dynamics (32)-(36). Then, we have

$\int_0^H O_i^C(t)\,dt \geq \int_0^H O_i^{GC}(t)\,dt$ for $i\in\mathcal{C}$, (37)
$\int_0^H O_i^T(t)\,dt \leq \int_0^H O_i^{GT}(t)\,dt$ for $i\in\mathcal{T}$, (38)

which means the estimator underestimates the true effects, i.e.,

$\mathbb{E}[\widehat{\mathrm{GTE}}] \leq \mathrm{GTE}.$

The proof follows from Theorem 1 by noting that $r_i^T(t)=r_i^C(t)$ and $f_i^{CE,T}(\cdot)=f_i^{CE,C}(\cdot)=f_i^{NE}(\cdot)$.

Case 2: Testing the ranking algorithm $r_i(t)$. For ease of exposition, we assume the following uniform treatment effect on the ranking function.

Assumption 4.

We assume there exists $\alpha>1$ such that

$\Psi^T(s,\hat{e}_i^T(t)) = \alpha\,\Psi^C(s,\hat{e}_i^C(t))$, and
$f_i(\alpha r_1,\ldots,\alpha r_N) \geq f_i(r_1,\ldots,r_N)$ if $f_i(r_1,\ldots,r_N) \geq 0$,

for $s\geq 0$, $i=1,2,\ldots,N$, and $t\in[0,H]$.

Assumption 4 guarantees that an item recommended by the control algorithm would also be recommended by the treatment algorithm. Function (5) meets this requirement. The following theorem states that the estimator obtained from counterfactual interleaving design will also underestimate the true GTE when testing the ranking algorithm in the presence of feedback loops.

Theorem 3.

Suppose Assumptions 1, 3 (with $f^{NE}(\cdot)$ replaced by $f^{CE,T}(\cdot)$ and $f^{CE,C}(\cdot)$), and 4 hold, and $e_i^C(t)=e_i^T(t)$ for $i=1,2,\ldots,N$ and $t\in[0,H]$. We consider a counterfactual interleaving design following the dynamics (32)-(36). Then, we have

$\int_0^H O_i^C(t)\,dt \geq \int_0^H O_i^{GC}(t)\,dt$ for $i\in\mathcal{C}$, (39)
$\int_0^H O_i^T(t)\,dt \leq \int_0^H O_i^{GT}(t)\,dt$ for $i\in\mathcal{T}$, (40)

which means the estimator underestimates the true effects, i.e.,

$\mathbb{E}[\widehat{\mathrm{GTE}}] \leq \mathrm{GTE}.$

Let’s explain these results in an intuitive manner. Consider a scenario where the treatment algorithms employ a more rapid pacing speed. This implies that at any given time, the consumption of inventories under the treatment algorithms would typically exceed that of the control algorithms. In the counterfactual interleaving design, this would mean that treatment sellers consume more than their control counterparts on average. When the treatment algorithm is employed across all items to derive Ranking T, treatment items, on average, tend to rank lower compared to their positions in the global treatment setting. This is primarily due to the damping effect inherent in pacing algorithms. As a result, the treatment ranking realized in the experiments may not accurately reflect the ranking in the global treatment regime.

Table 2 summarizes all four cases discussed in this subsection. It is evident that interference is always present when the system is affected by feedback loops. Additionally, the counterfactual interleaving design typically underestimates the true effect if a pacing algorithm with damping effects is employed.

Detecting interference induced by feedback loops. Building on these insights, we propose a simple method to detect the interference induced by feedback loops in the counterfactual interleaving design. We compare the average rankings of treatment sellers and control sellers when both are scored under the treatment algorithm, i.e., comparing $\{r_i^T, i\in\mathcal{T}\}$ with $\{r_i^T, i\in\mathcal{C}\}$. If these averages are similar, then the interference is likely not significant. However, if there is a noticeable deviation as the experiment progresses, the interference may render the experiment invalid. Similarly, we can also compare the average rankings of the treatment sellers and control sellers under the control algorithm, i.e., comparing $\{r_i^C, i\in\mathcal{T}\}$ with $\{r_i^C, i\in\mathcal{C}\}$. We shall implement this method in our empirical study.
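
A minimal sketch of this diagnostic is given below: it tracks, hour by hour, the average treatment-strategy rank of treatment sellers versus control sellers; a widening gap over time signals interference. The record format (hour, group, rank_under_T) is a hypothetical simplification of real experiment logs.

```python
# A minimal sketch of the detection method: compare, over time, the average rank
# of treatment sellers and control sellers when both are scored by the same
# (here, treatment) strategy. The record format is a hypothetical simplification.
from collections import defaultdict
from statistics import mean

def average_rank_by_group(log_records):
    """log_records: iterable of dicts with keys 'hour', 'group' ('T' or 'C'),
    and 'rank_under_T' (the seller's rank under the treatment strategy)."""
    buckets = defaultdict(list)
    for rec in log_records:
        buckets[(rec["hour"], rec["group"])].append(rec["rank_under_T"])
    hours = sorted({h for h, _ in buckets})
    return [(h,
             mean(buckets.get((h, "T"), [float("nan")])),
             mean(buckets.get((h, "C"), [float("nan")]))) for h in hours]

# Synthetic example in which treatment sellers drift toward worse (larger) ranks.
records = [{"hour": h, "group": g, "rank_under_T": 50 + (5 * h if g == "T" else 0) + i}
           for h in range(4) for g in ("T", "C") for i in range(3)]
for hour, avg_t, avg_c in average_rank_by_group(records):
    print(f"hour {hour}: avg rank under T, treatment = {avg_t:.1f}, control = {avg_c:.1f}")
```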

5 Empirical Results based on Real-World A/B tests

We partnered with the advertising recommendation team at Tencent, a world-class content-sharing platform. In advertising recommendation, it is commonly observed that the estimated scores overestimate the real effect, likely due to maximization bias, especially for ads with few impressions (Fan et al., 2022). To address this bias, Tencent implemented the following strategy: at any time $t$, if the cumulative realized value (clicks, conversions, etc.) up to that moment surpasses the cumulative estimated value, the current estimated scores are adjusted upward. Conversely, if the overall realized value falls short of the cumulative estimated value, the scores are decremented.

This type of feedback loop differs slightly from the pacing algorithm model discussed in this paper. However, the results and insights derived should be similar. In mathematical terms, $e_i(t)$ is the raw estimated score for the $i$-th ad in relation to a request at time $t$. Additionally, $s_i(\cdot)$ is the state process that captures the cumulative overestimation (or underestimation) for the $i$-th ad up until time $t$, where

$s_i(t) = \left(\int_0^t r_i(s)\,ds\right) \Big/ \left(\int_0^t e_i(s)\,ds\right).$

Given these, the ranking score can be defined as $r_i(t) = \lambda_i(s_i(t))\,e_i(t)$, for $i=1,\ldots,N$, where $\lambda_i(s_i(t))$ represents the adjustment factor. At the beginning of a day, $\lambda$ is initialized at 1. Further, $\lambda_i(s)$ decreases as $s$ increases, and $\lambda_i(s_i(t))<1$ adjusts for overestimation ($s_i(t)>1$), while $\lambda_i(s_i(t))>1$ adjusts for underestimation ($s_i(t)<1$).

Our empirical observations reveal a challenge with this adjusting mechanism: the values of the $\lambda_i$’s tend to fluctuate significantly due to the inherent randomness in realized values. Such volatility can negatively impact overall performance. To address this, we devised a new strategy that constrains the variability of $\lambda$, effectively reducing its swing or “effective range.”

In our experiments, we contrast this new strategy with the original approach using the counterfactual interleaving design. More precisely, we compare the treatment adjustment method $\lambda^T(\cdot)$ with the control adjustment method $\lambda^C(\cdot)$, where we allocate 10% of ads to the control group and 10% to the treatment group. Specifically, the platform ranks and charges based on the scores

$r_i^T(t) = \lambda_i^T(s_i(t))\,e_i(t)$, and $r_i^C(t) = \lambda_i^C(s_i(t))\,e_i(t),$

for $i=1,2,\ldots,N$, and $|\lambda_i^T(s)-1| < |\lambda_i^C(s)-1|$ for any $s\in\mathbb{R}_+$.
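
The following minimal sketch illustrates an adjustment factor with the stated properties ($\lambda$ decreasing in $s$, below 1 for $s>1$, above 1 for $s<1$) together with a constrained treatment variant whose swing around 1 is clipped. The reciprocal functional form and the clipping range are illustrative assumptions, not Tencent's implementation.

```python
# A minimal sketch of the adjustment factors: lambda^C reacts fully to the
# over/under-estimation ratio s, while lambda^T clips the swing around 1.
# The reciprocal form and the max_swing value are illustrative assumptions only.

def lambda_control(s: float) -> float:
    """Full adjustment: < 1 for s > 1 (overestimation), > 1 for s < 1 (underestimation)."""
    return 1.0 / max(s, 1e-6)

def lambda_treatment(s: float, max_swing: float = 0.2) -> float:
    """Same signal, but clipped to [1 - max_swing, 1 + max_swing], so that
    |lambda_treatment(s) - 1| <= |lambda_control(s) - 1| for every s."""
    return min(max(lambda_control(s), 1.0 - max_swing), 1.0 + max_swing)

for s in (0.5, 0.9, 1.0, 1.2, 2.0):
    print(f"s = {s}: lambda^C = {lambda_control(s):.3f}, lambda^T = {lambda_treatment(s):.3f}")
```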

Despite simulations and A/A tests consistently indicating the superiority of the treatment strategy over the control strategy, the counterfactual interleaving design suggests otherwise: Table 3 shows that the estimates from the counterfactual interleaving experiment suggest strongly negative effects.

We attribute this discrepancy to interference arising from feedback loops. Given the systematic overestimation and our specific treatment strategy, it is anticipated that $\lambda_i^T(s_i(t-))$ would typically exceed $\lambda_i^C(s_i(t-))$. Consequently, $s_i^T(t-) > s_i^C(t-)$ in general. Due to the inverse monotonic behavior of $\lambda(\cdot)$, this means that $\lambda_i^T(s_i^T(t-)) < \lambda_i^T(s_i^C(t-))$ and $\lambda_i^C(s_i^T(t-)) < \lambda_i^C(s_i^C(t-))$, causing the treatment ads to frequently rank lower, while control ads often rank higher in the experiment. To bolster our rationale, we visualize the average $\lambda^T$ values under the treatment strategy, $\bar{\lambda}^T$, for the control ads, treatment ads, and other ads over time in Figure 4. The figure shows that while the $\bar{\lambda}^T$ values are nearly identical across the three groups initially, the $\bar{\lambda}^T$ values in the treatment group noticeably diminish towards the day’s end.

Table 3: The experimental results using the counterfactual interleaving design.
                                  Estimator   Confidence Interval
Advertising cost (consumption)    -23%        [-34%, -12%]
Views                             -27%        [-38%, -15%]
Gross merchandise value (GMV)     -21%        [-34%, -9%]
Figure 4: Average $\lambda^T$ values, $\bar{\lambda}^T$, for the control ads, treatment ads, and other ads over time in a day.

6 Conclusion

In this paper, we develop a framework to study seller-side experiments affected by interference from feedback loops. Our findings indicate that counterfactual interleaving designs tend to underestimate the true treatment effects when a damping pacing algorithm is employed. Additionally, we introduce a method for detecting interference by comparing the rankings of treatment and control groups under the same treatment/control algorithms. As a direction for future research, it would be valuable to explore how experiments can be conducted optimally in scenarios with feedback loops.

References

  • Aronow and Samii (2013) Peter M Aronow and Cyrus Samii. 2013. Estimating Average Causal Effects Under General Interference, with Application to a Social Network Experiment. arXiv preprint arXiv:1305.6156 (2013).
  • Bajari et al. (2021) Patrick Bajari, Brian Burdick, Guido W Imbens, Lorenzo Masoero, James McQueen, Thomas Richardson, and Ido M Rosen. 2021. Multiple randomization designs. arXiv preprint arXiv:2112.13495 (2021).
  • Basse et al. (2023) Guillaume W Basse, Yi Ding, and Panos Toulis. 2023. Minimax designs for causal effects in temporal experiments with treatment habituation. Biometrika 110, 1 (2023), 155–168.
  • Blake and Coey (2014) Thomas Blake and Dominic Coey. 2014. Why marketplace experimentation is harder than it seems: The role of test-control interference. In Proceedings of the fifteenth ACM conference on Economics and computation. 567–582.
  • Bojinov et al. (2023) Iavor Bojinov, David Simchi-Levi, and Jinglong Zhao. 2023. Design and analysis of switchback experiments. Management Science 69, 7 (2023), 3759–3777.
  • Bright et al. (2022) Ido Bright, Arthur Delarue, and Ilan Lobel. 2022. Reducing marketplace interference bias via shadow prices. arXiv preprint arXiv:2205.02274 (2022).
  • Candogan et al. (2023) Ozan Candogan, Chen Chen, and Rad Niazadeh. 2023. Correlated cluster-based randomized experiments: Robust variance minimization. Management Science (2023).
  • Chaney et al. (2018) Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. 2018. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM conference on recommender systems. 224–232.
  • Eckles et al. (2017) Dean Eckles, Brian Karrer, and Johan Ugander. 2017. Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference 5, 1 (2017).
  • Fan et al. (2022) Yewen Fan, Nian Si, and Kun Zhang. 2022. Calibration Matters: Tackling Maximization Bias in Large-scale Advertising Recommendation Systems. arXiv preprint arXiv:2205.09809 (2022).
  • Farias et al. (2022) Vivek Farias, Andrew Li, Tianyi Peng, and Andrew Zheng. 2022. Markovian interference in experiments. Advances in Neural Information Processing Systems 35 (2022), 535–549.
  • Farias et al. (2023) Vivek Farias, Hao Li, Tianyi Peng, Xinyuyang Ren, Huawei Zhang, and Andrew Zheng. 2023. Correcting for Interference in Experiments: A Case Study at Douyin. In Proceedings of the 17th ACM Conference on Recommender Systems. 455–466.
  • Fradkin (2015) Andrey Fradkin. 2015. Search frictions and the design of online marketplaces. In The Third Conference on Auctions, Market Mechanisms and Their Applications.
  • Glynn et al. (2020) Peter W Glynn, Ramesh Johari, and Mohammad Rasouli. 2020. Adaptive experimental design with temporal interference: A maximum likelihood approach. Advances in Neural Information Processing Systems 33 (2020), 15054–15064.
  • Goli et al. (2023) Ali Goli, Anja Lambrecht, and Hema Yoganarasimhan. 2023. A bias correction approach for interference in ranking experiments. Marketing Science (2023).
  • Gui et al. (2015) Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network a/b testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web. 399–409.
  • Guo et al. (2023) Hongbo Guo, Ruben Naeff, Alex Nikulkov, and Zheqing Zhu. 2023. Evaluating Online Bandit Exploration In Large-Scale Recommender System. In KDD-23 Workshop on Multi-Armed Bandits and Reinforcement Learning: Advancing Decision Making in E-Commerce and Beyond.
  • Ha-Thuc et al. (2020) Viet Ha-Thuc, Avishek Dutta, Ren Mao, Matthew Wood, and Yunli Liu. 2020. A counterfactual framework for seller-side a/b testing on marketplaces. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2288–2296.
  • Harshaw et al. (2023) Christopher Harshaw, Fredrik Sävje, David Eisenstat, Vahab Mirrokni, and Jean Pouget-Abadie. 2023. Design and analysis of bipartite experiments under a linear exposure-response model. Electronic Journal of Statistics 17, 1 (2023), 464–518.
  • Holtz and Aral (2020) David Holtz and Sinan Aral. 2020. Limiting bias from test-control interference in online marketplace experiments. arXiv preprint arXiv:2004.12162 (2020).
  • Holtz et al. (2023) David Holtz, Jennifer Brennan, and Jean Pouget-Abadie. 2023. A Study of “Symbiosis Bias” in A/B Tests of Recommendation Algorithms. arXiv preprint arXiv:2309.07107 (2023).
  • Holtz et al. (2020) David Holtz, Ruben Lobel, Inessa Liskovich, and Sinan Aral. 2020. Reducing interference bias in online marketplace pricing experiments. arXiv preprint arXiv:2004.12489 (2020).
  • Hu and Wager (2022) Yuchen Hu and Stefan Wager. 2022. Switchback experiments under geometric mixing. arXiv preprint arXiv:2209.00197 (2022).
  • Hudgens and Halloran (2008) Michael G Hudgens and M Elizabeth Halloran. 2008. Toward causal inference with interference. J. Amer. Statist. Assoc. 103, 482 (2008), 832–842.
  • Imbens and Rubin (2015) Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • Jadidinejad et al. (2020) Amir H Jadidinejad, Craig Macdonald, and Iadh Ounis. 2020. Using exploration to alleviate closed loop effects in recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2025–2028.
  • Johari et al. (2022) Ramesh Johari, Hannah Li, Inessa Liskovich, and Gabriel Y Weintraub. 2022. Experimental design in two-sided platforms: An analysis of bias. Management Science (2022).
  • Karlsson (2020) Niklas Karlsson. 2020. Feedback control in programmatic advertising: The frontier of optimization in real-time bidding. IEEE Control Systems Magazine 40, 5 (2020), 40–77.
  • Khenissi (2022) Sami Khenissi. 2022. Modeling and debiasing feedback loops in collaborative filtering recommender systems. (2022).
  • Krauth et al. (2022) Karl Krauth, Yixin Wang, and Michael I Jordan. 2022. Breaking feedback loops in recommender systems with causal inference. arXiv preprint arXiv:2207.01616 (2022).
  • Li et al. (2023) Shuangning Li, Ramesh Johari, Stefan Wager, and Kuang Xu. 2023. Experimenting under Stochastic Congestion. arXiv preprint arXiv:2302.12093 (2023).
  • Li and Wager (2022) Shuangning Li and Stefan Wager. 2022. Random graph asymptotics for treatment effect estimation under network interference. The Annals of Statistics 50, 4 (2022), 2334–2358.
  • Liu et al. (2021) Min Liu, Jialiang Mao, and Kang Kang. 2021. Trustworthy and powerful online marketplace experimentation with budget-split design. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3319–3329.
  • Mansoury et al. (2020) Masoud Mansoury, Himan Abdollahpouri, Mykola Pechenizkiy, Bamshad Mobasher, and Robin Burke. 2020. Feedback loop and bias amplification in recommender systems. In Proceedings of the 29th ACM international conference on information & knowledge management. 2145–2148.
  • Musgrave et al. (2023) Paul Musgrave, Cuize Han, and Parth Gupta. 2023. Measuring service-level learning effects in search via query-randomized experiments. (2023).
  • Nandy et al. (2021) Preetam Nandy, Divya Venugopalan, Chun Lo, and Shaunak Chatterjee. 2021. A/b testing for recommender systems in a two-sided marketplace. Advances in Neural Information Processing Systems 34 (2021), 6466–6477.
  • Ni et al. (2023) Tu Ni, Iavor Bojinov, and Jinglong Zhao. 2023. Design of Panel Experiments with Spatial and Temporal Interference. Available at SSRN 4466598 (2023).
  • Pan et al. (2021) Weishen Pan, Sen Cui, Hongyi Wen, Kun Chen, Changshui Zhang, and Fei Wang. 2021. Correcting the user feedback-loop bias for recommendation systems. arXiv preprint arXiv:2109.06037 (2021).
  • Pearl (2000) Judea Pearl. 2000. Causality: Models, Reasoning and Inference. Cambridge, UK: Cambridge University Press.
  • Pouget-Abadie et al. (2019) Jean Pouget-Abadie, Kevin Aydin, Warren Schudy, Kay Brodersen, and Vahab Mirrokni. 2019. Variance reduction in bipartite experiments through correlation clustering. Advances in Neural Information Processing Systems 32 (2019).
  • Si (2023) Nian Si. 2023. Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach. arXiv preprint arXiv:2310.17496 (2023).
  • Si et al. (2022) Nian Si, San Gultekin, Jose Blanchet, and Aaron Flores. 2022. Optimal Bidding and Experimentation for Multi-Layer Auctions in Online Advertising. Available at SSRN (2022). https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4358914
  • Ugander et al. (2013) Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. 2013. Graph cluster randomization: Network exposure to multiple universes. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 329–337.
  • Ugander and Yin (2023) Johan Ugander and Hao Yin. 2023. Randomized graph cluster randomization. Journal of Causal Inference 11, 1 (2023), 20220014.
  • Wang and Ba (2023) Yan Wang and Shan Ba. 2023. Producer-Side Experiments Based on Counterfactual Interleaving Designs for Online Recommender Systems. arXiv preprint arXiv:2310.16294 (2023).
  • Xiong et al. (2019) Ruoxuan Xiong, Susan Athey, Mohsen Bayati, and Guido Imbens. 2019. Optimal experimental design for staggered rollouts. arXiv preprint arXiv:1911.03764 (2019).
  • Xiong et al. (2023b) Ruoxuan Xiong, Alex Chin, Mohsen Bayati, and Sean Taylor. 2023b. Data-Driven Switchback Design. preprint (2023). https://www.ruoxuanxiong.com/data-driven-switchback-design.pdf
  • Xiong et al. (2023a) Ruoxuan Xiong, Alex Chin, and Sean Taylor. 2023a. Bias-variance tradeoffs for designing simultaneous temporal experiments. In The KDD’23 Workshop on Causal Discovery, Prediction and Decision. PMLR, 115–131.
  • Yang et al. (2023) Mengyue Yang, Jun Wang, and Jean-Francois Ton. 2023. Rectifying unfairness in recommendation feedback loop. In Proceedings of the 46th international ACM SIGIR Conference on Research and Development in Information Retrieval. 28–37.
  • Ye et al. (2023) Zikun Ye, Dennis J Zhang, Heng Zhang, Renyu Zhang, Xin Chen, and Zhiwei Xu. 2023. Cold start to improve market thickness on online advertising platforms: Data-driven algorithms and field experiments. Management Science 69, 7 (2023), 3838–3860.
  • Yu et al. (2022) Christina Lee Yu, Edoardo M Airoldi, Christian Borgs, and Jennifer T Chayes. 2022. Estimating the total treatment effect in randomized experiments with unknown network structure. Proceedings of the National Academy of Sciences 119, 44 (2022), e2208975119.

Appendix A Proofs

Proof of Proposition 1.

Recall that in this case without feedback loops, we have $\Psi^T(s,\hat{e})=\Psi^C(s,\hat{e})=\hat{e}$. Therefore, we have

riGT(t)=riT(t)=riC(t)=riGC(t),r_{i}^{GT}(t)=r_{i}^{T}(t)=r_{i}^{C}(t)=r_{i}^{GC}(t),

for i=1,2,,Ni=1,2,\ldots,N and t[0,H]t\in[0,H], because of

e^iGT(t)=e^iT(t)=e^iC(t)=e^iGC(t).\hat{e}_{i}^{GT}(t)=\hat{e}_{i}^{T}(t)=\hat{e}_{i}^{C}(t)=\hat{e}_{i}^{GC}(t).

It further yields

IiGT(t)=IiT(t)=IiC(t)=IiGC(t),I_{i}^{GT}(t)=I_{i}^{T}(t)=I_{i}^{C}(t)=I_{i}^{GC}(t),

which means

\mathbb{E}\left[\left(\frac{1}{p}\int_{0}^{H}O_{i}^{T}(t)dt\right)\mathbb{I}\{i\in\mathcal{T}\}\right]=\int_{0}^{H}O_{i}^{GT}(t)dt\text{ and}
\mathbb{E}\left[\left(\frac{1}{1-p}\int_{0}^{H}O_{i}^{C}(t)dt\right)\mathbb{I}\{i\in\mathcal{C}\}\right]=\int_{0}^{H}O_{i}^{GC}(t)dt,

for $i=1,2,\ldots,N$. The desired results then follow. ∎
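
To make the expectation step explicit, note that the equalities above imply $O_{i}^{T}(t)=O_{i}^{GT}(t)$ and $O_{i}^{C}(t)=O_{i}^{GC}(t)$, so the observed outcomes do not depend on the random assignment. Assuming, as the weights $1/p$ and $1/(1-p)$ suggest, that each seller is placed in the treatment group with probability $p$ (so that $\mathbb{E}[\mathbb{I}\{i\in\mathcal{T}\}]=p$ and $\mathbb{E}[\mathbb{I}\{i\in\mathcal{C}\}]=1-p$), a short calculation gives

\mathbb{E}\left[\left(\frac{1}{p}\int_{0}^{H}O_{i}^{T}(t)dt\right)\mathbb{I}\{i\in\mathcal{T}\}\right]=\frac{1}{p}\left(\int_{0}^{H}O_{i}^{GT}(t)dt\right)\mathbb{E}\left[\mathbb{I}\{i\in\mathcal{T}\}\right]=\int_{0}^{H}O_{i}^{GT}(t)dt,

and analogously for the control-group identity with $1-p$ in place of $p$.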

Proof of Proposition 2.

From (24), we know

r_{i}^{GT}(t)=r_{i}^{T}(t)\text{ and }r_{i}^{C}(t)=r_{i}^{GC}(t),

for $i=1,2,\ldots,N$ and $t\in[0,H]$, given that

\hat{e}_{i}^{GT}(t)=\hat{e}_{i}^{T}(t)\text{ and }\hat{e}_{i}^{C}(t)=\hat{e}_{i}^{GC}(t).

Then, by comparing (20), (22) with (7) and (21), (23) with (11), we observe

I_{i}^{GT}(t)=I_{i}^{T}(t)\text{ and }I_{i}^{C}(t)=I_{i}^{GC}(t).

The rest of the proof follows that of Proposition 1. ∎

Proof of Theorem 1.

We prove the following claim

\int_{0}^{t}O_{i}^{E}(s)ds\leq\int_{0}^{t}O_{i}^{GT}(s)ds,

for $i=1,2,\ldots,N$ and $t\in[0,H]$ by contradiction. Suppose that there exist some $i\in\{1,2,\ldots,N\}$ and $t\in[0,H]$ such that

\int_{0}^{t}O_{i}^{E}(s)ds>\int_{0}^{t}O_{i}^{GT}(s)ds.

We first identify when this inequality occurs for the first time:

\tau=\inf\left\{t:\text{ there exists }i\text{ such that }\int_{0}^{t}O_{i}^{E}(s)ds>\int_{0}^{t}O_{i}^{GT}(s)ds\right\},

and let $\mathcal{I}^{\ast}$ contain the corresponding sellers. That is, for $i^{*}\in\mathcal{I}^{*}$, we have

\tau=\inf\left\{t:\int_{0}^{t}O_{i^{*}}^{E}(s)ds>\int_{0}^{t}O_{i^{*}}^{GT}(s)ds\right\}, (41)

and for $j\notin\mathcal{I}^{*}$, there exists $t_{0}>\tau$ such that for $t\in[\tau,t_{0}]$,

\int_{0}^{t}O_{j}^{E}(s)ds\leq\int_{0}^{t}O_{j}^{GT}(s)ds.

By continuity, we have

\int_{0}^{\tau}O_{i^{\ast}}^{E}(s)ds=\int_{0}^{\tau}O_{i^{\ast}}^{GT}(s)ds,\text{ for }i^{\ast}\in\mathcal{I}^{\ast}, (42)

which implies $S_{i^{*}}^{E}(\tau)=S_{i^{*}}^{GT}(\tau)$ for $i^{\ast}\in\mathcal{I}^{\ast}$.
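
To spell out the continuity step (a short elaboration using only the definition of $\tau$ and the continuity of the cumulative outcome processes): by the definition of $\tau$ as an infimum, $\int_{0}^{t}O_{i}^{E}(s)ds\leq\int_{0}^{t}O_{i}^{GT}(s)ds$ for every $i$ and every $t<\tau$, and letting $t\uparrow\tau$ gives this inequality at $\tau$ as well; on the other hand, by the definition of $\mathcal{I}^{\ast}$ there exist times $t_{n}\downarrow\tau$ at which the strict reverse inequality holds for $i^{\ast}$, and letting $n\to\infty$ gives the reverse inequality at $\tau$. Combining the two yields (42).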

Due to the inverse monotonicity in Assumption 1.(2), we have

r_{i^{\ast}}^{E}(\tau)=r_{i^{\ast}}^{GT}(\tau)\text{ for }i^{\ast}\in\mathcal{I}^{\ast}, (43)
r_{j}^{E}(\tau)\geq r_{j}^{GT}(\tau)\text{ for }j\notin\mathcal{I}^{\ast}. (44)

Then, by Assumption 1.(1), we have

f_{i^{\ast}}^{NE}(\tau)
=f_{i^{\ast}}\left(\left\{r_{j}^{E}(\tau),j=1,2,\ldots,N\right\},\underline{R}(\tau)\right)
\leq f_{i^{\ast}}\left(\left\{r_{j}^{GT}(\tau),j=1,2,\ldots,N\right\},\underline{R}(\tau)\right)
=f_{i^{\ast}}^{GT}(\tau).

We then consider three different cases.

Case 1: There exists some $i^{\ast}\in\mathcal{I}^{\ast}$ such that

f_{i^{\ast}}^{GT}(\tau)>0.

By Assumption 3, there exists $t_{1}>\tau$ such that

f_{i^{\ast}}^{GT}(t)\geq 0,\text{ for }t\in[\tau,t_{1}],

which means $I_{i^{\ast}}^{GT}(t)=1$ for $t\in[\tau,t_{1}]$. Then, for $t\in[\tau,t_{1}]$, we have

\int_{0}^{t}O_{i^{\ast}}^{GT}(s)ds=S_{i^{\ast}}^{GT}(\tau)+\int_{\tau}^{t}e_{i^{\ast}}^{GT}(s)ds
\geq S_{i^{\ast}}^{E}(\tau)+\int_{\tau}^{t}I_{i^{\ast}}^{E}(s)e_{i^{\ast}}^{E}(s)ds
=\int_{0}^{t}O_{i^{\ast}}^{E}(s)ds,

which contradicts the definition of $\tau$, as shown in (41).

Case 2: There exists some $i^{\ast}\in\mathcal{I}^{\ast}$ such that

f_{i^{\ast}}^{NE}(\tau)<0.

By Assumption 3, there exists $t_{2}>\tau$ such that

f_{i^{\ast}}^{NE}(t)<0,\text{ for }t\in[\tau,t_{2}],

which means $I_{i^{\ast}}^{E}(t)=0$ for $t\in[\tau,t_{2}]$. Then, for $t\in[\tau,t_{2}]$, we have

\int_{0}^{t}O_{i^{\ast}}^{E}(s)ds=S_{i^{\ast}}^{E}(\tau)
\leq S_{i^{\ast}}^{GT}(\tau)+\int_{\tau}^{t}I_{i^{\ast}}^{GT}(s)e_{i^{\ast}}^{GT}(s)ds
=\int_{0}^{t}O_{i^{\ast}}^{GT}(s)ds,

which contradicts the definition of $\tau$, as shown in (41).

Case 3: For all $i^{\ast}\in\mathcal{I}^{\ast}$, we have

f_{i^{\ast}}^{GT}(\tau)=f_{i^{\ast}}^{NE}(\tau)=0.

If $f_{i^{\ast}}^{GT}(\tau+)>0$, it reduces to Case 1. Similarly, if $f_{i^{\ast}}^{NE}(\tau+)<0$, it reduces to Case 2. Otherwise, there must exist some $t_{3}\in(\tau,t_{0})$ such that

f_{i^{\ast}}^{GT}(t)<0\text{ and }f_{i^{\ast}}^{NE}(t)>0, (45)

for $t\in(\tau,t_{3}]$ and $i^{\ast}\in\mathcal{I}^{\ast}$. It further yields

S_{i^{\ast}}^{E}(t)\geq S_{i^{\ast}}^{GT}(t),\text{ for }i^{\ast}\in\mathcal{I}^{\ast},\ t\in[\tau,t_{3}], (46)
S_{j}^{E}(t)\leq S_{j}^{GT}(t),\text{ for }j\notin\mathcal{I}^{\ast},\ t\in[\tau,t_{3}]. (47)

Due to the inverse monotonicity in Assumption 1.(2), we have

r_{i^{\ast}}^{E}(t)\leq r_{i^{\ast}}^{GT}(t)\text{ for }i^{\ast}\in\mathcal{I}^{\ast},\ t\in[\tau,t_{3}], (48)
r_{j}^{E}(t)\geq r_{j}^{GT}(t)\text{ for }j\notin\mathcal{I}^{\ast},\ t\in[\tau,t_{3}]. (49)

Then, by Assumption 1.(1), we have for $t\in(\tau,t_{3}]$,

\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{NE}(t)
=\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}\left(\left\{r_{j}^{E}(t),j=1,2,\ldots,N\right\},\underline{R}(t)\right)
\leq\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}\left(\left\{r_{i^{\ast}}^{E}(t),i^{\ast}\in\mathcal{I}^{\ast}\right\}\cup\left\{r_{j}^{GT}(t),j\notin\mathcal{I}^{\ast}\right\},\underline{R}(t)\right)
\leq\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}\left(\left\{r_{i^{\ast}}^{GT}(t),i^{\ast}\in\mathcal{I}^{\ast}\right\}\cup\left\{r_{j}^{GT}(t),j\notin\mathcal{I}^{\ast}\right\},\underline{R}(t)\right)
=\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{GT}(t),

which contradicts inequality (45), since (45) implies $\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{NE}(t)>0>\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{GT}(t)$.

Similarly, we can show

\int_{0}^{t}O_{i}^{E}(s)ds\geq\int_{0}^{t}O_{i}^{GC}(s)ds,

for $i=1,2,\ldots,N$ and $t\in[0,H]$. ∎
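
In other words, the two claims established above combine into the per-seller sandwich bound

\int_{0}^{t}O_{i}^{GC}(s)ds\leq\int_{0}^{t}O_{i}^{E}(s)ds\leq\int_{0}^{t}O_{i}^{GT}(s)ds,

for $i=1,2,\ldots,N$ and $t\in[0,H]$: every seller's cumulative outcome in the experiment is bounded below by its global-control counterpart and above by its global-treatment counterpart.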

Proof of Theorem 3.

We proceed in the same manner as the proof of Theorem 1.

We prove the following claim

\int_{0}^{t}O_{i}^{E}(s)ds\leq\int_{0}^{t}O_{i}^{GT}(s)ds,

for $i=1,2,\ldots,N$ and $t\in[0,H]$ by contradiction. Suppose that there exist some $i\in\{1,2,\ldots,N\}$ and $t\in[0,H]$ such that

\int_{0}^{t}O_{i}^{E}(s)ds>\int_{0}^{t}O_{i}^{GT}(s)ds.

We first identify when this inequality occurs for the first time:

\tau=\inf\left\{t:\text{ there exists some }i\text{ s.t. }\int_{0}^{t}O_{i}^{E}(s)ds>\int_{0}^{t}O_{i}^{GT}(s)ds\right\},

and let $\mathcal{I}^{\ast}$ contain the corresponding sellers. By continuity, we have

\int_{0}^{\tau}O_{i^{\ast}}^{E}(s)ds=\int_{0}^{\tau}O_{i^{\ast}}^{GT}(s)ds,\text{ for }i^{\ast}\in\mathcal{I}^{\ast}. (50)

Similar to the proof of Theorem 1, we have

f_{i^{\ast}}^{CE,T}(\tau)
=f_{i^{\ast}}\left(\left\{r_{j}^{T}(\tau),j=1,2,\ldots,N\right\},\underline{R}(\tau)\right)
\leq f_{i^{\ast}}\left(\left\{r_{j}^{GT}(\tau),j=1,2,\ldots,N\right\},\underline{R}(\tau)\right)
=f_{i^{\ast}}^{GT}(\tau).
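
Here, the inequality in the middle is obtained in the same way as (43)–(44): by (50), the definition of $\tau$, and the inverse monotonicity in Assumption 1.(2) (the same step that yields (55)–(56) below, evaluated at $t=\tau$), one obtains

r_{i^{\ast}}^{T}(\tau)=r_{i^{\ast}}^{GT}(\tau)\text{ for }i^{\ast}\in\mathcal{I}^{\ast}\text{ and }r_{j}^{T}(\tau)\geq r_{j}^{GT}(\tau)\text{ for }j\notin\mathcal{I}^{\ast},

after which Assumption 1.(1) applies exactly as in the proof of Theorem 1.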

Cases 1 and 2 are essentially the same as in the proof of Theorem 1.

Case 3: For all $i^{\ast}\in\mathcal{I}^{\ast}$, we have

f_{i^{\ast}}^{GT}(\tau)=0,
f_{i^{\ast}}^{CE,T}(\tau)=0\text{ if }i^{\ast}\in\mathcal{T}\text{ and }f_{i^{\ast}}^{CE,C}(\tau)=0\text{ if }i^{\ast}\in\mathcal{C}.

There must exist some $t_{3}\in(\tau,t_{0})$ such that

f_{i^{\ast}}^{GT}(t)<0, (51)
f_{i^{\ast}}^{CE,T}(t)>0\text{ if }i^{\ast}\in\mathcal{T}\text{ and }f_{i^{\ast}}^{CE,C}(t)>0\text{ if }i^{\ast}\in\mathcal{C}, (52)

for $t\in(\tau,t_{3}]$ and $i^{\ast}\in\mathcal{I}^{\ast}$.

It further yields

S_{i^{\ast}}^{E}(t)\geq S_{i^{\ast}}^{GT}(t),\text{ for }i^{\ast}\in\mathcal{I}^{\ast},\ t\in[\tau,t_{3}], (53)
S_{j}^{E}(t)\leq S_{j}^{GT}(t),\text{ for }j\notin\mathcal{I}^{\ast},\ t\in[\tau,t_{3}]. (54)

Due to the inverse monotonicity in Assumption 1.(2), we have

r_{i^{\ast}}^{T}(t)\leq r_{i^{\ast}}^{GT}(t)\text{ for }i^{\ast}\in\mathcal{I}^{\ast},\ t\in[\tau,t_{3}], (55)
r_{j}^{T}(t)\geq r_{j}^{GT}(t)\text{ for }j\notin\mathcal{I}^{\ast},\ t\in[\tau,t_{3}]. (56)

First, by Assumption 4, we have

f_{i^{\ast}}^{CE,C}(t)
=f_{i^{\ast}}\left(\left\{r_{j}^{C}(t),j=1,2,\ldots,N\right\},\underline{R}(t)\right)
\leq f_{i^{\ast}}\left(\left\{\alpha r_{j}^{C}(t),j=1,2,\ldots,N\right\},\underline{R}(t)\right)
=f_{i^{\ast}}\left(\left\{r_{j}^{T}(t),j=1,2,\ldots,N\right\},\underline{R}(t)\right)
\leq f_{i^{\ast}}^{CE,T}(t).

Combining this with inequalities (51) and (52), we obtain

f_{i^{\ast}}^{GT}(t)<0,\quad f_{i^{\ast}}^{CE,T}(t)>0, (57)

for $t\in(\tau,t_{3}]$ and $i^{\ast}\in\mathcal{I}^{\ast}$.

Then, by Assumption 1.(1) and the inequalities (55) and (56), we have for $t\in(\tau,t_{3}]$,

\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{CE,T}(t)
=\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}\left(\left\{r_{j}^{T}(t),j=1,2,\ldots,N\right\},\underline{R}(t)\right)
\leq\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}\left(\left\{r_{i^{\ast}}^{T}(t),i^{\ast}\in\mathcal{I}^{\ast}\right\}\cup\left\{r_{j}^{GT}(t),j\notin\mathcal{I}^{\ast}\right\},\underline{R}(t)\right)
\leq\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}\left(\left\{r_{i^{\ast}}^{GT}(t),i^{\ast}\in\mathcal{I}^{\ast}\right\}\cup\left\{r_{j}^{GT}(t),j\notin\mathcal{I}^{\ast}\right\},\underline{R}(t)\right)
=\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{GT}(t),

which contradicts the inequalities in (57), since they imply $\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{CE,T}(t)>0>\sum_{i^{\ast}\in\mathcal{I}^{\ast}}f_{i^{\ast}}^{GT}(t)$. ∎