
ColdNAS: Search to Modulate for User Cold-Start Recommendation

Shiguang Wu, Department of Electronic Engineering, Tsinghua University, Beijing, China, [email protected]
Yaqing Wang, Baidu Inc., Beijing, China, [email protected]
Qinghe Jing, Baidu Inc., Beijing, China, [email protected]
Daxiang Dong, Baidu Inc., Beijing, China, [email protected]
Dejing Dou, Baidu Inc., Beijing, China, [email protected]
Quanming Yao, Department of Electronic Engineering, Tsinghua University, Beijing, China, [email protected]
(2023)
Abstract.

Making personalized recommendations for cold-start users, who have only a few interaction histories, is a challenging problem in recommendation systems. Recent works leverage hypernetworks to directly map user interaction histories to user-specific parameters, which are then used to modulate the predictor via a feature-wise linear modulation function, and obtain state-of-the-art performance. However, the physical meaning of scaling and shifting in recommendation data is unclear. Instead of using a fixed modulation function and deciding the modulation position by expertise, we propose a modulation framework called ColdNAS for the user cold-start problem, where we look for the proper modulation structure, including function and position, via neural architecture search. We design a search space which covers broad models and theoretically prove that this search space can be transformed to a much smaller space, enabling an efficient and robust one-shot search algorithm. Extensive experimental results on benchmark datasets show that ColdNAS consistently performs the best. We observe that different modulation functions lead to the best performance on different datasets, which validates the necessity of designing a search-based method. Codes are available at https://github.com/LARS-research/ColdNAS.

User Cold-Start Recommendation, Neural Architecture Search, Few-Shot Learning, Meta-Learning, Hypernetworks
copyright: acmlicensed; journalyear: 2023; conference: Proceedings of the ACM Web Conference 2023, May 1–5, 2023, Austin, TX, USA; booktitle: Proceedings of the ACM Web Conference 2023 (WWW ’23), May 1–5, 2023, Austin, TX, USA; price: 15.00; doi: 10.1145/3543507.3583344; isbn: 978-1-4503-9416-1/23/04; ccs: Information systems – Recommender systems; ccs: Computing methodologies – Supervised learning; ccs: Computing methodologies – Neural networks

1. Introduction

Recommendation systems (RSs) (Resnick and Varian, 1997) aim to provide suggestions for items that are most pertinent to a particular user, such as movie recommendation (Harper and Konstan, 2015) and book recommendation (Ziegler et al., 2005). Nowadays, RSs are abundant online, offering numerous users a convenient way to shop regardless of location and time, and providing tailored suggestions according to their preferences. However, user cold-start recommendation (Schein et al., 2002) remains a severe problem in RSs. On the one hand, users in RSs follow a long-tail distribution (Park and Tuzhilin, 2008), so some users have only a few interaction histories. On the other hand, new users are continuously emerging, and they naturally have rated only a few items. The problem is even more challenging as modern RSs are mostly built with over-parameterized deep networks, which need a huge number of training samples to perform well and can easily overfit for cold-start users (Volkovs et al., 2017).

Figure 1. Architecture used in user cold-start models, which consists of an embedding layer, an adaptation network and a predictor. The modulation structure in red dotted lines is the search target of this paper.

The user cold-start recommendation problem can naturally be modeled as a few-shot learning problem (Wang et al., 2020), which aims to quickly generalize to new tasks (i.e., personalized recommendation for cold-start users) with a few training samples (i.e., a few interaction histories). A number of works (Lee et al., 2019; Dong et al., 2020; Lu et al., 2020; Yu et al., 2021; Wang et al., 2021) adopt the classic gradient-based meta-learning strategy called model-agnostic meta-learning (MAML) (Finn et al., 2017), which learns a good initialization from a set of tasks and adapts it to a new task by taking a few gradient descent updates on a limited number of labeled samples. This line of models has demonstrated great potential for alleviating the user cold-start problem. However, gradient-based meta-learning strategies require expertise to tune the optimization procedure to avoid over-fitting. Besides, the inference time can be long.

Instead of adapting to each user by fine-tuning via gradient descent, another line of works uses hypernetworks (Ha et al., 2017) to directly map user interaction histories to user-specific parameters (Dong et al., 2020; Lin et al., 2021; Feng et al., 2021; Pang et al., 2022). These modulation-based methods consist of an embedding layer, an adaptation network and a predictor. The adaptation network generates user-specific parameters, which are used to modulate the predictor in the form of a modulation function. In particular, they all adopt the feature-wise linear modulation function (FiLM) (Perez et al., 2018), which modulates the representation via scaling and shifting based on the conditioning information, to obtain user-specific representations. Although FiLM has proved highly effective on images (Requeima et al., 2019) and on graphs such as molecules and protein-protein interaction graphs (Brockschmidt, 2020), applying scaling and shifting to user interaction histories has a rather obscure physical meaning. Moreover, it is hard to decide where to modulate. Existing works modulate different parts of the model, such as the last layers of the decoder (Lin et al., 2021) or most layers in the model (Pang et al., 2022). How to modulate well for different users, and how to choose the right functions at the right positions to modulate, are still open questions.

In this paper, we propose ColdNAS to find an appropriate modulation structure for the user cold-start problem by neural architecture search (NAS). Although NAS methods have been applied in RSs (Gao et al., 2021a; Xie et al., 2021), the design of NAS methods is problem-specific. For the user cold-start problem, it is still unknown how to (i) design a search space that can cover effective cold-start models with good performance for various datasets, and (ii) design an efficient and robust search algorithm. To solve the above challenges, we design a search space of modulation structures, which covers not only existing modulation-based user cold-start models but also more flexible structures. We theoretically prove that the proposed search space can be transformed to an equivalent space, where we search efficiently and robustly by differentiable architecture search. Our main contributions are summarized as follows:

  • We propose ColdNAS, a modulation framework for user cold-start problem. We use a hypernetwork to map each user’s history interactions to user-specific parameters which are then used to modulate the predictor, and formulate how to modulate and where to modulate as a NAS problem.

  • We design a search space of modulation structures, which covers not only existing modulation-based user cold-start models but also more expressive structures. As this search space can be too large to search directly, we transform the original space to an equivalent but much smaller space, and provide theoretical analysis to validate the correctness of this transformation. On the transformed space, we can then search efficiently and robustly by a differentiable architecture search algorithm.

  • We perform extensive experiments on benchmark datasets for user cold-start problem, and observe that ColdNAS consistently obtains the state-of-the-art performance. We also validate the design consideration of search spaces and algorithms, demonstrating the strength and reasonableness of ColdNAS.

2. Related Works

2.1. User Cold-Start Recommendation

Making personalized recommendations for cold-start users is particularly challenging, as these users have only a few interaction histories (Schein et al., 2002). In the past, collaborative filtering (CF)-based methods (Koren, 2008; Sedhain et al., 2015; He et al., 2017), which make predictions by capturing interactions among users and items to represent them in a low-dimensional space, obtained leading performance in RSs. However, these CF-based methods make inferences based only on the user's history, and thus cannot handle the user cold-start problem. To alleviate the user cold-start problem, content-based methods leverage user/item features (Schein et al., 2002) or even user social relations (Lin et al., 2013) to help predict for cold-start users. The recent deep model DropoutNet (Volkovs et al., 2017) trains a neural network with a dropout mechanism applied on input samples and infers the missing data. However, it is hard to generalize these content-based methods to new users, which usually requires model retraining.

A recent trend is to model the user cold-start problem as a few-shot learning problem (Wang et al., 2020). The resultant models learn the ability to quickly generalize to recommend for new users with a few interaction histories. Most works follow the classic gradient-based meta-learning strategy MAML (Finn et al., 2017), which first learns a good initialization from training tasks, then locally updates the parameters on the provided interaction history by gradient descent. In particular, existing works consider different directions to improve the performance: MeLU (Lee et al., 2019) selectively adapts model parameters to the new task in the local update stage, MAMO (Dong et al., 2020) introduces external memory to guide the model to adapt, MetaHIN (Lu et al., 2020) uses heterogeneous information networks to leverage the rich semantics between users and items, REG-PAML (Yu et al., 2021) proposes a user-specific learning rate during local update, and PAML (Wang et al., 2021) leverages social relations to share information among similar users. While these approaches can adapt models to training data, they are computationally inefficient at test time, and usually require expertise to tune the optimization procedure to avoid over-fitting.

2.2. Neural Architecture Search

Neural architecture search (NAS) aims to find an architecture with good performance without human tuning (Hutter et al., 2019). Recently, NAS methods have been applied in RSs. SIF (Yao et al., 2020a) searches for the interaction function in collaborative filtering, and AutoCF (Gao et al., 2021b) further searches for basic components including input encoding, embedding function, interaction function, and prediction function in collaborative filtering. AutoFIS (Liu et al., 2020), AutoCTR (Song et al., 2020) and FIVES (Xie et al., 2021) search for effective feature interactions in click-through rate prediction. AutoLoss (Zhao et al., 2021) searches for loss functions in RSs. Due to different problem settings, search spaces need to be problem-specific and cannot be shared or transferred. Therefore, none of these works can be applied to the user cold-start recommendation problem. To search efficiently over a search space, one can choose reinforcement learning methods (Baker et al., 2017), evolutionary algorithms (Real et al., 2019), or one-shot differentiable architecture search algorithms (Liu et al., 2019; Pham et al., 2018; Yao et al., 2020b). Among them, one-shot differentiable architecture search algorithms have demonstrated higher efficiency. Instead of training and evaluating different models like classical methods, they optimize only one supernet where the model parameters are shared across the search space and co-adapted.

3. Proposed Method

In this section, we present the details of ColdNAS, whose overall architecture is shown in Figure 1. In the sequel, we first provide the formal formulation of the user cold-start problem (Section 3.1). Then, we present our search space and theoretically show how to transform it (Section 3.2). Finally, we introduce the search algorithm used to find a good user cold-start model (Section 3.3).

3.1. Problem Formulation

Let the user set be denoted \mathcal{U}=\{u_i\}, where each user u_i is associated with user features. The feature space is shared across all users. Let the item set be denoted \mathcal{V}=\{v_j\}, where each item v_j is also associated with item features. When a user u_i rates an item v_j, the rating is denoted y_{i,j}. In the user cold-start recommendation problem, the focus is to make personalized recommendations for a user u_i who has rated only a few items.

Following recent works (Bharadhwaj, 2019; Lu et al., 2020; Lin et al., 2021), we model the user cold-start recommendation problem as a few-shot learning problem. The target is to learn a model from a set of training user cold-start tasks \mathcal{T}^{\text{train}} and generalize to provide personalized recommendations for new tasks. Each task T_i corresponds to a user u_i, with a support set \mathcal{S}_i=\{(v_j,y_{i,j})\}_{j=1}^{N} containing existing interaction histories and a query set \mathcal{Q}_i=\{(v_j,y_{i,j})\}_{j=1}^{M} containing interactions to predict, where N and M are the numbers of interactions in \mathcal{S}_i and \mathcal{Q}_i respectively, and N is small.
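The task structure can be made concrete with a minimal Python sketch (names are hypothetical and only for illustration, not taken from the released code):

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class ColdStartTask:
        # One cold-start task T_i for user u_i.
        user_id: int                      # u_i
        support: List[Tuple[int, float]]  # S_i: [(item_id v_j, rating y_ij)], N pairs (small)
        query: List[Tuple[int, float]]    # Q_i: [(item_id v_j, rating y_ij)], M pairs to predict

    def make_task(user_id: int, interactions: List[Tuple[int, float]], n_support: int) -> ColdStartTask:
        # Split a user's interaction history into support and query sets.
        return ColdStartTask(user_id, interactions[:n_support], interactions[n_support:])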

3.2. Search Space

Existing modulation-based user cold-start works can be summarized into the following modulation framework consisting of three parts, i.e., an embedding layer, an adaptation network and a predictor, as plotted in Figure 1.

  • The embedding layer E with parameter \bm{\theta}_E embeds the categorical features from users and items into dense vectors, i.e., (\bm{u}_i,\bm{v}_j)=E(u_i,v_j;\bm{\theta}_E).

  • The adaptation network A with parameter \bm{\theta}_A takes the support set \mathcal{S}_i for a specific user u_i as input and generates user-specific adaptive parameters, i.e.,

    (1) \bm{\Phi}_i=\{\bm{\phi}_i^k\}_{k=1}^{C}=A(\mathcal{S}_i;\bm{\theta}_A),

    where C is the number of adaptive parameter groups for a certain modulation structure.

  • The predictor P with parameter \bm{\theta}_P takes the user-specific parameters \bm{\Phi}_i and v_q\in\mathcal{Q}_i from the query set as input, and generates the prediction by

    (2) \hat{y}_{i,j}=P((\bm{u}_i,\bm{v}_q),\bm{\Phi}_i;\bm{\theta}_P).

Compared with classical RS models, the extra adaptation network is introduced to handle cold-start users. To make personalized recommendations, for each u_i, the support set \mathcal{S}_i is first mapped to the user-specific parameters \bm{\Phi}_i by (1). Then, taking the features of the target item v_q\in\mathcal{Q}_i, the user features u_i, and \bm{\Phi}_i, the prediction is made as in (2). Consequently, how to use the user-specific parameters \bm{\Phi}_i to change the prediction process in (2) can be crucial to the performance. Usually, a multi-layer perceptron (MLP) is used as P (Cheng et al., 2016; He et al., 2017; Lin et al., 2021). Assume an L-layer MLP is used, let \bm{h}^l denote its output from the l-th layer, and let \bm{h}^0=(\bm{u}_i,\bm{v}_q) for notational simplicity. For the i-th user, \bm{h}^{l+1} is computed from the modulated hidden state as

(3) \bm{h}_i^l = M^l(\bm{h}^l, \bm{\Phi}_i),
(4) \bm{h}^{l+1} = \text{ReLU}(\bm{W}_P^l \bm{h}_i^l + \bm{b}_P^l),

where M^l is the modulation function, and \bm{W}_P^l and \bm{b}_P^l are learnable weights at the l-th layer. This M^l controls how \bm{h}^l is personalized w.r.t. the i-th user. The recent TaNP (Lin et al., 2021) directly lets M^l in (3) adopt the form of FiLM for all L MLP layers:

(5) \bm{h}_i^l = \bm{h}^l \odot \bm{\phi}_i^1 + \bm{\phi}_i^2.

FiLM applies a feature-wise affine transformation on the intermediate features, and has proved highly effective in other domains (Requeima et al., 2019; Brockschmidt, 2020). However, users and items can have diverse interaction patterns. For example, both the inner product and the summation have been used in RSs to measure a user's preference over items (Koren, 2008; Hsieh et al., 2017).
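As a reference point for the search space below, here is a minimal PyTorch sketch of the FiLM modulation in (5); the scale and shift vectors are the user-specific parameters produced by the adaptation network, and the function name is ours:

    import torch

    def film_modulate(h: torch.Tensor, phi_scale: torch.Tensor, phi_shift: torch.Tensor) -> torch.Tensor:
        # Eq. (5): feature-wise scale (Hadamard product) and shift of the hidden state h.
        return h * phi_scale + phi_shift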

In order to find the appropriate modulation function, we are motivated to search M^l for different recommendation tasks. We design the following space for M^l:

(6) \bm{h}_i^l = \bm{h}^l \circ_{\text{op}^1} \bm{\phi}_i^1 \circ_{\text{op}^2} \bm{\phi}_i^2 \cdots \circ_{\text{op}^C} \bm{\phi}_i^C,

where each \circ_{\text{op}^k} is chosen from

\circ_{\text{op}^k} \in \mathcal{O} \equiv \{\max, \min, \odot, /, +, -\}.

These are all commonly used, simple, dimension-preserving binary operations. We choose them to avoid the resultant M^l being too complex, which could easily overfit for cold-start users.

The search space in (6) can be viewed as a binary tree, as shown in Figure 2(a). Since M^l can be different for each layer, the size of this space is 6^{C\times L}. A larger C leads to a larger search space, which has a higher potential of containing the appropriate modulation function but is also more challenging to search effectively.
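The candidate set \mathcal{O} and the chained form of (6) can be sketched in a few lines of PyTorch (a sketch for illustration; operation names are ours, not from the released code):

    import torch

    # Candidate operation set O in Eq. (6): dimension-preserving binary operations.
    OPS = {
        "max": torch.maximum,
        "min": torch.minimum,
        "mul": torch.mul,   # Hadamard product
        "div": torch.div,
        "add": torch.add,
        "sub": torch.sub,
    }

    def modulate(h: torch.Tensor, op_names, phis) -> torch.Tensor:
        # Apply a chain of C operations to h: h op1 phi^1 op2 phi^2 ... opC phi^C.
        for name, phi in zip(op_names, phis):
            h = OPS[name](h, phi)
        return h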

(a) Original search space. (b) Transformed search space. (c) A layer in supernet.
Figure 2. Illustration of our Proposition 3.1, and the structure of the supernet to search on the reduced space.

3.3. Search Strategy

We aim to find an efficient and practicable search algorithm for the proposed search space. However, the search space can be very large, and differentiable NAS methods are known to be fragile on large spaces (Yu et al., 2019). In the sequel, we propose to first transform the original search space to an equivalent but much smaller space, as illustrated in Figure 2, where the equivalence is inspired by similarities between the operations and the expressiveness of deep neural networks. On the transformed space, we design a supernet structure to conduct efficient and robust differentiable search.

3.3.1. Search Space Transformation

Though the aforementioned search space can be very large, we can transform it to an equivalent space of size 2^{4\times L}, which is invariant w.r.t. C, as proved in Proposition 3.1 below (the proof is in Appendix B).

Proposition 3.1 (Search Space Transformation).

Assume the adaptation network A is expressive enough. Any M^l with a form of (6), where \circ_{\text{op}^k}\in\mathcal{O}, C is any non-negative integer, and \bm{\phi}_i^k\in\bm{\Phi}_i=A(\mathcal{S}_i,\bm{\theta}_A), can be represented as

(7) \bm{h}_i^l = \min(\max(\bm{h}^l, \hat{\bm{\phi}}_i^1), \hat{\bm{\phi}}_i^2) \odot \hat{\bm{\phi}}_i^3 + \hat{\bm{\phi}}_i^4,

and the above four operations are permutation-invariant.

The intuitions are as follows. First, the operations in \mathcal{O} can be divided into four groups: G_1=\{\max\}, G_2=\{\min\}, G_3=\{+,-\}, G_4=\{\odot,/\}. Then, with a mild assumption on the adaptation network, we can prove two important properties: (i) inner-group consistency: successive operations in the same group can be associated into one; and (ii) inter-group permutation-invariance: operations that are not in the same group are permutation-invariant, meaning any two of them can be swapped. Thus, we can repeatedly commute operations until operations in the same group are neighbors, and then associate the operations within each of the four groups.
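For concreteness, the canonical form (7) can be written as a one-line PyTorch function (a sketch; the function name is ours):

    import torch

    def transformed_modulate(h, phi1, phi2, phi3, phi4):
        # Eq. (7): min(max(h, phi^1), phi^2) * phi^3 + phi^4.
        return torch.minimum(torch.maximum(h, phi1), phi2) * phi3 + phi4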

Remark 1.

To better understand Proposition 3.1, let us check two examples.

  1. \min(\max(\bm{h}^l,\bm{\phi}_i^1)+\bm{\phi}_i^2-\bm{\phi}_i^3,\bm{\phi}_i^4)\odot\bm{\phi}_i^5 equals (7) with \hat{\bm{\phi}}_i^1=\bm{\phi}_i^1, \hat{\bm{\phi}}_i^2=\bm{\phi}_i^4-\bm{\phi}_i^2+\bm{\phi}_i^3, \hat{\bm{\phi}}_i^3=\bm{\phi}_i^5, \hat{\bm{\phi}}_i^4=(\bm{\phi}_i^2-\bm{\phi}_i^3)\odot\bm{\phi}_i^5; and

  2. \max(\min(\bm{h}^l+\bm{\phi}_i^1,\bm{\phi}_i^2),\bm{\phi}_i^3)\odot\bm{\phi}_i^4 also equals (7) with \hat{\bm{\phi}}_i^1=\bm{\phi}_i^3-\bm{\phi}_i^1, \hat{\bm{\phi}}_i^2=\bm{\phi}_i^2-\bm{\phi}_i^1, \hat{\bm{\phi}}_i^3=\bm{\phi}_i^4, \hat{\bm{\phi}}_i^4=\bm{\phi}_i^1\odot\bm{\phi}_i^4.
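As a quick numerical sanity check of example 1 above (a sketch we add for illustration, assuming the parameter on the \odot side is non-negative, as in Condition 1 of Appendix B):

    import numpy as np

    rng = np.random.default_rng(0)
    h = rng.normal(size=5)
    p1, p2, p3, p4 = (rng.normal(size=5) for _ in range(4))
    p5 = rng.uniform(0.1, 2.0, size=5)   # non-negative input to the Hadamard product

    lhs = np.minimum(np.maximum(h, p1) + p2 - p3, p4) * p5
    rhs = np.minimum(np.maximum(h, p1), p4 - p2 + p3) * p5 + (p2 - p3) * p5
    assert np.allclose(lhs, rhs)         # both forms agree element-wise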

Note that due to the universal approximation ability of deep networks (Hornik et al., 1989), the assumption in this proposition can be easily satisfied. For example, we implement A based on a two-layer MLP in our experiments (see Appendix A.2), which already ensures good performance (see Section 4.3.3). After the transformation, the space in (7) also spans a permutation-invariant binary tree, as plotted in Figure 2(b). Such a space can be significantly smaller than the original one, as explained in Remark 2 below.

Remark 2.

The space transformation plays an essential role in ColdNAS. Table 1 helps to illustrate to what extent the proposition reduces the search space: taking the layer number L=4, the ratio is calculated as \frac{\text{original space}}{\text{transformed space}}=\frac{6^{C\times 4}}{2^{4\times 4}}. For example, with C=4, the ratio is 6^{16}/2^{16}=3^{16}\approx 4.3\times 10^{7}.

Table 1. The reduction ratio w.r.t. different C.
C      | 1                | 2               | 3               | 4               | 5                | 6
Ratio  | 2.0\times10^{-2} | 2.6\times10^{1} | 3.3\times10^{4} | 4.3\times10^{7} | 5.6\times10^{10} | 7.2\times10^{13}

Note that when C=1, the transformation does not reduce the space. However, such a case is not meaningful, as it suffers from poor performance due to the lack of flexibility of the modulation function (see Section 4.3.2). The transformed space enables efficient and robust differentiable search by reducing the number of architecture parameters from 6\times C\times L to 4\times L. Meanwhile, the space size is reduced from 6^{C\times L} to 2^{4\times L}, which is a reduction for any C>1.

3.3.2. Construction of the Supernet

For each M^l, since there are at most 4 operations and they are permutation-invariant, we only need to decide whether to take each operation or not, in an arbitrary order. We search by introducing differentiable parameters to weigh the operations and optimizing these weights. For the l-th layer of the predictor, we have

(8) \hat{\bm{h}}^{l,k+1} = \alpha^{l,k+1}(\hat{\bm{h}}^{l,k} \circ_{\text{op}^{k+1}} \bm{\phi}_i^{k+1}) + (1-\alpha^{l,k+1})\hat{\bm{h}}^{l,k},

where \alpha^{l,k+1} is a weight measuring operation \circ_{\text{op}^{k+1}} in M^l, k\in\{0,1,2,3\} and \{\circ_{\text{op}^{k+1}}\}_{k=0}^{3}=\{\max,\min,\odot,+\}. For notational simplicity, we let \hat{\bm{h}}^{l,0}=\bm{h}^l and \bm{h}_i^l=\hat{\bm{h}}^{l,4}. We construct the supernet by replacing (3) with (8), i.e., replacing every M^l in red dotted lines in Figure 1 with the structure shown in Figure 2(c).
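A minimal PyTorch sketch of one such supernet layer (8) is given below; the class name and the initialization of the architecture weights are ours and only illustrative:

    import torch
    import torch.nn as nn

    class SupernetModulation(nn.Module):
        # One M^l in the supernet, Eq. (8): each of the 4 remaining candidate operations
        # is weighted by a learnable architecture parameter alpha^{l,k}.
        OPS = (torch.maximum, torch.minimum, torch.mul, torch.add)

        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((4,), 0.5))  # alpha^{l,1..4}

        def forward(self, h, phis):  # phis: four user-specific vectors phi_i^{1..4}
            for k, (op, phi) in enumerate(zip(self.OPS, phis)):
                h = self.alpha[k] * op(h, phi) + (1 - self.alpha[k]) * h
            return h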

3.3.3. Complete Algorithm

The complete algorithm is summarized in Algorithm 1. We first optimize the supernet and make a selection to determine the structure, then reconstruct the model with the determined structure and retrain it for inference.

Algorithm 1 Training procedure of ColdNAS.
0:  Learning rate \beta, number of operations to keep K.
1:  Construct the supernet by (8) and randomly initialize all parameters \bm{\Theta}=\{\{\alpha^{l,k}\}_{k=1,l=0}^{4,L-1},\bm{\theta}_E,\bm{\theta}_A,\bm{\theta}_P\}.
2:  while not converged do
3:     for every T_i\in\mathcal{T}^{\text{train}} do
4:        Calculate \bm{\Phi}_i by (1).
5:        Calculate \hat{y}_{i,j} for every v_j in \mathcal{Q}_i by (2).
6:        Calculate loss \mathcal{L}_i by (10).
7:     end for
8:     \mathcal{L}^{\text{train}}=\frac{1}{|\mathcal{T}^{\text{train}}|}\sum_{i=1}^{|\mathcal{T}^{\text{train}}|}\mathcal{L}_i
9:     Update all parameters \bm{\Theta}\leftarrow\bm{\Theta}-\beta\nabla_{\bm{\Theta}}\mathcal{L}^{\text{train}}.
10:  end while
11:  Determine the modulation structure by keeping the operations corresponding to the Top-K \alpha^{l,k} and removing the others.
12:  Construct the model with the determined modulation structure and randomly initialize all parameters \bm{\Theta}=\{\bm{\theta}_E,\bm{\theta}_A,\bm{\theta}_P\}.
13:  Train the model in the same way as Steps 2\sim10.
14:  Return: The trained model.

Benefiting from the great reduction brought by the space transformation, ColdNAS avoids the bilevel optimization of conventional differentiable architecture search (Liu et al., 2019), which optimizes the supernet w.r.t. the bilevel objective with upper-level variable \bm{\alpha} and lower-level variable \bm{\theta}:

(9) \min_{\bm{\alpha}} \mathcal{L}^{\text{val}}(\bm{\theta}^*(\bm{\alpha}),\bm{\alpha}), \text{ s.t. } \bm{\theta}^*(\bm{\alpha})=\operatorname{argmin}_{\bm{\theta}} \mathcal{L}^{\text{train}}(\bm{\theta},\bm{\alpha}),

where \mathcal{L}^{\text{val}} and \mathcal{L}^{\text{train}} represent the losses on the validation set and training set respectively. Instead, we only need to optimize the supernet w.r.t. the objective \mathcal{L}^{\text{train}}, in an end-to-end manner by episodic training. For every task T_i\in\mathcal{T}^{\text{train}}, we first input \mathcal{S}_i to the adaptation network A to generate \bm{\Phi}_i by (1), and then, for every item v_j\in\mathcal{Q}_i, we take (u_i,v_j) and \bm{\Phi}_i as input to the predictor P and make the prediction by (2). Then, we use the mean squared error (MSE) between the prediction \hat{y}_{i,j} and the true label y_{i,j} as the loss function:

(10) \mathcal{L}_i = \frac{1}{M}\sum_{j=1}^{M}(y_{i,j}-\hat{y}_{i,j})^2,

and \mathcal{L}^{\text{train}}=\frac{1}{|\mathcal{T}^{\text{train}}|}\sum_{T_i\in\mathcal{T}^{\text{train}}}\mathcal{L}_i. We update all parameters by gradient descent. Once the supernet converges, we determine all M^l jointly by keeping the operations corresponding to the Top-K largest values among the 4\times L values in \{\alpha^{l,k+1}\}_{k=1,l=0}^{4,L-1}. We then retrain the model to obtain the final user cold-start model with the searched modulation structure. During inference, a new set of tasks \mathcal{T}^{\text{test}} is given, which is disjoint from \mathcal{T}^{\text{train}}. For T_i\in\mathcal{T}^{\text{test}}, we take the whole \mathcal{S}_i and (u_i,v_j) as input to the trained model to obtain the prediction \hat{y}_{i,j} for each item v_j\in\mathcal{Q}_i.
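The episodic objective can be sketched as follows (hypothetical module interfaces; embed, adaptation_net and predictor stand for E, A and P):

    import torch

    def episode_loss(embed, adaptation_net, predictor, task):
        # Per-task loss L_i of Eq. (10): adapt on the support set, predict the query set.
        phi = adaptation_net(task.support)                     # Phi_i = A(S_i; theta_A), Eq. (1)
        preds = torch.stack([predictor(embed(task.user_id, v), phi) for v, _ in task.query])
        targets = torch.tensor([y for _, y in task.query], dtype=preds.dtype)
        return torch.mean((targets - preds) ** 2)              # MSE over the M query items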

3.4. Discussion

Existing works on search space transformation can be divided into three types. The first uses a greedy search strategy (Gao et al., 2021a). Methods of this kind explore different groups of architectures in the search space greedily. Thus, they fail to explore the full search space and can easily fall into bad local optima. The second maps the search space into a low-dimensional one, for example using an auto-encoder (Luo et al., 2018) or sparse coding (Yang et al., 2020). However, these methods do not consider special properties of the search problem, e.g., the graph structure of the supernet. Thus, the performance ranking of architectures may suffer from distortion in the low-dimensional space. ColdNAS belongs to the third type, which explores architecture equivalence in the search space. The basic idea is that if we can find a group of architectures that are equivalent with each other, then evaluating any one of them is enough for all architectures in the same group. This type of method is problem-specific. For example, perturbation equivalence for matrices is explored in (Zhang et al., 2023), and morphism of networks is considered in (Jin et al., 2019). The search space of ColdNAS is designed for modulation structures in the cold-start problem, which has not been explored before. We theoretically prove the equivalence between the original space and the transformed space, which significantly reduces the space size (Remark 2). Other methods for general space reduction cannot achieve that.

Table 2. Test performance (%) obtained on benchmark datasets. The best results are highlighted in bold and the second-best are underlined. For MSE and MAE, smaller values are better. For nDCG_3 and nDCG_5, larger values are better.
Dataset | Metric | DropoutNet | MeLU | MetaCS | MetaHIN | MAMO | TaNP | ColdNAS-Fixed | ColdNAS
MovieLens | MSE | 100.90 (0.70) | 95.02 (0.03) | 95.05 (0.04) | 91.89 (0.06) | 90.20 (0.22) | \underline{89.11} (0.18) | 91.05 (0.13) | \textbf{87.96} (0.12)
MovieLens | MAE | 85.71 (0.48) | 77.38 (0.25) | 77.42 (0.26) | 75.79 (0.27) | 75.34 (0.26) | \underline{74.78} (0.14) | 75.65 (0.30) | \textbf{74.29} (0.20)
MovieLens | nDCG_3 | 69.21 (0.76) | 74.43 (0.59) | 74.46 (0.78) | 74.69 (0.32) | 74.95 (0.13) | \underline{75.60} (0.07) | 75.11 (0.09) | \textbf{76.16} (0.03)
MovieLens | nDCG_5 | 68.43 (0.48) | 73.52 (0.41) | 73.45 (0.56) | 73.63 (0.22) | 73.84 (0.16) | \underline{74.29} (0.12) | 73.89 (0.12) | \textbf{74.74} (0.09)
BookCrossing | MSE | 15.38 (0.23) | 15.15 (0.02) | 15.20 (0.08) | 14.76 (0.07) | 14.82 (0.05) | 14.75 (0.05) | \underline{14.44} (0.16) | \textbf{14.15} (0.08)
BookCrossing | MAE | 3.75 (0.01) | 3.68 (0.01) | 3.66 (0.01) | 3.50 (0.01) | 3.51 (0.02) | \underline{3.48} (0.01) | 3.49 (0.02) | \textbf{3.40} (0.01)
BookCrossing | nDCG_3 | 77.66 (0.18) | \underline{77.69} (0.15) | 77.68 (0.12) | 77.66 (0.19) | 77.68 (0.09) | 77.48 (0.06) | 77.65 (0.09) | \textbf{77.83} (0.01)
BookCrossing | nDCG_5 | 80.87 (0.15) | 81.10 (0.15) | 80.97 (0.09) | 80.95 (0.04) | 81.01 (0.05) | \underline{81.16} (0.21) | 81.12 (0.06) | \textbf{81.32} (0.10)
Last.fm | MSE | 21.91 (0.38) | 21.69 (0.34) | 21.68 (0.12) | \underline{21.43} (0.23) | 21.64 (0.10) | 21.58 (0.20) | 21.62 (0.16) | \textbf{20.91} (0.05)
Last.fm | MAE | 43.02 (0.52) | 42.28 (1.21) | 42.28 (0.76) | \underline{42.07} (0.49) | 42.30 (0.28) | 42.15 (0.56) | 42.32 (0.34) | \textbf{41.78} (0.24)
Last.fm | nDCG_3 | 75.13 (0.48) | 80.15 (2.09) | 80.81 (0.97) | \underline{82.01} (0.56) | 80.73 (0.80) | 81.03 (0.36) | 80.77 (0.32) | \textbf{82.80} (0.69)
Last.fm | nDCG_5 | 69.03 (0.31) | 75.03 (0.68) | 75.01 (0.64) | \underline{75.98} (0.33) | 75.45 (0.29) | \underline{75.98} (0.41) | 75.48 (0.21) | \textbf{76.77} (0.10)
Table 3. Modulation structure with the Top-4 operations searched on the three benchmark datasets respectively. We also show the modulation structure of ColdNAS-Fixed, which is the same regardless of the dataset used.
Dataset | M^0 | M^1 | M^2 | M^3
MovieLens | \min(\max(\bm{h}^0,\bm{\phi}_i^{0,1}),\bm{\phi}_i^{0,2})+\bm{\phi}_i^{0,3} | \bm{h}^1+\bm{\phi}_i^{1,1} | \bm{h}^2 | \bm{h}^3
BookCrossing | \min(\bm{h}^0,\bm{\phi}_i^{0,1}) | \bm{h}^1+\bm{\phi}_i^{1,1} | \bm{h}^2\odot\bm{\phi}_i^{2,1}+\bm{\phi}_i^{2,2} | \bm{h}^3
Last.fm | \bm{h}^0+\bm{\phi}_i^{0,1} | \bm{h}^1+\bm{\phi}_i^{1,1} | \max(\bm{h}^2,\bm{\phi}_i^{2,1})+\bm{\phi}_i^{2,2} | \bm{h}^3
ColdNAS-Fixed | \bm{h}^0\odot\bm{\phi}_i^{0,1}+\bm{\phi}_i^{0,2} | \bm{h}^1\odot\bm{\phi}_i^{1,1}+\bm{\phi}_i^{1,2} | \bm{h}^2\odot\bm{\phi}_i^{2,1}+\bm{\phi}_i^{2,2} | \bm{h}^3\odot\bm{\phi}_i^{3,1}+\bm{\phi}_i^{3,2}

4. Experiments

We perform experiments on three benchmark datasets with the aim to answer the following research questions:

  • RQ1: What is the modulation structure selected by ColdNAS and how does ColdNAS perform in comparison with the state-of-the-art cold-start models?

  • RQ2: How can we understand the search space and algorithm of ColdNAS?

  • RQ3: How do hyperparameters affect ColdNAS?

Results are averaged over five runs.

4.1. Datasets

We use three benchmark datasets (Table 4): (i) MovieLens (Harper and Konstan, 2015): a dataset containing 1 million movie ratings from users collected from MovieLens, whose features include gender, age, occupation, zip code, publication year, rate, genre, director and actor; (ii) BookCrossing (Ziegler et al., 2005): a collection of users' ratings on books in the BookCrossing community, whose features include age, location, publication year, author, and publisher; and (iii) Last.fm: a collection of users' listening counts of artists from the Last.fm online system, whose features only consist of user and item IDs. Following Lin et al. (2021), we generate negative samples for the query sets in Last.fm.

Table 4. Summary of datasets used in this paper.
Dataset # User (Cold) # Item # Rating # User Feat. # Item Feat.
MovieLens 6040 (52.3%) 3706 1000209 4 5
BookCrossing 278858 (18.6%) 271379 1149780 2 3
Last.fm 1872 (15.3%) 3846 42346 1 1

Data Split

Following Lin et al. (2021), the ratio of \mathcal{T}^{\text{train}}:\mathcal{T}^{\text{val}}:\mathcal{T}^{\text{test}} is set as 7:1:2. \mathcal{T}^{\text{val}} is used to judge the convergence of the supernet. \mathcal{T}^{\text{train}}, \mathcal{T}^{\text{val}} and \mathcal{T}^{\text{test}} contain no overlapping users. For MovieLens and Last.fm, we keep any user whose interaction history length lies in [40,200]. Each support set contains N=20 randomly selected interactions of a user, and the query set contains the rest of the interactions of the same user. As for BookCrossing, which has a severe long-tail distribution of user-item interactions, we put any user whose interaction history length lies in [50,1000) into \mathcal{T}^{\text{train}}. Then, we divide users with interaction history length in [2,50) into 70%, 10% and 20% splits, which are put in \mathcal{T}^{\text{train}}, \mathcal{T}^{\text{val}} and \mathcal{T}^{\text{test}} respectively. The proportion of cold users in each dataset is also shown in Table 4. Then, we randomly select half of each user's interaction history as the support set and take the rest as the query set.

Evaluation Metric

Following (Lee et al., 2019; Pang et al., 2022), we evaluate the performance by mean absolute error (MAE), mean squared error (MSE), and normalized discounted cumulative gain nDCG_3 and nDCG_5. MAE and MSE evaluate the numerical gap between the prediction and the ground-truth rating; lower values are better. For nDCG_3 and nDCG_5, which represent the ratio between the discounted cumulative gain of the predicted item list and that of the ground-truth list, higher values are better.
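For reference, a common way to compute nDCG_k is sketched below (our own sketch; the paper's exact gain/discount variant may differ):

    import numpy as np

    def ndcg_at_k(y_true, y_pred, k=3):
        # DCG of the top-k items ranked by predicted rating, normalized by the ideal DCG.
        order = np.argsort(y_pred)[::-1][:k]            # ranking induced by predictions
        ideal = np.sort(y_true)[::-1][:k]               # ideal ranking by true ratings
        discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1 / log2(rank + 1)
        dcg = np.sum(np.asarray(y_true)[order] * discounts)
        idcg = np.sum(ideal * discounts)
        return dcg / idcg if idcg > 0 else 0.0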

4.2. Performance Comparison (RQ1)

(a) MovieLens. (b) BookCrossing. (c) Last.fm.
Figure 3. Testing MSE vs clock time of different search strategies. ColdNAS, ColdNAS-Bilevel and Random (T) operate on the transformed space, while Random operates on the original space with C=4 in every M^l.

We compare ColdNAS with the following representative user cold-start methods: (i) the traditional deep cold-start model DropoutNet (Volkovs et al., 2017) and (ii) few-shot learning (FSL) based methods, including MeLU (Lee et al., 2019), MetaCS (Bharadhwaj, 2019), MetaHIN (Lu et al., 2020), MAMO (Dong et al., 2020), and TaNP (Lin et al., 2021). We run the public codes provided by the respective authors. PNMTA (Pang et al., 2022), CMML (Feng et al., 2021), PAML (Wang et al., 2021) and REG-PAML (Yu et al., 2021) are not compared due to the lack of public codes. We choose a 4-layer predictor; more details of our model and parameter settings are provided in Appendix C.1. We also compare with a variant of ColdNAS called ColdNAS-Fixed, which uses the fixed FiLM function in (5) at every layer rather than our searched modulation function.

Table 2 shows the overall user cold-start recommendation performance of all methods. We can see that ColdNAS significantly outperforms the others on all datasets and metrics. Among the compared baselines, DropoutNet performs the worst, as it is not a few-shot learning method and thus has no ability to adapt to different users. Among meta-learning based methods, MeLU, MetaCS, MetaHIN and MAMO adopt the gradient-based meta-learning strategy, which may suffer from overfitting during local updates. In contrast, TaNP and ColdNAS learn to generate user-specific parameters to guide the adaptation. TaNP uses a fixed modulation structure which may not be optimal for different datasets, while ColdNAS automatically finds the optimal structure. Further, the consistent performance gain of ColdNAS over ColdNAS-Fixed validates the necessity of searching the modulation structure to fit each dataset instead of using a fixed one.

Table 3 shows the searched modulation structures. We find that the searched modulation structures differ across the datasets. No modulation is selected at the last layer, as directly modulating the output with user-specific preference possibly leads to overfitting. Note that all the operations in the transformed search space have been selected at least once, which means all of them are helpful.

(a) MovieLens. (b) BookCrossing. (c) Last.fm.
Figure 4. Testing MSE vs clock time on various search spaces: the transformed space, and original spaces with different C.

In terms of time consumption, compared with other cold-start models, ColdNAS only additionally optimizes a supernet with the same objective and a comparable size. Table 5 shows the clock time taken by ColdNAS and TaNP, which obtains the second-best results in Table 2. As can be observed, the search in ColdNAS is very efficient, as it is only slightly more expensive than retraining.

Table 5. Clock time taken by ColdNAS and TaNP.
Clock time (min) MovieLens BookCrossing Last.fm
TaNP 15.5 44.2 4.1
ColdNAS Search 16.2 45.5 3.9
Retrain 12.7 35.5 3.5

4.3. Searching in ColdNAS (RQ2)

4.3.1. Choice of Search Strategy

In ColdNAS, we optimize (10) by gradient descent. In this section, we compare ColdNAS with other strategies to search the modulation structure: Random (C=4), which conducts random search (Bergstra and Bengio, 2012) on the original space with operation number C=4 in each M^l; Random (T), which conducts random search (Bergstra and Bengio, 2012) on the transformed space; and ColdNAS-Bilevel, which optimizes the supernet with the bilevel objective (9). For random search, we record its training and evaluation time. For ColdNAS and ColdNAS-Bilevel, we sample the architecture parameters of the supernet and retrain from scratch, and exclude the time for retraining and evaluation. Figure 3 shows the performance of the best searched architecture. One can observe that searching on the transformed space brings higher efficiency and accuracy, comparing Random (T) to Random (C=4). In addition, differentiable search on the supernet structure brings further efficiency, as both ColdNAS and ColdNAS-Bilevel outperform random search. ColdNAS and ColdNAS-Bilevel converge to similar testing MSE, but ColdNAS is faster. This validates the efficacy of directly using gradient descent on all parameters in ColdNAS.

(a) Varying K in Top-K. (b) Varying the number of layers of the predictor. (c) Varying the support set size during inference.
Figure 5. Model sensitivity analysis of ColdNAS on Last.fm.

4.3.2. Necessity of Search Space Transformation

Having chosen the search algorithm, we now examine the necessity of the search space transformation in terms of time and performance. As shown in Table 1, a larger C, i.e., more operations in (6), leads to a larger original space, and the space reduction ratio grows exponentially with the original space size. Figure 4 plots the testing MSE vs clock time of random search on original spaces of different sizes (Random (C=c)), random search on the transformed space (Random (T)), and our ColdNAS. When C is small, the modulation functions are not flexible enough, though it is easy to find the optimal candidate in this small search space. When C is large, the search space is large, and random search becomes too time-consuming to find a good modulation structure. In contrast, the transformed search space has a consistently small size, and is theoretically proved to be equivalent to the original space for any C. As can be seen, both Random (T) and ColdNAS find good structures more effectively than Random (C=4), and ColdNAS searches the fastest via differentiable search.

4.3.3. Understanding Proposition 3.1

Here, we first show that the assumption that the adaptation network A is expressive enough can be easily satisfied. Generally, deeper networks are more expressive. Figure 6 plots the effect of changing the depth of the neural network in A on Last.fm. As shown in Figure 6 (a-d), although different numbers of layers in A are used, the searched modulation operations are the same. Consequently, the performance difference is small, as shown in Figure 6 (e). Thus, we use 2 layers, which is expressive enough and has a smaller parameter size.

Further, we validate the inner-group consistency and permutation-invariance properties proved in Proposition 3.1. Comparing Figure 6 (a) with Figure 7 (a-b), one can see that changing + to - and \odot to / obtains the same results, which validates inner-group consistency. Finally, comparing Figure 6 (a) with Figure 7 (c), one can observe that inter-group permutation invariance also holds.

(a) 2 layers (chosen). (b) 3 layers. (c) 4 layers. (d) 5 layers. (e) Performance.
Figure 6. Effect of changing the depth of the adaptation network on Last.fm. Panels (a-d) visualize the \{\alpha^{l,k}\} obtained with varying depth, and panel (e) shows the corresponding testing MSE.
(a) Changing + to -. (b) Changing \odot to /. (c) Permutated.
Figure 7. Visualization of the \{\alpha^{l,k}\} obtained on Last.fm, where operations and orders are changed.

4.4. Sensitivity Analysis (RQ3)

Finally, we conduct sensitivity analysis of ColdNAS on Last.fm.

Effects of K.

Recall that we keep the modulation operations corresponding to the Top-K largest \alpha^{l,k} in the modulation structure. Figure 5(a) plots the effect of changing K. As shown, K cannot be too small, otherwise the model is unable to capture enough user-specific preference; nor too large, otherwise the model can overfit to the limited interaction history of cold-start users.

Effect of the Depth of Predictor.

Figure 5(b) plots the effect of using predictors with different numbers of layers L. One can observe that choosing different L within a certain range has little impact on the performance, while L=4 is already enough to obtain the state-of-the-art results shown in Table 2.

Effects of Support Set Size.

Figure 5(c) shows the performance for users with different lengths of interaction history. During inference, we randomly sample only a portion of interactions from the original support set \mathcal{S}_i of each T_i in \mathcal{T}^{\text{test}}. As can be seen, predictions become more accurate when more interaction history is given, which may alleviate the bias in representing the user-specific preference.

5. Conclusion

We propose a modulation framework called ColdNAS for user cold-start recommendation. In particular, we use a hypernetwork to map each user's interaction history to user-specific parameters, which are then used to modulate the predictor. We design a search space over modulation functions and positions, which not only covers existing modulation-based models but also has the ability to find more effective structures. We theoretically prove that the space can be transformed to a smaller space, where we can search for the modulation structure efficiently and robustly. Extensive experiments show that ColdNAS performs the best on benchmark datasets. Besides, ColdNAS can efficiently find a proper modulation structure for different data, which makes it easy to deploy in recommendation systems.

Acknowledgment

Q. Yao was in part supported by NSFC (No. 92270106) and CCF-Baidu Open Fund.

References

  • Baker et al. (2017) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. 2017. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations.
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 2 (2012).
  • Bharadhwaj (2019) Homanga Bharadhwaj. 2019. Meta-learning for user cold-start recommendation. In International Joint Conference on Neural Networks. 1–8.
  • Brockschmidt (2020) Marc Brockschmidt. 2020. GNN-FiLM: Graph neural networks with feature-wise linear modulation. In International Conference on Machine Learning. 1144–1152.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Workshop on Deep Learning for Recommender Systems. 7–10.
  • Dong et al. (2020) Manqing Dong, Feng Yuan, Lina Yao, Xiwei Xu, and Liming Zhu. 2020. MAMO: Memory-augmented meta-optimization for cold-start recommendation. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 688–697.
  • Feng et al. (2021) Xidong Feng, Chen Chen, Dong Li, Mengchen Zhao, Jianye Hao, and Jun Wang. 2021. CMML: Contextual modulation meta learning for cold-start recommendation. In ACM International Conference on Information and Knowledge Management. 484–493.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. 1126–1135.
  • Gao et al. (2021a) Chen Gao, Yinfeng Li, Quanming Yao, Depeng Jin, and Yong Li. 2021a. Progressive feature interaction search for deep sparse network. In Advances in Neural Information Processing Systems, Vol. 34. 392–403.
  • Gao et al. (2021b) Chen Gao, Quanming Yao, Depeng Jin, and Yong Li. 2021b. Efficient Data-specific Model Search for Collaborative Filtering. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 415–425.
  • Ha et al. (2017) David Ha, Andrew Dai, and Quoc V Le. 2017. Hypernetworks. In International Conference on Learning Representations.
  • Harper and Konstan (2015) F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5, 4 (2015), 1–19.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In International Conference on World Wide Web. 173–182.
  • Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
  • Hsieh et al. (2017) Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative metric learning. In International Conference on World Wide Web. 193–201.
  • Hutter et al. (2019) Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated machine learning: methods, systems, challenges. Springer Nature.
  • Jin et al. (2019) Haifeng Jin, Qingquan Song, and Xia Hu. 2019. Auto-Keras: An efficient neural architecture search system. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1946–1956.
  • Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 426–434.
  • Lee et al. (2019) Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-learned user preference estimator for cold-start recommendation. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1073–1082.
  • Lin et al. (2013) Jovian Lin, Kazunari Sugiyama, Min-Yen Kan, and Tat-Seng Chua. 2013. Addressing cold-start in app recommendation: latent user models constructed from twitter followers. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 283–292.
  • Lin et al. (2021) Xixun Lin, Jia Wu, Chuan Zhou, Shirui Pan, Yanan Cao, and Bin Wang. 2021. Task-adaptive neural process for user cold-start recommendation. In The Web Conference. 1306–1316.
  • Liu et al. (2020) Bin Liu, Chenxu Zhu, Guilin Li, Weinan Zhang, Jincai Lai, Ruiming Tang, Xiuqiang He, Zhenguo Li, and Yong Yu. 2020. AutoFIS: Automatic feature interaction selection in factorization models for click-through rate prediction. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2636–2645.
  • Liu et al. (2019) Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In International Conference on Learning Representations.
  • Lu et al. (2020) Yuanfu Lu, Yuan Fang, and Chuan Shi. 2020. Meta-learning on heterogeneous information networks for cold-start recommendation. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1563–1573.
  • Luo et al. (2018) Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2018. Neural architecture optimization. In Advances in Neural Information Processing Systems, Vol. 31. 7827–7838.
  • Pang et al. (2022) Haoyu Pang, Fausto Giunchiglia, Ximing Li, Renchu Guan, and Xiaoyue Feng. 2022. PNMTA: A pretrained network modulation and task adaptation approach for user cold-start recommendation. In The Web Conference. 348–359.
  • Park and Tuzhilin (2008) Yoon-Joo Park and Alexander Tuzhilin. 2008. The long tail of recommender systems and how to leverage it. In ACM Conference on Recommender Systems. 11–18.
  • Perez et al. (2018) Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence, Vol. 32. 3942–3951.
  • Pham et al. (2018) Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning. 4095–4104.
  • Real et al. (2019) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.
  • Requeima et al. (2019) James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and Richard E Turner. 2019. Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems, Vol. 32. 7957–7968.
  • Resnick and Varian (1997) Paul Resnick and Hal R Varian. 1997. Recommender systems. Commun. ACM 40, 3 (1997), 56–58.
  • Schein et al. (2002) Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock. 2002. Methods and metrics for cold-start recommendations. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 253–260.
  • Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In International Conference on World Wide Web. 111–112.
  • Song et al. (2020) Qingquan Song, Dehua Cheng, Hanning Zhou, Jiyan Yang, Yuandong Tian, and Xia Hu. 2020. Towards automated neural interaction discovery for click-through rate prediction. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 945–955.
  • Volkovs et al. (2017) Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. DropoutNet: Addressing cold start in recommender systems. In Advances in Neural Information Processing Systems, Vol. 30. 4957–4966.
  • Wang et al. (2021) Li Wang, Binbin Jin, Zhenya Huang, Hongke Zhao, Defu Lian, Qi Liu, and Enhong Chen. 2021. Preference-adaptive meta-learning for cold-start recommendation.. In International Joint Conference on Artificial Intelligence. 1607–1614.
  • Wang et al. (2020) Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on few-shot learning. Comput. Surveys 53, 3 (2020), 1–34.
  • Xie et al. (2021) Yuexiang Xie, Zhen Wang, Yaliang Li, Bolin Ding, Nezihe Merve Gürel, Ce Zhang, Minlie Huang, Wei Lin, and Jingren Zhou. 2021. FIVES: Feature interaction via edge search for large-scale tabular data. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3795–3805.
  • Yang et al. (2020) Yibo Yang, Hongyang Li, Shan You, Fei Wang, Chen Qian, and Zhouchen Lin. 2020. ISTA-NAS: Efficient and consistent neural architecture search by sparse coding. In Advances in Neural Information Processing Systems, Vol. 33. 10503–10513.
  • Yao et al. (2020a) Quanming Yao, Xiangning Chen, James T Kwok, Yong Li, and Cho-Jui Hsieh. 2020a. Efficient neural interaction function search for collaborative filtering. In The Web Conference. 1660–1670.
  • Yao et al. (2020b) Quanming Yao, Ju Xu, Wei-Wei Tu, and Zhanxing Zhu. 2020b. Efficient neural architecture search via proximal iterations. In AAAI Conference on Artificial Intelligence, Vol. 34. 6664–6671.
  • Yu et al. (2019) Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. 2019. Evaluating the search phase of neural architecture search. In International Conference on Learning Representations.
  • Yu et al. (2021) Runsheng Yu, Yu Gong, Xu He, Bo An, Yu Zhu, Qingwen Liu, and Wenwu Ou. 2021. Personalized adaptive meta learning for cold-start user preference prediction. In AAAI Conference on Artificial Intelligence, Vol. 35. 10772–10780.
  • Zhang et al. (2023) Yongqi Zhang, Quanming Yao, and James T Kwok. 2023. Bilinear scoring function search for knowledge graph learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 2 (2023), 1458–1473.
  • Zhao et al. (2021) Xiangyu Zhao, Haochen Liu, Wenqi Fan, Hui Liu, Jiliang Tang, and Chong Wang. 2021. AutoLoss: Automated loss function search in recommendations. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3959–3967.
  • Ziegler et al. (2005) Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005. Improving recommendation lists through topic diversification. In International Conference on World Wide Web. 22–32.

Appendix A Details of Model Structure

As mentioned in Section 3.2, existing models (Dong et al., 2020; Lin et al., 2021; Pang et al., 2022) share an architecture consisting of three components: an embedding layer E, an adaptation network A, and a predictor P. Here we provide the details of these components as used in ColdNAS.

A.1. Embedding Layer

We first embed user and item one-hot categorical features into dense vectors through the embedding layer. Taking $u_i$ for example, we generate a content embedding for each categorical content feature and concatenate them together to obtain the initial user embedding. Given $B$ user contents, the embedding is defined as:

(11) $\bm{u}_i = [\,\bm{W}^{1}_{E}\bm{c}_{i}^{1} \mid \bm{W}^{2}_{E}\bm{c}_{i}^{2} \mid \cdots \mid \bm{W}^{B}_{E}\bm{c}_{i}^{B}\,]$

where $[\,\cdot \mid \cdot\,]$ is the concatenation operation, $\bm{c}_{i}^{b}$ is the one-hot vector of the $b$th categorical content of $u_i$, and $\bm{W}^{b}_{E}$ represents the embedding matrix of the corresponding feature in the shared user feature space. The parameters of the embedding layer are collectively denoted as $\bm{\theta}_{E}=\{\bm{W}^{b}_{E}\}_{b=1}^{B}$.
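
For illustration, a minimal PyTorch-style sketch of (11) is given below; the class name UserContentEmbedding, the batched input format, and the use of index lookups in place of explicit one-hot multiplications are our own assumptions rather than the released implementation, while the embedding size of 32 follows Appendix C.1.1.

```python
import torch
import torch.nn as nn


class UserContentEmbedding(nn.Module):
    """Sketch of the embedding layer in (11): one embedding matrix W_E^b per
    categorical content feature, concatenated into the initial user embedding."""

    def __init__(self, vocab_sizes, dim=32):
        # vocab_sizes[b] is the number of categories of the b-th content feature
        super().__init__()
        self.tables = nn.ModuleList([nn.Embedding(v, dim) for v in vocab_sizes])

    def forward(self, feat_ids):
        # feat_ids: LongTensor of shape (batch, B) holding category indices;
        # an index lookup is equivalent to multiplying W_E^b by the one-hot c_i^b
        return torch.cat(
            [emb(feat_ids[:, b]) for b, emb in enumerate(self.tables)], dim=-1
        )
```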

A.2. Adaptation Network

We first use a fully connected (FC) layer to obtain the hidden representation $\bm{r}_{i,j}$ of each interaction history, which is calculated as

(12) $\bm{r}_{i,j}=\text{ReLU}(\bm{W}^{1}_{A}[\,\bm{u}_{i}\mid\bm{v}_{j}\mid y_{i,j}\,]+\bm{b}^{1}_{A}).$

We use mean pooling to aggregate the interactions in $\mathcal{S}_{i}$ as the task context information of $u_i$, i.e., the user preference, $\bm{c}_{i}=\frac{1}{N}\sum_{j=1}^{N}\bm{r}_{i,j}$. Then, we use another FC layer to generate the user-specific parameters as $\bm{\phi}_{i}=\bm{W}^{2}_{A}\bm{c}_{i}+\bm{b}^{2}_{A}$. The parameters of the adaptation network are denoted as $\bm{\theta}_{A}=\{\bm{W}^{1}_{A},\bm{b}^{1}_{A},\bm{W}^{2}_{A},\bm{b}^{2}_{A}\}$.
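
A matching sketch of the adaptation network is shown below, again PyTorch-style and only illustrative; the layer names and tensor shapes are assumptions rather than the released code.

```python
import torch
import torch.nn as nn


class AdaptationNetwork(nn.Module):
    """Sketch of (12) plus the pooling step: encode each support interaction,
    mean-pool into the task context c_i, and map c_i to the user-specific
    parameters phi_i."""

    def __init__(self, in_dim, r_dim, phi_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, r_dim)   # W_A^1, b_A^1
        self.fc2 = nn.Linear(r_dim, phi_dim)  # W_A^2, b_A^2

    def forward(self, u, v, y):
        # u: (N, d_u), v: (N, d_v), y: (N, 1) for the N support interactions
        r = torch.relu(self.fc1(torch.cat([u, v, y], dim=-1)))  # r_{i,j}
        c = r.mean(dim=0)                                       # c_i
        return self.fc2(c)                                      # phi_i
```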

A.3. Predictor

We use an $L$-layer MLP as the predictor. Denote the output of the $l$th layer of hidden units as $\bm{h}^{l}$ and the modulation function at the $l$th layer as $M^{l}$. We obtain the prediction by

$\bm{h}^{l}_{i} = M^{l}(\bm{h}^{l},\bm{\Phi}_{i}),$
(13) $\bm{h}^{l+1} = \text{ReLU}(\bm{W}^{l}_{P}\bm{h}_{i}^{l}+\bm{b}^{l}_{P}),$

where $l\in\{0,1,\cdots,L-1\}$, $\bm{h}^{0}=[\,\bm{u}_{i}\mid\bm{v}_{j}\,]$, and $\hat{y}_{i,j}=\bm{h}^{L}$. The parameters of the predictor are denoted as $\bm{\theta}_{P}=\{\bm{W}^{l}_{P},\bm{b}^{l}_{P}\}_{l=0}^{L-1}$.
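
The modulated predictor of (13) can be sketched as follows. Since ColdNAS searches both the modulation function and its position, the scale-and-shift used in modulate below is only a placeholder for the searched $M^{l}$; the hidden sizes follow Appendix C.1.1, and everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn


class ModulatedPredictor(nn.Module):
    """Sketch of (13): an L-layer MLP whose hidden units are modulated by the
    user-specific parameters before each linear layer."""

    def __init__(self, in_dim, hidden=(128, 64, 32), out_dim=1):
        super().__init__()
        dims = (in_dim,) + tuple(hidden) + (out_dim,)
        self.layers = nn.ModuleList(
            [nn.Linear(dims[l], dims[l + 1]) for l in range(len(dims) - 1)]
        )

    @staticmethod
    def modulate(h, phi):
        # Placeholder for the searched modulation M^l: a FiLM-style
        # scale-and-shift, used here only for illustration.
        gamma, beta = phi
        return h * gamma + beta

    def forward(self, u, v, phis):
        # phis[l] holds the user-specific parameters fed to layer l
        h = torch.cat([u, v], dim=-1)            # h^0 = [u_i | v_j]
        for l, layer in enumerate(self.layers):
            h_mod = self.modulate(h, phis[l])    # h_i^l = M^l(h^l, Phi_i)
            h = torch.relu(layer(h_mod))         # h^{l+1}
        return h                                 # y_hat_{i,j} = h^L
```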

Appendix B Proof of the Proposition 3.1

Proof.

To prove Proposition 3.1, we need to use the two conditions below:

Condition 1: If $\bm{\phi}_{i}^{k}$ is the input of operation $\odot$ or $/$, then every element in $\bm{\phi}_{i}^{k}$ is non-negative. We implement this with a ReLU function.

Condition 2: The adaptation network $A$ is expressive enough that it can approximate $\hat{\bm{\phi}}_{i}=\bm{\phi}^{p}_{i}\circ_{\text{op}^{1}}\bm{\phi}^{q}_{i}$, where $\circ_{\text{op}^{1}}\in\mathcal{O}$ and $\bm{\phi}^{p}_{i},\bm{\phi}^{q}_{i}\in\bm{\Phi}_{i}$ are learned from the data. This naturally holds due to the assumption.

We divide the operations $\circ_{\text{op}}\in\mathcal{O}$ into four groups: $G_{1}=\{\max\}$, $G_{2}=\{\min\}$, $G_{3}=\{+,-\}$, $G_{4}=\{\odot,/\}$. We can prove the two important properties below.

Property 1: Inner-group consistency.

We can choose an operation in each group as the group operation $\circ_{g_{i}}$, e.g., $\circ_{g_{1}}=\max$, $\circ_{g_{2}}=\min$, $\circ_{g_{3}}=+$, $\circ_{g_{4}}=\odot$. Then any number of successive operations that belong to the same group $G_{i}$ can be merged into one operation $\circ_{g_{i}}$, i.e.,

$\bm{x}\circ_{\text{op}^{k}}\bm{\phi}^{k}_{i}\circ_{\text{op}^{k+1}}\bm{\phi}^{k+1}_{i}\cdots\circ_{\text{op}^{k+m}}\bm{\phi}^{k+m}_{i}=\bm{x}\circ_{g_{i}}\hat{\bm{\phi}}_{i}^{k},$
$\text{s.t.}\quad\circ_{\text{op}^{k}},\circ_{\text{op}^{k+1}},\cdots,\circ_{\text{op}^{k+m}}\in G_{i}.$

It is trivial to prove with condition 2, e.g., $\bm{x}+\bm{\phi}_{i}^{1}-\bm{\phi}_{i}^{2}+\bm{\phi}_{i}^{3}=\bm{x}+\hat{\bm{\phi}}_{i}^{1}$, where $\hat{\bm{\phi}}_{i}^{1}=\bm{\phi}_{i}^{1}-\bm{\phi}_{i}^{2}+\bm{\phi}_{i}^{3}$.

Property 2: Inter-group permutation-invariance.

The operations in different groups are permutation-invariant, i.e.,

$\bm{x}\circ_{g_{i}}\bm{\phi}^{k}_{i}\circ_{g_{j}}\bm{\phi}^{k+1}_{i}=\bm{x}\circ_{g_{j}}\hat{\bm{\phi}}^{k}_{i}\circ_{g_{i}}\hat{\bm{\phi}}^{k+1}_{i}.$

It is also trivial to prove with condition 1 and condition 2, e.g.,

$\max(\bm{x},\,\bm{\phi}_{i}^{1})+\bm{\phi}_{i}^{2}=\max(\bm{x}+\hat{\bm{\phi}}_{i}^{1},\,\hat{\bm{\phi}}_{i}^{2}),$

where $\hat{\bm{\phi}}_{i}^{1}=\bm{\phi}_{i}^{2}$ and $\hat{\bm{\phi}}_{i}^{2}=\bm{\phi}_{i}^{1}+\bm{\phi}_{i}^{2}$.

With the above two properties, we can recurrently commute operations until all operations belonging to the same group are gathered together, and then merge the operations within each of the four groups. Therefore,

$\bm{h}^{l}_{i}=\bm{h}^{l}\circ_{\text{op}^{1}}\bm{\phi}_{i}^{1}\circ_{\text{op}^{2}}\bm{\phi}_{i}^{2}\cdots\circ_{\text{op}^{C}}\bm{\phi}_{i}^{C}$

is equivalent to

$\bm{h}^{l}_{i}=\bm{h}^{l}\circ_{g_{1}}\hat{\bm{\phi}}_{i}^{1}\circ_{g_{2}}\hat{\bm{\phi}}_{i}^{2}\circ_{g_{3}}\hat{\bm{\phi}}_{i}^{3}\circ_{g_{4}}\hat{\bm{\phi}}_{i}^{4},$

and the four group operations are permutation-invariant.

Note that since the identity element (0 or 1) of each group operation can be learned from condition 2, a modulation function that does not cover all four groups can also be represented in the same form. ∎
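
As a purely illustrative sanity check (not a substitute for the proof), both properties can be verified numerically on random non-negative tensors:

```python
import torch

x, p1, p2, p3 = (torch.rand(5) for _ in range(4))  # random non-negative vectors

# Property 1: successive same-group operations merge into one group operation,
# e.g., x + p1 - p2 + p3 = x + (p1 - p2 + p3).
assert torch.allclose(x + p1 - p2 + p3, x + (p1 - p2 + p3))

# Property 2: operations from different groups commute after re-parameterizing
# the user-specific vectors, e.g., max(x, p1) + p2 = max(x + p2, p1 + p2).
assert torch.allclose(torch.maximum(x, p1) + p2, torch.maximum(x + p2, p1 + p2))
```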

Appendix C Experiments

C.1. Experiment Setting

Experiments were conducted on a 24GB NVIDIA GeForce RTX 3090 GPU, with Python 3.7.0 and CUDA 11.6.

C.1.1. Hyperparameter Setting

For existing methods, we tune hyperparameters on $\mathcal{T}^{\text{val}}$ via grid search. In ColdNAS, the batch size is 32, and $\bm{W}^{b}_{E}$ in (11) has size $32\times|\bm{c}_{i}^{b}|$, where $|\bm{c}_{i}^{b}|$ is the length of $\bm{c}_{i}^{b}$. The dimensions of the hidden units in (13) are set as $\bm{h}^{1}=128$, $\bm{h}^{2}=64$, $\bm{h}^{3}=32$ for all three datasets. The learning rate $\beta$ is chosen from $\{5\times10^{-6},1\times10^{-5},5\times10^{-5},1\times10^{-4}\}$, and the dimension of $\bm{r}_{i,j}$ in (12) is chosen from $\{128,256,512,1024\}$. The final hyperparameters chosen on the three benchmark datasets are shown in Table 6.

Table 6. The final hyperparameters chosen in ColdNAS.

| Dataset | MovieLens | BookCrossing | Last.fm |
| $\beta$ | $5\times10^{-5}$ | $5\times10^{-6}$ | $1\times10^{-4}$ |
| dimension of $\bm{r}_{i,j}$ | 1024 | 512 | 256 |
| $\bm{W}^{1}_{A}$ | $1024\times289$ | $512\times161$ | $256\times65$ |
| $\bm{W}^{2}_{A}$ | $2052\times1024$ | $1540\times512$ | $1156\times256$ |
| $\bm{b}^{1}_{A}$ | 1024 | 512 | 256 |
| $\bm{b}^{2}_{A}$ | 2052 | 1540 | 1156 |
| $\bm{W}^{0}_{P}$ | $128\times289$ | $128\times161$ | $128\times65$ |
| $\bm{W}^{1}_{P}$ | $64\times128$ | $64\times128$ | $64\times128$ |
| $\bm{W}^{2}_{P}$ | $32\times64$ | $32\times64$ | $32\times64$ |
| $\bm{W}^{3}_{P}$ | $1\times32$ | $1\times32$ | $1\times32$ |
| $\bm{b}^{0}_{P}$ | 128 | 128 | 128 |
| $\bm{b}^{1}_{P}$ | 64 | 64 | 64 |
| $\bm{b}^{2}_{P}$ | 32 | 32 | 32 |
| $\bm{b}^{3}_{P}$ | 1 | 1 | 1 |
| maximum iteration number | 50 | 50 | 100 |
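
The grid search described above can be sketched as follows; validation_mse is a hypothetical stand-in for training ColdNAS with a given configuration and measuring its error on $\mathcal{T}^{\text{val}}$, and all names here are our own.

```python
from itertools import product

# Hypothetical sketch of the grid search over the two tuned hyperparameters.
LEARNING_RATES = [5e-6, 1e-5, 5e-5, 1e-4]
R_DIMS = [128, 256, 512, 1024]


def validation_mse(beta, r_dim):
    # Stand-in only: in practice this trains the model with (beta, r_dim) on
    # the training tasks and returns its error on the validation tasks.
    return 0.0


best_beta, best_r_dim = min(product(LEARNING_RATES, R_DIMS),
                            key=lambda cfg: validation_mse(*cfg))
print(f"chosen beta={best_beta}, dim of r={best_r_dim}")
```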
Figure 8. Model sensitivity analysis on MovieLens: (a) varying $K$ in Top-$K$; (b) varying the number of layers of the predictor; (c) varying the support set size during inference.

Figure 9. Model sensitivity analysis on BookCrossing: (a) varying $K$ in Top-$K$; (b) varying the number of layers of the predictor; (c) varying the support set size during inference.

C.1.2. URLs of Datasets and Baselines

We use three benchmark datasets (Table 4): (i) MovieLens (https://grouplens.org/datasets/movielens/1m/) (Harper and Konstan, 2015): a dataset containing 1 million movie ratings collected from MovieLens, whose features include gender, age, occupation, Zip code, publication year, rate, genre, director and actor; (ii) BookCrossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/) (Ziegler et al., 2005): a collection of users' ratings on books in the BookCrossing community, whose features include age, location, publish year, author, and publisher; and (iii) Last.fm (https://grouplens.org/datasets/hetrec-2011/): a collection of users' listening counts of artists from the Last.fm online system, whose features only consist of user and item IDs. In the experiments, we compare ColdNAS with the following representative user cold-start methods: (i) the traditional deep cold-start model DropoutNet (https://github.com/layer6ai-labs/DropoutNet) (Volkovs et al., 2017) and (ii) FSL-based methods, including MeLU (https://github.com/hoyeoplee/MeLU) (Lee et al., 2019), MetaCS (Bharadhwaj, 2019), MetaHIN (https://github.com/rootlu/MetaHIN) (Lu et al., 2020), MAMO (https://github.com/dongmanqing/Code-for-MAMO) (Dong et al., 2020), and TaNP (https://github.com/IIEdm/TaNP) (Lin et al., 2021). MetaCS is very similar to MeLU, except that it updates all parameters during meta-learning. Hence, we implement MetaCS based on the code of MeLU.

C.2. More Experimental Results

Here, we show the empirical results of Section 4.4 on MovieLens and BookCrossing in Figures 8 and 9. The results on all three datasets show similar patterns, and the analysis and conclusions in Section 4.4 apply to each of them.

C.3. Complexity Analysis

Denote the number of layers of the adaptation network as $L_{A}$, the number of layers of the predictor as $L_{P}$, the support set size as $N$, and the query set size as $M$. For notational simplicity, we assume the hidden size of every layer is the same value $D$. In the search phase (Algorithm 1, steps 2$\sim$10), computing $\bm{\Phi}_{i}$ for a task costs $O(NL_{A}D^{3})$ and predicting each item costs $O(L_{P}D^{3})$, so the average complexity is $O((L_{P}+\frac{NL_{A}}{M})D^{3})$. Similarly, in the retraining phase (Algorithm 1, steps 12$\sim$13), the time complexity is also $O((L_{P}+\frac{NL_{A}}{M})D^{3})$. In total, the time complexity is $O((L_{P}+\frac{NL_{A}}{M})D^{3})$.
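
For intuition, the dominant term can be evaluated numerically. In the sketch below, $L_A=2$ (the two FC layers of the adaptation network) and $L_P=4$ (the predictor in Table 6) follow the paper, while $N$, $M$, and $D$ are placeholder values rather than the experimental configuration.

```python
# Rough evaluation of the dominant O((L_P + N * L_A / M) * D^3) term per query.
L_A, L_P = 2, 4          # adaptation-network and predictor depths (paper values)
N, M, D = 20, 10, 128    # support size, query size, hidden width (placeholders)

cost = (L_P + N * L_A / M) * D ** 3
print(f"dominant term ~ {cost:.2e} operations per query")
```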