
Visually Aware Skip-Gram for Image Based Recommendations

Parth Tiwari, IIT Kharagpur, Kharagpur, India ([email protected]); Yash Jain, IIT Kharagpur, Kharagpur, India ([email protected]); Shivansh Mundra, IIT Kharagpur, Kharagpur, India ([email protected]); Jenny Harding, Loughborough University, Loughborough, United Kingdom; and Manoj Kumar Tiwari, National Institute of Industrial Engineering, Mumbai, India
(2020)
Abstract.

The visual appearance of a product significantly influences purchase decisions on e-commerce websites. We propose a novel framework VASG (Visually Aware Skip-Gram) for learning user and product representations in a common latent space using product image features. Our model is an amalgamation of the Skip-Gram architecture and a deep neural network based Decoder. Here the Skip-Gram attempts to capture user preference by optimizing user-product co-occurrence in a Heterogeneous Information Network while the Decoder simultaneously learns a mapping to transform product image features to the Skip-Gram embedding space. This architecture is jointly optimized in an end-to-end, multitask fashion. The proposed framework enables us to make personalized recommendations for cold-start products which have no purchase history. Experiments conducted on large real world datasets show that the learned embeddings can generate effective recommendations using nearest neighbour searches.

Recommender Systems, Skip-Gram, Representation learning, Multitask learning, Cold-Start, Image Features
copyright: ACM; journal year: 2020; conference: Under review, 2020; price: 15.00
CCS concepts: Computing methodologies → Image representations; Applied computing → Online shopping; Computing methodologies → Unsupervised learning; Computing methodologies → Multi-task learning

1. Introduction

Recommender Systems have been described as “an intuitive line of defense against consumer over-choice” (Zhang et al., 2019). With the rapid growth in the number of products available on e-commerce websites, this problem of “over-choice” is becoming increasingly significant, especially for clothing related products. Indeed, fashion recommendation has attracted considerable attention in recent literature, and studies tackling this problem have shown improved performance by incorporating visual information into their recommendation procedure. Hence it is safe to conclude that the visual appearance of a product plays a significant role in purchase decisions.

Image-based recommendation systems can be broadly classified into two groups: (i) systems which incorporate images into a product rating/rank prediction function, building upon well known latent factor models like Singular Value Decomposition (SVD) or Bayesian Personalized Ranking (BPR) (Chen et al., 2019; He and McAuley, 2016; He et al., 2016b, a; Chen et al., 2017); and (ii) systems which learn latent representations (a.k.a. embeddings) using image features and subsequently use embedding distance for making recommendations. These systems rely on extracting fine-grained embeddings which can capture user preference and/or product similarity (Guo et al., 2018; McAuley et al., 2015; Ying et al., 2018; Shankar et al., 2017; Jiang et al., 2016). Beyond images, information coming from product metadata (Vasile et al., 2016), user-product information networks (Shi et al., 2019), review text (Guo et al., 2018; Cheng et al., 2019) etc. has also been leveraged for learning effective representations.

Making recommendations in the cold-start setting is a key area where the addition of visual information has shown significant performance improvement (He and McAuley, 2016). However, we observe considerable variability in the definition of the cold-start setting across the literature. Some studies describe cold-start products as items that have fewer instances of positive feedback in the training set (He and McAuley, 2016; He et al., 2016b; Shi et al., 2019; Chen et al., 2017; Liu et al., 2020), whereas (Vasile et al., 2016) describes a cold-start scenario in which certain product pairs have zero co-occurrences during training. Recommending completely new products (i.e. products which are unseen during the training step) is a tougher problem which is studied less frequently (Schein et al., 2002; Saveski and Mantrach, 2014; Gantner et al., 2010).

Systems falling under category (i) above rely on user and product latent factors derived from interaction records; hence, they are incapable of handling new products unless product latent factors are estimated by other means. Systems falling under category (ii) show better promise for handling this problem. Methods like (Shankar et al., 2017; McAuley et al., 2015), which make recommendations based on product-product similarity, are capable of extracting latent representations for a new product; however, user preference is ignored during feature extraction, so personalized recommendations are not possible. Methods which attempt to capture the preferences of individual users are better suited for handling this problem.

To this end, we propose a novel framework which jointly learns (i) user and product embeddings in a common latent space and (ii) a mapping function which transforms product image features to the same latent space. Specifically, we augment the Skip-Gram model (Mikolov et al., 2013b) with a decoder architecture, where the decoder uses the product Skip-Gram embeddings to reconstruct their corresponding image features. Here the Skip-Gram maximizes the probability of co-occurrence of users and products in a Heterogeneous Information Network (HIN) while the decoder minimizes the image feature reconstruction loss. This decoder serves two purposes: (i) it incorporates visual information into the Skip-Gram embeddings and (ii) it is later used for learning a mapping function which transforms product image features to the Skip-Gram embedding space. The learning objective of this architecture depends on the genre of input (users or products) observed during training: the Skip-Gram model is optimized for both users and products, whereas the decoder is updated only for products, in a multitask fashion. We term this model VASG (Visually Aware Skip-Gram).

The mapping from the product image feature space to the VASG embedding space is subsequently learned and can be used for finding effective embeddings for cold-start products. VASG embeddings reflect user preference while simultaneously capturing a product’s purchase history and visual appearance. Since users and products are represented in the same latent space, personalized recommendations can be made directly by searching for products in a user’s vicinity. Our contributions can be summarized as follows:

  • We propose a novel multitasking Skip-Gram architecture, VASG, which is trained on two different objectives depending on the genre of the input (users or products). This architecture enables us to learn a mapping which transforms product image features to the learned embedding space.

  • Extensive experiments are performed on real world datasets and VASG embeddings are compared to state-of-the-art recommendation systems. The performance for unseen cold start products is also studied.

  • We analyze the learned embeddings in detail to shed some light on the information captured by them.

Figure 1. The VASG framework. (a) User-product interactions are represented in a HIN. (b) Sequences are generated by performing random walks on the HIN; the Skip-Gram context window is used to describe the neighbourhood of a node. (c) The Skip-Gram model is trained on the generated sequences. (d) The decoder $\mathcal{M}^{-1}(\cdot)$ attempts to reconstruct product image features from the Skip-Gram embeddings. (e) An autoencoder is trained to estimate the mapping $\mathcal{M}(\cdot)$.

2. Related Work

Our work relies on several advancements in image feature extraction methods, representation learning techniques in graphs and multitask learning.

With advancements in deep learning, VBPR (He and McAuley, 2016) established that the addition of visual signals can be useful for making recommendations. VBPR adds image features extracted from a pre-trained convolutional neural network to the BPR framework. Similarly, (Chen et al., 2017) extracts more fine-grained image features using attention modules and incorporates them into collaborative filtering. Images have also proven useful for Point-of-Interest recommendations in (Wang et al., 2017) and for tag recommendations in (Rawat and Kankanhalli, 2016).

More recently, there has been a surge of methods which combine images with information coming from other modalities and metadata (Cheng et al., 2019; He et al., 2016b; Guo et al., 2018; He et al., 2016a; Chen et al., 2019). Attention mechanisms which utilize textual reviews have been used for finding the specific parts of an image in which users are most interested (Chen et al., 2019). Reviews and image features have also been used for learning user preference, which is then integrated into a rating-based matrix factorization model (Cheng et al., 2019). Pre-defined product categories and hierarchy trees have been leveraged for learning hierarchical product embeddings (He et al., 2016b).

Recommendation systems which rely on product image similarity, for example methods which address the Street-to-shop recommendation problem (Liu et al., 2012), have received a tremendous boost with improvements in metric learning methods (Gajic and Baldrich, 2018; Jiang et al., 2016; Shankar et al., 2017). These methods use triplet loss to learn embeddings from product images which are robust to changes in background, pose, lighting conditions etc., and hence can be used for making cross-domain recommendations. However, learning both user and product representations in a joint latent space is less frequently explored in the existing literature. (Gong et al., 2020) embeds users and topics in the same low dimensional space to capture their mutual dependency, while (Guo et al., 2018) attempts to bring users, products and product search queries to the same latent space using both textual and visual modalities.

Heterogeneous Information Networks (HINs) can naturally model complex user and product interactions. Traditionally, meta-path based similarity and link prediction have been used for making recommendations in networks. With the development of graph representation learning methods (Perozzi et al., 2014; Grover and Leskovec, 2016; Dong et al., 2017), network embedding based recommendation systems have shown improved performance. The study (Shi et al., 2019) learns user and product representations from a HIN and incorporates them into the SVDFeature (Chen et al., 2012) framework. This idea of learning network representations has also been extended by using multitask learning for jointly optimizing the tasks of recommendation and link prediction in a HIN (Li et al., 2020). Multitask learning is a training regime where multiple objectives are optimized simultaneously; it has been shown to provide performance boosts in works like (Gidaris et al., 2019; Xu et al., 2020).

A few studies relevant to our proposed approach should be discussed here. Learning attribute-to-feature mappings was proposed in (Gantner et al., 2010), where k-nearest neighbours and least squares approximations are used for estimating latent factors for new users or products. A large scale recommender system which uses images along with the HIN structure is proposed in (Ying et al., 2018), where highly efficient graph convolutional neural networks are applied to random walks generated from the HIN and recommendations are made using similarity between the learned embeddings. Contrastive Predictive Coding (CPC) (Oord et al., 2018) is a generic representation learning regime which uses a training procedure similar to the proposed framework: CPC uses encoders to map raw data to latent representations and subsequently trains auto-regressive models using a contrastive loss. VASG differs from CPC in its use of a decoding architecture instead of an encoder, of Skip-Gram instead of an auto-regressive model, and of training with auxiliary supervision instead of self-supervision.

To the best of our knowledge, the model architecture proposed in this paper has not been studied before. In addition, we also make a novel attempt to bring unseen cold start products to the same latent space as existing users and products using product image features.

3. VASG Framework

Consider a set of users $\mathcal{U}$ and a set of products $\mathcal{P}$, where each user $u \in \mathcal{U}$ has interacted with a subset of products $\mathcal{P}_u \subset \mathcal{P}$. Each product $p \in \mathcal{P}$ is associated with a visual feature $f_p \in \mathbb{R}^{I}$ extracted from its image using a deep convolutional neural network. Only a subset of products $\mathcal{P}_{warm} \subset \mathcal{P}$ is observed during training and is termed the warm start set. The set of products never observed during training is termed the completely cold set $\mathcal{P}_{cold} \subset \mathcal{P}$. Note that $\mathcal{P}_{cold} \cap \mathcal{P}_{warm} = \emptyset$.
Our objective is to learn (i) a $D$-dimensional latent representation $\mathcal{X} \in \mathbb{R}^{D}$ for each $u \in \mathcal{U}$ and $p \in \mathcal{P}_{warm}$, and (ii) a mapping function $\mathcal{M}(f_p): \mathbb{R}^{I} \rightarrow \mathbb{R}^{D}$, which is used for finding embeddings of products that have no purchase history. The learned embeddings are expected to represent characteristic features of users and products which can be exploited for making recommendations. The VASG framework achieves these objectives in two steps:

  • The first step trains a Skip-Gram model on sequences of users and products. Here, the Skip-Gram architecture is modified by adding an auxiliary objective of learning an inverse mapping function $\mathcal{M}^{-1}(\mathcal{X}_p): \mathbb{R}^{D} \rightarrow \mathbb{R}^{I}$. The $D$-dimensional embeddings corresponding to users $u \in \mathcal{U}$ and products $p \in \mathcal{P}_{warm}$ are learned during this step.

  • The second step learns the mapping function $\mathcal{M}(f_p): \mathbb{R}^{I} \rightarrow \mathbb{R}^{D}$. This is achieved by training a deep encoder-decoder architecture using the inverse mapping $\mathcal{M}^{-1}(\cdot)$ learned in the previous step.

We describe the network structure used for representing user-product interactions before describing the model architecture.

3.1. Heterogeneous Information Network

Skip-Gram (Mikolov et al., 2013a) is an unsupervised representation learning model for words in a text corpus. The representation corresponding to each word is learned by maximizing the probability of correctly predicting the context surrounding that word. The proposed framework builds upon the Skip-Gram architecture and hence requires sequences of users and products. Taking inspiration from (Shi et al., 2019), we generate these sequences by performing random walks on a bipartite Heterogeneous Information Network (HIN) of users and products.

Formally, our HIN is a graph $G = (V, E)$ where $V = \mathcal{U} \cup \mathcal{P}_{warm}$ is the set of nodes and $E = \{(u, p) \mid p \in \mathcal{P}_u \cap \mathcal{P}_{warm};\ u \in \mathcal{U}\}$ is the set of edges, each connecting a user to a product purchased by that user. Sequences are generated in the following manner: starting from a node $v \in V$, $n$ random walks, each of length $wl$, are performed on the HIN. Every subsequent node in a walk is chosen uniformly at random from the direct neighbours of the current node. For example, in the sample HIN shown in Figure 1(a), a sequence starting from node u1 could be u1 p2 u2 p2 u3, continued for $wl$ nodes. Here, $n$ and $wl$ are hyper-parameters which determine the size of the corpus generated for training. Note that the bipartite structure of the HIN ensures that users and products appear in an alternating fashion in each random walk. The neighbouring context of a node, $\mathcal{N}(v)$, is determined by sliding a fixed-size context window over the generated sequences. We discuss the effect of the context window size on the learned representations in Section 6.
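To make the walk generation concrete, below is a minimal Python sketch; the parameter names mirror the notation above ($n$, $wl$), but the function itself is our illustration rather than the released implementation:

```python
import random
from collections import defaultdict

def generate_walks(edges, n=10, wl=80, seed=0):
    """Random-walk sequences over the bipartite user-product HIN.

    `edges` is an iterable of (user, product) interaction pairs; n walks
    of length wl are started from every node. Bipartiteness makes users
    and products alternate within each walk."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u, p in edges:
        adj[u].append(p)   # user -> purchased product
        adj[p].append(u)   # product -> purchasing user
    walks = []
    for start in list(adj):
        for _ in range(n):
            walk = [start]
            while len(walk) < wl:
                # next node chosen uniformly among direct neighbours
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks
```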

Table 1. Dataset statistics (after preprocessing)
Dataset #products #users #ratings
Women 116,971 14,110 307,863
Men 62,376 19,618 187,181
Shoes 71,219 22,066 211,818
Jewelry 62,958 13,975 161,439

3.2. Visually Aware Skip-Gram

Consider an HIN node $v \in V$ and its neighbouring context $\mathcal{N}(v)$. We can learn the representations corresponding to each node in the HIN by optimizing the Skip-Gram objective:

(1)  $\underset{\theta}{\arg\max} \sum_{v \in V} \sum_{c \in \mathcal{N}(v)} \log\big(p(c \mid v; \theta)\big)$

Here $p(c \mid v; \theta)$ is the commonly used softmax probability, given by $p(c \mid v; \theta) = \frac{e^{\mathcal{X}_c \cdot \mathcal{X}_v}}{\sum_{n \in V} e^{\mathcal{X}_n \cdot \mathcal{X}_v}}$, where $\mathcal{X}_v \in \mathbb{R}^{D}$ is the $D$-dimensional latent representation corresponding to node $v$. Mikolov et al. (Mikolov et al., 2013b) introduce negative sampling for efficiently estimating the softmax probabilities; however, generating effective negative samples can be an expensive process (Jain et al., 2019), and we expect the negative samples to highlight the contrast in a product’s visual appearance and a user’s purchase preference simultaneously. In our experiments we are able to obtain meaningful results without negative sampling, hence we approximate the softmax probability as $\sigma(\mathcal{X}_c \cdot \mathcal{X}_v)$, where $\sigma$ is the sigmoid function, $\sigma(x) = \frac{1}{1+e^{-x}}$.
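A minimal PyTorch sketch of this sigmoid-approximated Skip-Gram term is given below; the class and method names are our illustration, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramHIN(nn.Module):
    """Skip-Gram over HIN nodes with the softmax replaced by the sigmoid
    approximation described above (no negative sampling)."""
    def __init__(self, num_nodes, dim=100):
        super().__init__()
        self.emb = nn.Embedding(num_nodes, dim)

    def skipgram_loss(self, center, context):
        # -log sigma(X_c . X_v), averaged over (center, context) pairs
        # drawn from the sliding context window
        x_v = self.emb(center)    # (batch, D)
        x_c = self.emb(context)   # (batch, D)
        return -F.logsigmoid((x_v * x_c).sum(dim=-1)).mean()
```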

We add the auxiliary objective of learning $\mathcal{M}^{-1}(\mathcal{X}_p): \mathbb{R}^{D} \rightarrow \mathbb{R}^{I}$ to the Skip-Gram objective whenever $v \in \mathcal{P}_{warm}$. The function $\mathcal{M}^{-1}(\cdot)$ attempts to reconstruct the DeepCNN image feature $f_p$ of a product from its latent representation $\mathcal{X}_p$. This inverse mapping function is approximated by training a decoder architecture with the objective of minimizing the mean squared error of image feature reconstruction:

(2)  $\underset{\mathcal{M}^{-1}}{\arg\min} \sum_{p \in \mathcal{P}_{warm}} \lVert f_p - f'_p \rVert^2_2$

where $f'_p = \mathcal{M}^{-1}(\mathcal{X}_p)$. Note that the auxiliary task is optimized only when $v \in \mathcal{P}_{warm}$. This implies that users have a single loss function while products have two loss functions which are optimized simultaneously in a multi-task learning fashion. When a node $v$ is observed in the training corpus, the loss can be written as:

(3)  $\mathcal{L}(v) = \begin{cases} -\log(\sigma(\mathcal{X}_c \cdot \mathcal{X}_v)) & v \in \mathcal{U} \\ w_1\big(-\log(\sigma(\mathcal{X}_c \cdot \mathcal{X}_v))\big) + w_2 \lVert f_v - f'_v \rVert^2_2 & v \in \mathcal{P}_{warm} \end{cases}$

Here $w_1$ and $w_2$ are trainable weights used for combining the Skip-Gram and image reconstruction losses; we follow the approach introduced in (Kendall et al., 2017) for learning these weights.
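One common implementation of this weighting scheme learns a log-variance per task; the sketch below reflects our reading of (Kendall et al., 2017), and the exact parameterization is our assumption:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learned task weighting in the style of Kendall et al. (2017):
    task i receives weight exp(-s_i), where s_i is a learnable
    log-variance, plus a regularizing +s_i term."""
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # losses: per-task scalars, e.g. [skip_gram_loss, recon_loss]
        total = torch.zeros((), device=self.log_var.device)
        for s, loss in zip(self.log_var, losses):
            total = total + torch.exp(-s) * loss + s
        return total
```

For user nodes only the Skip-Gram loss would be passed in; for product nodes the Skip-Gram and reconstruction losses are combined as in Eq. (3).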

Once the embeddings and the inverse mapping $\mathcal{M}^{-1}(\cdot)$ have been learned, the function $\mathcal{M}(\cdot)$ is approximated by training a deep autoencoder to reconstruct the image features of each product $p \in \mathcal{P}_{warm}$. The autoencoder has $\mathcal{M}^{-1}(\cdot)$ as its decoder while the encoder attempts to approximate $\mathcal{M}(\cdot)$. Weight updates are not allowed in the decoder, therefore $\mathcal{M}^{-1}(\cdot)$ remains unchanged during this training process. An overview of the proposed architecture is presented in Figure 1.
Implementation Details: We implemented our model in PyTorch. The decoder architecture $\mathcal{M}^{-1}(\cdot)$ consists of 5 fully connected layers with [256, 512, 1024, 2048, $I$] neurons respectively ($I = 4096$). Each layer has ReLU activation and dropout regularization with dropout probability set to 0.5. The encoder $\mathcal{M}(\cdot)$ follows the same architecture. The Adam optimizer with learning rate 1e-3 is used for parameter updates. To fully utilize GPU acceleration, we use a batch-wise implementation and ensure each batch homogeneously comprises either users or products as we iterate over the Skip-Gram corpus. Training on our largest dataset requires around 30 minutes on an Nvidia Quadro P5000 GPU. We provide our implementation online¹ (link to repository to be released at the time of publication).
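The layer specification above can be assembled as in the following sketch; we assume the encoder mirrors the decoder's layer sizes in reverse, and the helper name mlp is ours:

```python
import torch
import torch.nn as nn

def mlp(sizes, p_drop=0.5):
    """Fully connected stack; each layer is followed by ReLU and
    dropout, per the implementation details above."""
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]),
                   nn.ReLU(), nn.Dropout(p_drop)]
    return nn.Sequential(*layers)

D, I = 100, 4096
decoder = mlp([D, 256, 512, 1024, 2048, I])  # M^{-1}: R^D -> R^I
encoder = mlp([I, 2048, 1024, 512, 256, D])  # M: R^I -> R^D (assumed mirror)

# Second stage: freeze M^{-1} so that only the encoder is updated while
# decoder(encoder(f_p)) learns to reconstruct f_p for warm-start products.
for param in decoder.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
```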

4. Experiments

We perform experiments on real-world datasets to evaluate the VASG embeddings. Our experiments attempt to answer the following research questions: (i) Can the embeddings generate personalized recommendations for each user? (ii) Can the embeddings identify product-to-product relationships? (iii) What properties does the learned embedding space display? (iv) How do different components and hyper-parameters of the proposed framework affect its performance?

4.1. Datasets

Four subcategories of the “Clothing, Shoes & Jewelry” dataset from Amazon Product Reviews (McAuley et al., 2015) are used for our experiments: Men, Women, Shoes and Jewelry. The dataset is pre-processed to remove users without sufficient purchase history; users with $|\mathcal{P}_u| < 5$ are discarded. Users’ rating histories and product image features are used in our framework. Product metadata is also available; however, it is not utilized for learning embeddings, and image URLs are used only for retrieving images for the purpose of visualization. Product image features were extracted using a pre-trained Caffe reference model (Jia et al., 2014), with 5 convolutional layers followed by 3 dense layers. The feature vector is obtained by taking the output of the second dense layer (FC7) of this network and has length $I = 4096$. The details of the datasets are specified in Table 1.
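The original pipeline uses Caffe; as a rough stand-in, FC7 features can be taken from torchvision's AlexNet, which has the same 5-convolutional/3-dense layout. This substitution and the helper below are our assumption, not the paper's exact feature extractor:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Take the 4096-d output of the second dense layer (FC7): classifier
# children [:5] cover Dropout, Linear(9216->4096), ReLU, Dropout,
# Linear(4096->4096).
alexnet = models.alexnet(pretrained=True).eval()
fc7 = torch.nn.Sequential(alexnet.features, alexnet.avgpool,
                          torch.nn.Flatten(),
                          *list(alexnet.classifier.children())[:5])

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def image_feature(path):
    """Return the I = 4096 dimensional feature for one product image."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return fc7(x).squeeze(0)
```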

The training and test sets are created by following the leave-one-out protocol described by (He and McAuley, 2016): for each user $u$, one random product rated by the user is held out for testing, while the remaining products are used for training. Two subsets of the test set are created for testing in the warm start and completely cold start settings: $\mathcal{T}_{warm} = \{(u, p) \mid u \in \mathcal{U};\ p \in \mathcal{P}_{warm}\}$ and $\mathcal{T}_{cold} = \{(u, p) \mid u \in \mathcal{U};\ p \in \mathcal{P}_{cold}\}$. Here $\mathcal{T}_{warm}$ and $\mathcal{T}_{cold}$ are approximately equal in size.
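A minimal sketch of this split (the dict-of-lists data layout is our illustration):

```python
import random

def leave_one_out(user_histories, seed=0):
    """Hold out one randomly chosen rated product per user for testing,
    following the protocol of (He and McAuley, 2016)."""
    rng = random.Random(seed)
    train, test = [], []
    for user, products in user_histories.items():
        held_out = rng.choice(products)
        test.append((user, held_out))
        train.extend((user, p) for p in products if p != held_out)
    return train, test
```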

Table 2. AUC values on the test sets. The best scores are boldfaced. Refer to Section 4.2. "n/a" marks the infeasible VBPR cold-start setting.

Dataset   Setting          RAND     WBOI     VBPR     VASG
Women     Warm Products    0.4998   0.6134   0.7982   0.8964
          Cold Products    0.5001   0.5912   n/a      0.7596
Men       Warm Products    0.4897   0.6468   0.7753   0.8758
          Cold Products    0.4954   0.6094   n/a      0.7061
Shoes     Warm Products    0.5023   0.6318   0.8116   0.8972
          Cold Products    0.4995   0.5846   n/a      0.7261
Jewelry   Warm Products    0.4987   0.6087   0.7629   0.8801
          Cold Products    0.5003   0.5704   n/a      0.7134

4.2. Making Personalized Recommendations

VASG embeddings can be used for generating recommendations by searching for relevant products in the vicinity of the corresponding user in the learned latent space. We evaluate this ability to generate personalized recommendations in a Bayesian Personalized Ranking (BPR) scenario. BPR based methods (He and McAuley, 2016; Rendle et al., 2012) learn a rating prediction function by optimizing pairwise rankings of products w.r.t. users. Predicted rankings are then evaluated using the well known AUC (Area Under the ROC Curve) metric:

(4)  $AUC = \frac{1}{|\mathcal{T}_{test}|} \sum_{u \in \mathcal{T}_{test}} \frac{1}{|\mathcal{P} \setminus \mathcal{P}_u|} \sum_{p_j \in \mathcal{P} \setminus \mathcal{P}_u} \mathbf{1}\big((u, p_{test}) > (u, p_j)\big)$

where $\mathcal{T}_{test}$ corresponds to $\mathcal{T}_{warm}$ or $\mathcal{T}_{cold}$ and $(u, p_{test}) \in \mathcal{T}_{test}$. Here, $\mathbf{1}(\cdot)$ is an indicator function which evaluates whether $p_{test}$ has been ranked higher than $p_j$ for user $u$. BPR based methods rank products using the learned rating prediction function; we rank products for VASG using the cosine similarity between the corresponding user and product embeddings.
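A sketch of the inner term of Eq. (4) with cosine-similarity ranking (function names are ours):

```python
import numpy as np

def user_auc(x_u, x_test, x_neg):
    """Per-pair AUC term: the fraction of non-purchased products p_j
    ranked below the held-out product, with ranking given by cosine
    similarity to the user embedding x_u."""
    def cos(u, M):
        return M @ u / (np.linalg.norm(M, axis=-1) * np.linalg.norm(u) + 1e-12)
    return float((cos(x_u, x_test[None, :])[0] > cos(x_u, x_neg)).mean())
```

The reported AUC is then the mean of this term over all pairs in $\mathcal{T}_{warm}$ or $\mathcal{T}_{cold}$.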

4.2.1. Baselines

We compare the proposed embeddings against the raw image features and against a state-of-the-art visually aware BPR method:

  • RAND (Random) - Product rankings are decided randomly for all users.

  • WBOI (Weighted Bag Of Images) - User embeddings are computed directly from the image features of the products the user has purchased: the rating-weighted mean of these image features is used as the user embedding, and ranks are computed using cosine similarity as described above.

  • VBPR (Visual Bayesian Personalized Ranking) - Introduced by (He and McAuley, 2016), VBPR incorporates visual factors into the BPR framework by using DeepCNN product image features in its rating prediction function.

4.2.2. Results

The entire pipeline for VASG with $D = 100$ is run 5 times and the averaged results are reported in Table 2. The results for VBPR are calculated using 20 latent factors and 100 visual factors. Results are reported separately on the set $\mathcal{T}_{warm}$ (Warm Products) and on $\mathcal{T}_{cold}$ (Cold Products). VASG embeddings for cold start products are found using the mapping $\mathcal{M}(\cdot)$. Since latent factors for cold start products are not available in VBPR, computing its results on the set $\mathcal{T}_{cold}$ is infeasible.

VASG embeddings show an average improvement of 3.84% over VBPR in the warm start setting. The AUC scores drop in the cold start setting; however, they remain significantly better than the WBOI baseline. This drop in performance can be explained by imperfections in the approximation of the mapping function $\mathcal{M}(\cdot)$.

4.3. Identifying Product Relationships

We expect the embeddings corresponding to co-purchased products to be similar. The evaluation strategy used by McAuley et al. (IBR) (McAuley et al., 2015) is followed for identifying product relationships using VASG embeddings. IBR uses product DeepCNN image features to learn a parametric distance function by maximizing the probability of correctly identifying co-purchased product pairs. Formally, two products $p_i$ and $p_j$ share a relation $R_{ij}$ if they have been purchased by the same user; this relation is termed an “also bought” relation in IBR. We attempt to correctly identify co-purchased product pairs using the cosine similarity between the VASG embeddings. The test set is restructured in the following manner for reporting the results: a set of positive relation pairs $\mathcal{R} = \{(p_i, p_j) \mid \exists\, R_{ij}\}$ and a set of negative relation pairs $\mathcal{Q} = \{(p_i, p_j) \mid \nexists\, R_{ij}\}$ are constructed such that $|\mathcal{R}| = |\mathcal{Q}|$, with $p_i, p_j \in \mathcal{P}$. A positive relation is predicted when $cosine\ similarity(\mathcal{X}_i, \mathcal{X}_j) > t$, where $\mathcal{X}_i$ is the VASG embedding corresponding to $p_i$ and $t$ is a manually assigned threshold. The accuracy of identifying co-purchased product pairs is reported on the combined set $\mathcal{R} \cup \mathcal{Q}$.
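A sketch of the thresholded relation test (the threshold $t$ is hand-tuned, as stated above; the function name is ours):

```python
import numpy as np

def also_bought(x_i, x_j, t):
    """Predict a positive 'also bought' relation when the cosine
    similarity of two product embeddings exceeds the threshold t."""
    sim = x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j) + 1e-12)
    return sim > t
```

Accuracy is then the fraction of pairs in $\mathcal{R} \cup \mathcal{Q}$ whose prediction matches the label.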

Table 3. Accuracy of predicting product relations. Refer to Section 4.3.

Dataset   INN       IBR       VASG      VASG vs INN   VASG vs IBR
Women     67.82%    87.43%    79.59%    +17%          -10%
Men       66.23%    87.13%    80.12%    +21%          -9%
Shoes     69.34%    89.67%    81.56%    +18%          -10%
Jewelry   65.42%    83.25%    79.36%    +21%          -5%
Figure 2. 2D visualizations of VASG embeddings. (a) PCA projections of 100-dimensional embeddings for each dataset; 200 randomly selected products are plotted. (b) t-SNE projections of 100-dimensional embeddings for each dataset; products purchased by 4 randomly selected users are plotted, and products corresponding to the same user have the same frame colour.

4.3.1. Baselines

Since $|\mathcal{R}| = |\mathcal{Q}|$, random classification is 50% accurate. In addition, the performance of VASG embeddings is compared to the following methods:

  • INN (Image Nearest Neighbours) - The cosine similarity between the DeepCNN image features $f_i$ and $f_j$ is used for identifying relationships in the same manner as described above.

  • IBR (Image Based Recommendations) - As described above, this method is trained on all co-purchased product pairs occurring in the training set.

4.3.2. Results

The entire pipeline for VASG with $D = 100$ is run 5 times and the averaged results are reported in Table 3. For IBR, the rank of the Mahalanobis transform is set to 100 for fair comparison. Here, VASG embeddings show an average improvement of 19% over INN; however, they do not outperform IBR, which shows significantly better results across all datasets. It should be noted that IBR is trained specifically for identifying all product-product relations (McAuley et al., 2015) whereas VASG embeddings are not: our method learns embeddings by capturing the co-occurrence of users and products in a small network neighbourhood, which is better suited for making personalized recommendations, as shown by the results in Section 4.2.

5. Analysis of Embeddings

Embedding Visualization. We visualize the learned embeddings in lower dimensions to understand the information they capture. We plot 2D projections of a random set of products from each dataset in Figure 2(a), with dimensions reduced using Principal Component Analysis (PCA). A variation in the visual appearance of the products can be seen when travelling along the axes of these plots. For example, in the Men dataset, the x-axis is populated with watches while the y-axis contains clothing related items like t-shirts. A similar trend can be observed for the Women and Shoes datasets; however, the plot for the Jewelry dataset looks noisy. Next, we visualize groups of products which have been purchased by specific users in Figure 2(b). Here we select 4 random users and plot all the products purchased by them, with dimensions reduced using t-SNE (perplexity set to 30). Products corresponding to a specific user are marked by the colour of the frame around them; in the case of co-purchased products, the frame colour is chosen randomly. It can be seen that products purchased by the same user are clustered together. These plots show that the learned embeddings can simultaneously capture the visual properties and the purchase history of a product.

Figure 3. Examples of recommendations generated by exploiting linear relationships in the embeddings.

Linear relationships. Skip-Gram word embeddings are known to follow a linear algebraic structure in which embeddings display associative properties (shown by the famous example vector(“King”) - vector(“Man”) + vector(“Woman”) ≈ vector(“Queen”) (Mikolov et al., 2013a)). We attempt to exploit a similar structure in the VASG embeddings for recommending products. Consider a pair $(u_1, p_1)$ with $p_1 \in \mathcal{P}_{u_1}$ and a user $u_2 \neq u_1$. Keeping $(u_1, p_1)$ and $u_2$ fixed, if we can find a product $p \in \mathcal{P}$ such that

$cosine\ similarity(\mathcal{X}_{p_1} - \mathcal{X}_{u_1} + \mathcal{X}_{u_2},\ \mathcal{X}_p) \approx 1$

then we say that $p$ is a user-specific recommendation generated for the user $u_2$ with query product $p_1$. To test this methodology, we sample 10,000 pairs $(u_1, u_2)$, find the nearest neighbours of the point $\mathcal{X}_{p_1} + \mathcal{X}_{u_2} - \mathcal{X}_{u_1}$ in the set $\mathcal{P}$ using cosine distance, and treat them as the recommendations generated by VASG. We obtained a precision@5 of 89.7% on the Men dataset. Figure 3 shows some of the recommendations generated by this methodology. Roughly speaking, a product $p$ is recommended for user $u_2$ when the similarities between $p$ and $p_1$ and between $p$ and $u_2$ are high while the similarity between $p$ and $u_1$ is low.
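A sketch of this analogy-based nearest-neighbour search (the function name and top-$k$ framing are ours):

```python
import numpy as np

def analogy_recommend(x_p1, x_u1, x_u2, product_embs, k=5):
    """Top-k products closest in cosine similarity to the query point
    X_p1 - X_u1 + X_u2, mirroring the linear-relationship test above."""
    q = x_p1 - x_u1 + x_u2
    sims = product_embs @ q / (np.linalg.norm(product_embs, axis=1)
                               * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k]
```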
Embedding similarity distribution. This analysis is inspired by (Ying et al., 2018). The effectiveness of the learned embeddings can be judged by the distribution of distances between random pairs of embeddings: a wide distribution indicates that the embedding space has enough “resolution” to capture the relevance of different product pairs. Figure 4 plots the distribution of cosine similarities between pairs of products from the Men dataset, comparing the distribution obtained from the VASG embeddings against the one computed from raw product image features. The VASG embeddings show a wider distribution, demonstrating their advantage over raw image features: the kurtosis of the distribution is 0.87 for VASG embeddings, compared to 3.67 for image features.
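To reproduce this analysis, similarities of randomly sampled embedding pairs can be collected and their kurtosis computed, e.g. with scipy; the sampling scheme below is our assumption:

```python
import numpy as np
from scipy.stats import kurtosis

def cosine_similarity_sample(embs, n_pairs=100_000, seed=0):
    """Cosine similarities of randomly sampled embedding pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(embs), n_pairs)
    j = rng.integers(0, len(embs), n_pairs)
    unit = embs / (np.linalg.norm(embs, axis=1, keepdims=True) + 1e-12)
    return (unit[i] * unit[j]).sum(axis=1)

# e.g. kurtosis(cosine_similarity_sample(vasg_embs)) for the VASG curve
```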

Figure 4. Probability density of pairwise cosine similarity obtained from VASG embeddings and product image features.

6. Sensitivity Analysis

Influence of embedding size. We study the variation of AUC scores with the embedding dimension $D$. Figure 5 shows AUC scores for warm-start products when evaluating the performance of making personalized recommendations. Performance improves as the dimension grows; however, the rate of improvement slows down after $D = 70$.
Influence of noise addition. Autoencoders are known to suffer from overfitting. To overcome this problem, we add a small amount of Gaussian noise to the product image features while learning the mapping $\mathcal{M}(\cdot)$. We find that noise addition improves the personalized recommendation performance for cold-start products by an average margin of 1.8% in AUC score.
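Concretely, this corresponds to a denoising variant of the second-stage autoencoder; in the sketch below the noise scale sigma is our assumption (the text only states that the noise is small):

```python
import torch
import torch.nn.functional as F

def denoising_step(encoder, decoder, f_p, sigma=0.01):
    """One objective evaluation of the noisy second-stage autoencoder:
    corrupt the image features, then reconstruct the originals."""
    noisy = f_p + sigma * torch.randn_like(f_p)
    return F.mse_loss(decoder(encoder(noisy)), f_p)
```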
Influence of Skip-Gram window size. The window size determines the neighbourhood size of an HIN node; for example, a window size of 5 restricts the neighbourhood to second-degree neighbours. A smaller context window exposes the embeddings to local HIN structures, which is suited for making personalized recommendations. We experiment with window sizes {3, 5, 7, 9} and find that a window size of 7 gives the best results. Smaller window sizes lead to faster convergence, but the embeddings over-fit to a small neighbourhood, whereas larger window sizes make the training corpus unnecessarily large.
Influence of the auxiliary decoder. We study how VASG’s performance is affected by removing the auxiliary decoder, which reduces the proposed architecture to the general Skip-Gram model. We evaluate the performance of making personalized recommendations on the set $\mathcal{T}_{warm}$. The performance drops by an average of 2.4% on the Women, Men and Shoes datasets while remaining almost unchanged for the Jewelry dataset. This indicates that the addition of visual features improves the conventional Skip-Gram embeddings.

Figure 5. Variation of AUC scores with embedding dimensions

7. Conclusion

In this paper, we proposed a novel embedding learning architecture, VASG, which incorporates visual features into the Skip-Gram model. This model has two different loss functions depending on whether a user or a product is observed during training. VASG embeddings can be used for making personalized recommendations using nearest neighbour searches in the learned latent space. In addition, this framework enables us to find embeddings for cold start products which have never been observed during training. The embeddings are trained on real world datasets and extensive analysis brings forward several interesting properties captured by them.

There are multiple potential directions in which this work can be extended. The use of effective negative samples during training could enable us to learn a more refined latent space. Adversarial training could be used to reconstruct product images using Generative Adversarial Networks conditioned on the Skip-Gram embeddings. Additional auxiliary tasks which bring in information from other modalities can be added to the central Skip-Gram architecture. VASG embeddings can also be incorporated into frameworks like SVDFeature for predicting ratings.

References

  • Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (Shinjuku, Tokyo, Japan) (SIGIR ’17). Association for Computing Machinery, 335–344.
  • Chen et al. (2012) Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. 2012. SVDFeature: A Toolkit for Feature-based Collaborative Filtering. J. Mach. Learn. Res. 13, 1 (Dec. 2012), 3619–3622.
  • Chen et al. (2019) Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized Fashion Recommendation with Visual Explanations Based on Multimodal Attention Network: Towards Visually Explainable Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France) (SIGIR’19). Association for Computing Machinery, 765–774.
  • Cheng et al. (2019) Zhiyong Cheng, Xiaojun Chang, Lei Zhu, Rose C. Kanjirathinkal, and Mohan Kankanhalli. 2019. MMALFM: Explainable Recommendation by Leveraging Reviews and Images. ACM Trans. Inf. Syst. 37, 2, Article 16 (Jan. 2019).
  • Dong et al. (2017) Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In KDD ’17. ACM, 135–144.
  • Gajic and Baldrich (2018) B. Gajic and R. Baldrich. 2018. Cross-Domain Fashion Image Retrieval. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 1950–1952.
  • Gantner et al. (2010) Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, and L. Schmidt-Thieme. 2010. Learning Attribute-to-Feature Mappings for Cold-Start Recommendations. In 2010 IEEE International Conference on Data Mining. 176–185.
  • Gidaris et al. (2019) Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Perez, and Matthieu Cord. 2019. Boosting Few-Shot Visual Learning With Self-Supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Gong et al. (2020) Lin Gong, Lu Lin, Weihao Song, and Hongning Wang. 2020. JNET: Learning User Representations via Joint Network Embedding and Topic Embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining (Houston, TX, USA) (WSDM ’20). Association for Computing Machinery, 205–213.
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
  • Guo et al. (2018) Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Xin-Shun Xu, and Mohan Kankanhalli. 2018. Multi-Modal Preference Modeling for Product Search. In Proceedings of the 26th ACM International Conference on Multimedia (Seoul, Republic of Korea) (MM ’18). Association for Computing Machinery, 1865–1873.
  • He et al. (2016a) Ruining He, Chen Fang, Zhaowen Wang, and Julian McAuley. 2016a. Vista: A Visually, Socially, and Temporally-Aware Model for Artistic Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys ’16). Association for Computing Machinery, 309–316.
  • He et al. (2016b) Ruining He, Chunbin Lin, Jianguo Wang, and Julian McAuley. 2016b. Sparse hierarchical embeddings for visually-aware one-class collaborative filtering. In International Joint Conference on Artificial Intelligence.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI’16). AAAI Press, 144–150.
  • Jain et al. (2019) Himanshu Jain, Venkatesh Balasubramanian, Bhanu Chunduri, and Manik Varma. 2019. Slice: Scalable Linear Extreme Classifiers Trained on 100 Million Labels for Related Searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM ’19). Association for Computing Machinery, 528–536.
  • Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (Orlando, Florida, USA) (MM ’14). Association for Computing Machinery, 675–678.
  • Jiang et al. (2016) Shuhui Jiang, Yue Wu, and Yun Fu. 2016. Deep Bi-Directional Cross-Triplet Embedding for Cross-Domain Clothing Retrieval. In Proceedings of the 24th ACM International Conference on Multimedia (Amsterdam, The Netherlands) (MM ’16). Association for Computing Machinery, 52–56.
  • Kendall et al. (2017) Alex Kendall, Yarin Gal, and Roberto Cipolla. 2017. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. CoRR abs/1705.07115 (2017). arXiv:1705.07115 http://arxiv.org/abs/1705.07115
  • Li et al. (2020) H. Li, Y. Wang, Z. Lyu, and J. Shi. 2020. Multi-task Learning for Recommendation over Heterogeneous Information Network. IEEE Transactions on Knowledge and Data Engineering (2020), 1–1.
  • Liu et al. (2020) Siwei Liu, Iadh Ounis, Craig Macdonald, and Zaiqiao Meng. 2020. A Heterogeneous Graph Neural Model for Cold-Start Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, 2029–2032.
  • Liu et al. (2012) S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan. 2012. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. 3330–3337.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In SIGIR.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). arXiv:1301.3781
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS’13). Curran Associates Inc., USA, 3111–3119.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 701–710.
  • Rawat and Kankanhalli (2016) Yogesh Singh Rawat and Mohan S Kankanhalli. 2016. ConTagNet: Exploiting user context for image tag recommendation. In Proceedings of the 24th ACM international conference on Multimedia. 1102–1106.
  • Rendle et al. (2012) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian Personalized Ranking from Implicit Feedback. CoRR abs/1205.2618 (2012). arXiv:1205.2618 http://arxiv.org/abs/1205.2618
  • Saveski and Mantrach (2014) Martin Saveski and Amin Mantrach. 2014. Item Cold-Start Recommendations: Learning Local Collective Embeddings. In Proceedings of the 8th ACM Conference on Recommender Systems (Foster City, Silicon Valley, California, USA) (RecSys ’14). Association for Computing Machinery, 89–96.
  • Schein et al. (2002) Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. 2002. Methods and Metrics for Cold-Start Recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Tampere, Finland) (SIGIR ’02). Association for Computing Machinery, 253–260.
  • Shankar et al. (2017) Devashish Shankar, Sujay Narumanchi, H. A. Ananya, Pramod Kompalli, and Krishnendu Chaudhury. 2017. Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce. CoRR abs/1703.02344 (2017). arXiv:1703.02344 http://arxiv.org/abs/1703.02344
  • Shi et al. (2019) C. Shi, B. Hu, W. X. Zhao, and P. S. Yu. 2019. Heterogeneous Information Network Embedding for Recommendation. IEEE Transactions on Knowledge and Data Engineering 31, 2 (2019), 357–370.
  • Vasile et al. (2016) Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys ’16). Association for Computing Machinery, 225–232.
  • Wang et al. (2017) Suhang Wang, Yilin Wang, Jiliang Tang, Kai Shu, Suhas Ranganath, and Huan Liu. 2017. What your images reveal: Exploiting visual contents for point-of-interest recommendation. In Proceedings of the 26th International Conference on World Wide Web. 391–400.
  • Xu et al. (2020) X. Xu, H. Lu, J. Song, Y. Yang, H. T. Shen, and X. Li. 2020. Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval. IEEE Transactions on Cybernetics 50 (2020), 2400–2413.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, 974–983.
  • Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 52, 1, Article 5 (Feb. 2019).