
Investigating LLM Applications in E-Commerce

Chester Palen-Michel ([email protected]), Mitchom School of Computer Science, Brandeis University, Waltham, Massachusetts, USA; Ruixiang Wang ([email protected]), eBay, San Jose, California, USA; Yipeng Zhang ([email protected]), eBay, San Jose, California, USA; David Yu ([email protected]), eBay, San Jose, California, USA; Canran Xu ([email protected]), eBay, San Jose, California, USA; and Zhe Wu ([email protected]), eBay, San Jose, California, USA
(2024; 16 August 2024; 30 August 2024; 25 October 2024)
Abstract.

The emergence of Large Language Models (LLMs) has revolutionized natural language processing across many applications, including e-commerce. A crucial step before applying LLMs in this field is to understand and compare their performance across different use cases and tasks. This paper explores the efficacy of LLMs in the e-commerce domain, focusing on instruction-tuning an open-source LLM with public e-commerce datasets of varying sizes and comparing its performance with that of the conventional models prevalent in industrial applications. We conducted a comprehensive comparison between LLMs and traditional pre-trained language models across tasks intrinsic to the e-commerce domain, namely classification, generation, summarization, and named entity recognition (NER). Furthermore, we examined the effectiveness of the current niche industrial practice of using a very large LLM with in-context learning on e-commerce-specific tasks. Our findings indicate that few-shot inference with very large LLMs often does not outperform fine-tuning smaller pre-trained models, underscoring the importance of task-specific model optimization. Additionally, we investigated different training methodologies such as single-task training, mixed-task training, and LoRA merging, both within individual tasks and across different tasks. Through rigorous experimentation and analysis, this paper offers valuable insights into the potential effectiveness of LLMs for advancing natural language processing capabilities within the e-commerce industry.

Large Language Models, LLM, E-Commerce, Text Generation
copyright: acmlicensed; journalyear: 2024; doi: XXXXXXX.XXXXXXX; conference: August 22, 2024; ccs: Applied computing / E-commerce infrastructure; ccs: Computing methodologies / Natural language generation; ccs: Computing methodologies / Information extraction

1. Introduction

Large Language Models (LLMs) have recently gained ubiquitous use in many domains (Singhal et al., 2022; Trautmann et al., 2022; Malinka et al., 2023). In the e-commerce domain in particular, LLMs have the potential to facilitate the creation of product descriptions, summarize reviews, expand queries, and answer buyer and seller questions, among other potential use cases.

We simulated a common practice in the e-commerce industry: adapting open-source state-of-the-art models for domain-specific tasks. We compared the feasibility of using an LLM and explored to what extent using an LLM for different e-commerce tasks leads to gains in evaluation metrics. Training LLMs from scratch requires a significant amount of resources, so a common practice for efficient training is to use parameter-efficient methods like low-rank adapters (LoRA) (Hu et al., 2021). Important questions arise when adapting these models to specific tasks and domains, and we attempt to answer them in this work: How much training data is needed to adapt a model to a task? How much do LLMs outperform more traditional approaches? How do different tasks interact with each other when training on mixed datasets or merging LoRA weights trained on individual tasks? There are various approaches for adapting LLMs to tasks in a specific domain; we focus on LoRA supervised fine-tuning (SFT), multi-task training, zero-shot inference, and LoRA merging.

Our contributions are as follows: 1) Organizing and formatting four e-commerce datasets for the evaluation of large language models (LLMs). 2) Conducting comprehensive experiments to compare fine-tuning a large language model with conventional industrial baselines, such as BERT and T5, using varying amounts of data for e-commerce tasks; additionally, we examine the effectiveness of in-context learning with a highly competitive, very large LLM. 3) Exploring the use of mixed LoRA merging across different tasks and comparing this approach to traditional mixed dataset training.

Our findings indicate that for e-commerce-specific tasks, conventional methods, such as training smaller language models, can achieve performance that is comparable to or even surpasses that of general-purpose very large LLMs. These insights are valuable for the application of such models within the e-commerce industry.

2. Background

Large Language Models (LLMs) (Chowdhery et al., 2023; OpenAI et al., 2023; Anil et al., 2023; Almazrouei et al., 2023; Touvron et al., 2023) have seen increasing attention in recent years as models that perform natural language generation have begun to be used for multiple tasks. They differ from prior pre-trained language models (PLMs), such as BERT (Devlin et al., 2019) or T5 (Raffel et al., 2020), in their amount of training data and number of parameters.

2.1. Instruction Fine-tuning

Instruction fine-tuning represents a pivotal advancement in optimizing large language models (LLMs), such as GPT-4, for enhanced task-specific performance, especially in domain-specific applications (Hu et al., 2023; Zhang et al., 2024). This method involves supplementary training of a pre-trained base model such as GPT (OpenAI et al., 2023), Llama (Touvron et al., 2023), or Falcon (Almazrouei et al., 2023) on a task-specific dataset consisting of prompts paired with their optimal responses. The objective is to refine the model’s capacity to comprehend and execute instructions with increased accuracy and contextual relevance. Instruction fine-tuning has emerged as an invaluable technique for augmenting the proficiency of LLMs across various specialized domains, ensuring their outputs align more closely with user expectations and requirements.

2.2. Low-Rank Adaptation Training

Low-Rank Adaptation (LoRA) (Hu et al., 2021) is a technique designed to fine-tune large language models (LLMs) in a resource-efficient manner. It addresses the challenge of adapting pre-trained models to specialized tasks without the extensive computational costs associated with traditional full-model fine-tuning. At the heart of LoRA is the introduction of trainable low-rank matrices that target specific components of the LLM’s architecture, namely the attention and feed-forward layers of the transformer. Specifically, LoRA freezes the pre-trained layers of the LLM and, for each targeted layer, trains a rank-decomposition matrix pair that is injected into that layer to accomplish the fine-tuning.

LoRA adds low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to the existing weight matrices $\mathbf{W}$ of the model. The model’s original parameters are kept frozen during fine-tuning while $\mathbf{A}$ and $\mathbf{B}$ are updated. These matrices are much smaller than $\mathbf{W}$, enabling significant reductions in the number of trainable parameters. The adaptation occurs through the equation $\mathbf{W}^{\prime}=\mathbf{W}+\mathbf{A}\mathbf{B}^{T}$, where $\mathbf{W}^{\prime}$ represents the adapted weights. This process selectively fine-tunes the model, allowing it to acquire new capabilities or improve performance on specific tasks with minimal adjustments to its pre-trained parameters. This selective updating is particularly beneficial for domain-specific applications, where only certain aspects of the model’s knowledge need refinement.
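To make the update concrete, the following minimal PyTorch-style sketch illustrates the general LoRA formulation; it is not the exact implementation used in our experiments or in any particular library, and the rank and scaling values are placeholders.

```python
# Minimal illustration of a LoRA-adapted linear layer: the frozen base weight W
# is augmented with a trainable low-rank update A B^T (scaled by alpha / rank).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W stays frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying W' = W + scale * A B^T to the input.
        return self.base(x) + self.scale * ((x @ self.A) @ self.B.t())
```

Only A and B receive gradients, which is what makes the adaptation parameter-efficient.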

2.3. Evaluation

With the rise of text generation models that are seemingly capable of performing a large number of tasks and answering many questions, a number of evaluation strategies have been proposed. Evaluation leaderboards often consist of tasks like HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2021b), and others (Gao et al., 2023; Hendrycks et al., 2021b, a) that cover a broad range of multiple-choice questions. These multiple-choice evaluations, where the answer is chosen as the option with the highest likelihood, can be brittle, as rankings can be sensitive to minute details (Alzahrani et al., 2024). Other evaluation benchmarks like GLUE (Wang et al., 2018) consist of a bundle of different tasks with task-specific evaluation metrics. Another approach to evaluating LLM performance is to use an LLM as a judge. Chatbot Arena (Zheng et al., 2024) is an example of this style of evaluation, and while it can correlate with human judgment, like other LLM applications it can be subject to hallucinations. In this work, we focus on directly evaluating the tasks of interest with existing scoring practices.

2.4. LLMs for E-commerce

Zhang et al. (2024) find that the scaling behavior of LLM fine-tuning appears to be highly task dependent; however, there has been little published work examining fine-tuning in the e-commerce domain. Some prior work has investigated the use of LLMs on e-commerce tasks. EcomGPT (Li et al., 2023) framed e-commerce tasks as instruction fine-tuning, but did not explore LoRA (Hu et al., 2021), how different tasks enhance or interfere with each other, or the amount of data required for reasonable performance.

3. E-Commerce Datasets

There are a limited number of publicly available e-commerce datasets, and currently few e-commerce benchmarks for evaluating LLMs. We collected four datasets covering classification, sequence labeling, product description generation, and review summarization in order to evaluate the performance of LLMs in the e-commerce domain.

Task Train Dev Test
ESCI Classification 1,393,063 - 425,762
QueryNER: Query Segmentation 7,841 871 933
Review Summarization 25,203 3,114 3,166
Description Generation 431,470 - 103,865
Table 1. Sizes of dataset splits by input prompt and output pair for each task.

3.1. ESCI Multi-class Product Classification

The Shopping Queries ESCI dataset (Reddy et al., 2023) contains search queries, released with the aim of fostering research in the area of semantic matching of queries and products. The dataset contains three tasks: Query-Product Ranking, Multi-Class Product Classification, and Product Substitute Identification. We use the ESCI Multi-Class Product Classification task, in which a query and product pair is classified as an Exact match (E), Substitute (S), Complement (C), or Irrelevant (I). Because Query-Product Ranking and Product Substitute Identification involve more complexity and longer input strings, we do not include them for LLM evaluation in this work.

3.2. QueryNER

QueryNER (Palen-Michel et al., 2024) is an e-commerce query segmentation dataset. The task in QueryNER is not to extract aspects, but rather to segment the user’s query into meaningful chunks. QueryNER uses a subset of the Amazon Shopping Queries Dataset (Reddy et al., 2023) as the underlying data. QueryNER consists of an ontology of 17 types. The entity types are: core_product_type, product_name, product_number, modifier, creator, condition, unit of measurement (UoM), department, material, time, content, color, shape, quantity, occasion, origin, and price.

Because QueryNER uses BIO encoding (B for begin, I for inside a span, O for outside a span), we linearized the data in order to have an input prompt and output string. The formatting of the linearized data is a series of (token, label) pairs.
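As a concrete illustration of this linearization (the helper function is ours, not part of the QueryNER release), a BIO-tagged query is flattened into the target string as follows:

```python
# Turn BIO-tagged tokens into the linearized "(token, label), ..." target string.
def linearize(tokens, bio_tags):
    return ", ".join(f"({tok}, {tag})" for tok, tag in zip(tokens, bio_tags))

print(linearize(["yarn", "swift", "carrying", "case"],
                ["B-modifier", "I-modifier", "B-core_product_type", "I-core_product_type"]))
# -> (yarn, B-modifier), (swift, I-modifier), (carrying, B-core_product_type), (case, I-core_product_type)
```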

3.3. Review Summarization

AMASUM (Bražinskas et al., 2021) is a dataset for summarizing product reviews. The product reviews are in English and come from bestreviews.com, cnet.com, pmag.co.uk, and runrepeat.com, and mainly cover electronics and running shoes. The dataset pairs a list of product reviews with a summary containing a “verdict” on the product as well as lists of pros and cons. The original paper focuses on selecting useful reviews to summarize.

The goal of our evaluation is not to assess the LLM’s review selection capability, so we select only a small number of reviews to summarize. Since the dataset includes a field recording how many users voted a review helpful, we take the four reviews with the most helpful votes as the input for generating the verdict summary.
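A minimal sketch of this selection step is shown below; the field names (customer_reviews, helpful_votes, website_summaries, verdict) are our assumptions about the AMASUM schema and are shown only for illustration.

```python
# Keep the four reviews with the most helpful votes and pair them with the
# "verdict" summary as the training target. Field names are assumptions.
def build_summarization_example(product: dict, k: int = 4) -> dict:
    reviews = sorted(product["customer_reviews"],
                     key=lambda r: r.get("helpful_votes", 0),
                     reverse=True)[:k]
    source = "\n".join(r["text"] for r in reviews)
    target = product["website_summaries"][0]["verdict"]
    return {"input": source, "output": target}
```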

3.4. Product Description Generation

There appears to be a lack of standard benchmark datasets in English for product description generation despite a decent amount of prior work. Koto et al. (2022) stated they were not able to release the dataset due to copyright issues. Chan et al. (2019) and Zhang et al. (2019) collected data from Taobao (https://www.taobao.com/). Zhang et al. (2019) also stated that there was no other standard dataset for product description generation. Wang et al. (2017) created their dataset from attribute values and descriptions from Amazon but only in the category “Computers and Tablets”.

Without a clear prior benchmark for this task, we create a simple product description task from the ESCI Shopping Queries Dataset (Reddy et al., 2023). We assemble an input consisting of a product title, brand, color, and bullet points. The bullet points in the original dataset are aspect-value pair information about a product or short snippets about the product. The expected output is the product description. We filter out items where there is no title, no description, no bullet points, or items where the description is an exact match of the title or bullet points.
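The sketch below shows how an input/output pair can be assembled and filtered, under the assumption that we read the standard ESCI product columns (product_title, product_brand, product_color, product_bullet_point, product_description); the function name is ours.

```python
# Assemble the description-generation input and drop degenerate items.
def build_description_example(item: dict):
    title = item.get("product_title") or ""
    bullets = item.get("product_bullet_point") or ""
    desc = item.get("product_description") or ""
    # Filter: missing title/description/bullets, or a description that simply
    # repeats the title or the bullet points.
    if not title or not bullets or not desc or desc in (title, bullets):
        return None
    prompt = (f"Title: {title} "
              f"Brand: {item.get('product_brand') or ''} "
              f"Color: {item.get('product_color') or ''} "
              f"Bullet points: {bullets}")
    return {"input": prompt, "output": desc}
```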

3.5. Task Alignment & Prompt Design

To enable instruction fine-tuning, each dataset had to be cast as a sequence-to-sequence task in order to be used with an LLM. Review summarization and product description generation were already naturally treated as sequence-to-sequence tasks. It was less obvious how best to handle the classification and sequence labeling tasks: we treated the classification label as text to be generated, and for sequence labeling, the model was expected to output tuples of each token along with its label.

Prompts were designed to provide the model with enough context to accomplish its task. We created task-specific templates for each task. Each prompt asked the model to act as an e-commerce expert to provide context, then gave a brief description of the task along with the training example. Examples of prompts are shown in Appendix B.

4. Experiments

4.1. Baselines

We chose the most common and competitive baselines for each e-commerce task. For the classification tasks, ESCI Task 2 and QueryNER, we used BERT (BERT-base) (Devlin et al., 2019) as the baseline model with a learning rate of 3e-5 following Devlin et al. (2019). BERT training followed the conventional sequence classification formulation for the ESCI task and token classification for the QueryNER task.

For the generative tasks, review summarization and product description generation, we chose T5 (T5-base) (Raffel et al., 2020) as the baseline model. T5 and BERT are fine-tuned with the default parameters released by Hugging Face (Wolf et al., 2020). Because in-context learning has been shown to be effective (Dong et al., 2024), we also include Mixtral 8x22B (Jiang et al., 2024) as a state-of-the-art zero-shot and few-shot baseline.

4.2. Mixed-Task Training

Mixing tasks (datasets) for fine-tuning large language models (LLMs) can enhance a model’s performance, generalization ability, and adaptability, which closely mirrors industrial applications: the trained model is required to accomplish not one task but several domain-specific tasks, such as query named entity recognition (NER), text summarization, description generation, and classification. Fine-tuning LLMs with mixed and diverse training datasets could help improve performance on each task.

4.3. Mixed LoRA Merging

Furthermore, thanks to the nature of the LoRA framework, mixed-task (dataset) training can also be approximated by merging LoRA weights independently trained on different tasks. Specifically, assume there are $n$ tasks indexed by $i\in\{0,\ldots,n-1\}$, and let $\mathbf{A}_{i}$ and $\mathbf{B}_{i}$ be the LoRA weights trained independently on task $i$. The adaptation equation for the mixed LoRA merge is $\mathbf{W}^{\prime}=\mathbf{W}+\frac{1}{n}\sum_{i=0}^{n-1}\mathbf{A}_{i}\mathbf{B}_{i}^{T}$. Mixed LoRA merging provides additional flexibility, since new tasks can be added later without retraining whenever a task is added. Additionally, some types of tasks may benefit each other, while others may lower another task’s performance when merged. Mixed LoRA merging enables more efficient experimentation with different combinations of tasks compared with training directly on the mixed dataset.
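A minimal sketch of the merge, applied per LoRA-targeted weight matrix, is given below; the function name is ours, and in practice a library utility (such as weighted adapter merging in peft) can achieve the same effect.

```python
# Average the task-specific low-rank updates before adding them to the frozen
# base weight: W' = W + (1/n) * sum_i A_i B_i^T.
import torch

def merge_lora(W: torch.Tensor,
               adapters: list[tuple[torch.Tensor, torch.Tensor]]) -> torch.Tensor:
    """W: (d_in, d_out) base weight; adapters: [(A_i, B_i)] with A_i (d_in, r), B_i (d_out, r)."""
    n = len(adapters)
    delta = sum(A @ B.t() for A, B in adapters) / n   # (d_in, d_out)
    return W + delta
```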

4.4. Implementation Details

The foundation model used in all LoRA fine-tuning experiments was Llama-2-7b, chosen for its moderate size and considerable performance on various tasks (Touvron et al., 2023). Supervised fine-tuning was conducted for 3 epochs on each dataset, as we did not see performance gains from training longer. Since all four tasks were formulated as text generation (see Section 3.5), we followed the most common hyperparameters for fine-tuning an LLM in the text generation setup. Specifically, the model was loaded in 8-bit, the maximum input length was set to 1500, and parameter precision was set to bf16. During supervised fine-tuning, we adopted LoRA (Hu et al., 2021) for efficient model training. The LoRA $\alpha$ was set to 16, with a dropout rate of 0.05. The initial learning rate was 3e-5 with a cosine learning rate scheduler. To further improve training efficiency, we used paged_adamw_8bit as the optimizer. LLM training was conducted on two NVIDIA A100 80GB GPUs.
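The sketch below reproduces this configuration with the Hugging Face transformers/peft stack; the LoRA rank and the exact trainer wiring are not stated above and are assumptions shown only for illustration.

```python
# LoRA SFT configuration sketch for Llama-2-7b (8-bit loading, bf16, alpha=16,
# dropout=0.05, lr=3e-5 with cosine schedule, paged_adamw_8bit, 3 epochs).
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"          # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=8,                      # rank: an assumption, not reported in the text
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="llama2-7b-ecommerce-lora",
    num_train_epochs=3,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    bf16=True,
)
# A standard causal-LM trainer (e.g., trl's SFTTrainer) would then consume
# training_args, the PEFT-wrapped model, and the prompt/response pairs,
# truncating inputs to the maximum length of 1500 tokens.
```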

To optimize computational time, we randomly sampled subsets from each dataset for our experiments. Specifically, for the ESCI dataset, we used samples of 5k, 10k and 50k as the training set and sampled 1k from the original test set to serve as the test set. For the QueryNER dataset, we selected 1k, 5k, and 8k (full 7,841 data samples) samples for training and used the entire original test set as the test set. In the case of the description generation dataset, we randomly chose 1k, 5k, 10k, and 25k samples for training, with 10k samples from the original test set used as the test set. Finally, for the Review Summarization dataset, we sampled 1k, 5k, 10k, and 25k (the entire 25,203 dataset) as the training set and utilized the complete original test set as the test set. All experiments were conducted using the same aforementioned hyperparameters.

5. Results

5.1. Metrics on different tasks

Since the datasets we explored contain both classification and generative tasks, we use task-specific metrics to evaluate model performance on each dataset. For the classification tasks, we use F1 score as the evaluation metric and report both micro-average and macro-average results, while for the generative tasks, we use Rouge-1 F1 (Rouge-1) and Rouge-L F1 (Rouge-L) (Lin and Och, 2004).
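For concreteness, the scoring can be reproduced along the following lines; the library choices (scikit-learn and rouge-score) are ours for illustration, as the exact implementation is not specified above.

```python
# Micro/macro F1 for classification and ROUGE-1/ROUGE-L F1 for generation.
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

def classification_scores(y_true, y_pred):
    return {"micro_f1": f1_score(y_true, y_pred, average="micro"),
            "macro_f1": f1_score(y_true, y_pred, average="macro")}

_rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

def generation_scores(reference: str, prediction: str):
    scores = _rouge.score(reference, prediction)
    return {"rouge1_f1": scores["rouge1"].fmeasure,
            "rougeL_f1": scores["rougeL"].fmeasure}
```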

5.2. Evaluating LLM on Classification Tasks

To ensure accurate mapping of the generated text to the actual class label in the classification tasks, we implemented a simple but strict evaluation approach. For ESCI classification, the LLM output was required to match the corresponding label exactly; any deviation was counted as incorrect. For QueryNER, labels were extracted using a regular expression expecting a list of (token, BIO tag) tuples. If the LLM output deviated from the expected output, the labels were considered to be all O (no entities identified). Because we noticed the model was sometimes inconsistent with the output format, the regular expression also handles comma-separated output without parentheses. In the case of further formatting issues, or when the model does not predict labels for all tokens, we assume the model failed to generate a valid label sequence and assign all O labels to that example. Palen-Michel et al. (2021) highlighted issues with NER scoring; for clarity of scoring procedure, once the model output has been extracted from the linearized form, we use seqeval (Nakayama, 2018) with the setting that is equivalent to conlleval.
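The parsing and scoring logic can be sketched as follows; this is an illustrative reconstruction of the described procedure, not the authors’ exact script.

```python
# Map LLM output back to a BIO sequence, falling back to all-O when the output
# is malformed or incomplete, then score with seqeval (conlleval-equivalent).
import re
from seqeval.metrics import f1_score

def parse_bio(llm_output: str, n_tokens: int) -> list[str]:
    # Accept "(token, B-tag)" tuples as well as bare "token, B-tag" pairs.
    pairs = re.findall(r"\(?\s*([^,()]+?)\s*,\s*((?:[BI]-\w+)|O)\s*\)?", llm_output)
    if len(pairs) != n_tokens:
        return ["O"] * n_tokens
    return [tag for _, tag in pairs]

gold = [["B-modifier", "I-modifier", "B-core_product_type", "I-core_product_type"]]
pred = [parse_bio("(yarn, B-modifier), (swift, I-modifier), "
                  "(carrying, B-core_product_type), (case, I-core_product_type)", 4)]
print(f1_score(gold, pred))   # 1.0 for this example
```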

5.3. SFT LLMs vs Baseline PLMs

We performed SFT training of Llama2-7b on each dataset with different portions of the data to explore the impact of training data size for fine-tuning an LLM on each task.

5.3.1. Classification tasks

ESCI Classification QueryNER
Micro F1 Macro F1 Micro F1 Macro F1
BERT @5k 0.348 0.181 @1k 0.539 0.390
LLM SFT @5k 0.355 0.244 @1k 0.280 0.156
BERT @10k 0.628 0.294 @5k 0.580 0.508
LLM SFT @10k 0.397 0.213 @5k 0.553 0.398
BERT @50k 0.629 0.368 @8k 0.603 0.569
LLM SFT @50k 0.628 0.294 @8k 0.626 0.579
Mixtral 0-shot 0.571 0.199 0.145 0.063
Mixtral 3-shot 0.537 0.009 0.484 0.336
Table 2. Comparison of Llama2-7b SFT, fine-tuned BERT, and in-context learning with Mixtral 8x22B on the ESCI Task 2 and QueryNER datasets.

Table 2 compares Llama2-7b supervised fine-tuning, BERT fine-tuning, and in-context learning with Mixtral 8x22B in zero-shot and few-shot setups on the ESCI Classification and QueryNER datasets. Note that, instead of producing a distribution over classes like BERT, the LLM’s task is to generate the classification result as text. As the number of training samples increased, model performance generally increased, with a clear performance boost for the LLM as the number of training samples grew (from 10k to 50k on the ESCI Task 2 dataset and from 1k to 5k on the QueryNER dataset). In general, the LLM and BERT performed comparably on these classification tasks when given sufficient training data.

In domain-specific tasks such as ESCI classification and Query NER, the application of in-context learning with very large language models like the Mixtral 8x22b often does not meet the performance benchmarks achieved through fine-tuning. Despite the introduction of extensive context, these models frequently struggle to deliver the level of accuracy required for industrial applications. This observation underscores a critical limitation: while LLMs are versatile and powerful, they may not be inherently optimized for tasks that demand high precision within a specialized domain.

In contrast, fine-tuning enables models to be specifically tailored to the intricacies of the task at hand, allowing for a deeper understanding of domain-specific patterns and nuances. As a result, training smaller, task-specific models such as BERT, particularly those employing a softmax classification layer, often emerges as a more effective strategy. These models not only demonstrate superior performance but also offer advantages in computational efficiency, making them more suitable for deployment in resource-constrained industrial environments where both accuracy and efficiency are paramount.

5.3.2. Generation task

Review Summarization Desc. Generation
Rouge-1 Rouge-L Rouge-1 Rouge-L
T5 @1k 0.155 0.137 0.239 0.216
LLM SFT @1k 0.158 0.147 0.262 0.244
T5 @5k 0.162 0.147 0.241 0.223
LLM SFT @5k 0.182 0.161 0.249 0.232
T5 @10k 0.163 0.150 0.238 0.221
LLM SFT @10k 0.186 0.162 0.258 0.241
T5 @25k 0.169 0.158 0.232 0.215
LLM SFT @25k 0.187 0.165 0.237 0.222
Mixtral 0-shot 0.099 0.090 0.248 0.229
Mixtral 3-shot 0.144 0.131 0.274 0.254
Table 3. Comparison of Llama2-7b SFT with fine-tuned T5 and zero-/few-shot Mixtral 8x22B on the Review Summarization and Product Description Generation datasets.

Table 3 shows the performance comparison of text generation tasks on the review summarization and description generation datasets. Similar to the classification tasks, there was a significant increase in model performance as more data samples were used for training. Notably, the LLM consistently outperformed the conventional T5 model across both datasets. This superior performance can be attributed to the LLM’s larger model capacity and enhanced quality of pre-training.

Fine-tuned models (Llama2-7b and T5) outperformed the zero-shot capabilities of the much larger Mixtral 8x22B model in review summarization, while for description generation the performance was comparable. Despite the Mixtral model’s strong standing on LLM leaderboards such as the Open LLM Leaderboard (Fourrier et al., 2024), which suggests competitive summarization abilities, the observed gap between zero-shot and few-shot performance highlights a key limitation: without in-context guidance, the model struggles on domain-specific tasks such as review summarization. When in-context examples are provided, the model demonstrates significantly improved outcomes. To reach even higher levels of performance on review summarization, task-specific training becomes crucial; notably, fine-tuning can yield superior results even with smaller model architectures (here, Llama2-7b).

In contrast, the description generation task is more aligned with general-purpose text generation, where the model’s ability to understand and leverage general knowledge is the primary factor in determining performance. Consequently, in this task, larger models like Mixtral, equipped with in-context guidance, could achieve top-tier performance, even surpassing fine-tuned smaller models.

QueryNER Review Summ. Desc. Generation
Micro F1 Macro F1 Rouge-1 Rouge-L Rouge-1 Rouge-L
QueryNER @ 5k 0.553 0.398 - - - -
+ Summ. LoRA @ 5k 0.002 0.232 0.192 0.164 - -
+ Desc. Generation @ 5k 0.018 0.344 - - 0.251 0.233
ESCI Review Summ. Desc. Generation
ESCI @ 5k 0.355 0.244 - - - -
+ Summ. LoRA @ 5k 0.145 0.174 0.184 0.156 - -
+ Desc. Generation LoRA @ 5k 0.137 0.299 - - 0.239 0.221
Table 4. Results of merging LoRA for different pairs of tasks.

5.4. LoRA Merge

We experimented with merging LoRA weights for different pairs of tasks, using the LoRA weights from the 5k training set size and averaging the two sets of weights. The results in Table 4 show that when weights trained on a task requiring a more rigid output structure, such as ESCI classification or QueryNER, are merged with weights from a generation task, performance on the structured task suffers. However, the performance of description generation and review summarization remained comparable to the performance from independent training with the same number of examples.

Upon reviewing model output, we found that at least a portion of this degradation on the tasks requiring a strict output format was attributable to the output formatting requirements, so some of the apparent degradation is not as severe as the scores suggest.

In Section C of the appendix, we show an example of model output for the QueryNER task where the model did output BIO labels, some of them correct, but in a format unrecognized by the scoring script.

5.5. Comparison with Mix Tasks Training

ESCI Class. QueryNER Review Sum. Desc. Gen.
Micro F1 Micro F1 Rouge-L Rouge-L
Indep. train 0.355 0.553 0.161 0.232
Mix train 0.342 0.455 0.163 0.218
Mix LoRA 0.251 0.001 0.145 0.174
Table 5. Comparison of independent single-task training, mixed-task LoRA training with Llama2-7b, and mixed LoRA merging for Llama2 at 5k examples per task.

We merged the LoRA (Low-Rank Adaptation) weights from all four tasks for inference and performance analysis on each respective test set. The weights were taken from training with 5k samples from each dataset to ensure a balanced contribution of information. To establish a fair comparison between training with mixed data and mixing through LoRA weight merging, we trained the foundation model (Llama-2-7b) on 5k samples from each dataset, for a total of 20k training samples, using the same hyperparameters as in the previous experiments.

The results, detailed in Table 5, indicate that mixed dataset training resulted in lower performance compared to training each task independently. This performance decrease was attributed to the limited capacity of the LoRA adapter and the distinct nature of each task, which degraded the model’s ability to consistently produce outputs in the correct format, especially in the classification tasks. Furthermore, the LoRA weight merging approach generally showed inferior performance compared to both mixed dataset training and independent task training. In mixed LoRA merging, the classification tasks particularly suffered, with output format issues noted (see Appendix C). However, for the text generation tasks, the performance remained competitive.

6. Conclusion

In this paper, we explored the potential of LLMs for addressing common e-commerce tasks, benchmarking against conventional industrial models like BERT and T5. We collected four e-commerce datasets, covering classification and text generation tasks, and adapted both task types to a text generation framework for LLMs. Our findings reveal that: (1) Zero-shot inference with a larger LLM is not a clear win; smaller models fine-tuned on a specific task can achieve better task-specific performance. (2) LLMs require a certain volume of training data to reliably produce the correct formats in classification tasks, yet they achieve competitive performance against the conventional BERT baseline and surpass the T5 baseline in text generation tasks. (3) Mixed-task training appears to perform slightly worse than, but comparable to, independent training. (4) While LoRA merging did not consistently maintain output format for classification tasks, merging could still deliver competitive performance on individual text generation tasks. (5) Few-shot performance appears better than zero-shot inference or fine-tuning with very small amounts of data, but in all tasks except description generation, fine-tuning a model with enough data still performed better. Overall, we demonstrated that for specific tasks in the e-commerce domain, out-of-the-box zero-shot LLM inference, like our use of Mixtral 8x22B, does not outperform fine-tuning approaches on target tasks, and there are opportunities for further exploration of adapting LLMs to the e-commerce domain.

References

  • Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. 2023. The falcon series of open language models. arXiv preprint arXiv:2311.16867 (2023).
  • Alzahrani et al. (2024) Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, and Haidar Khan. 2024. When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards. arXiv:2402.01781 [cs.CL]
  • Anil et al. (2023) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. PaLM 2 Technical Report. arXiv:2305.10403 [cs.CL]
  • Bražinskas et al. (2021) Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2021. Learning Opinion Summarizers by Selecting Informative Reviews. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 9424–9442. https://doi.org/10.18653/v1/2021.emnlp-main.743
  • Chan et al. (2019) Zhangming Chan, Xiuying Chen, Yongliang Wang, Juntao Li, Zhiqiang Zhang, Kun Gai, Dongyan Zhao, and Rui Yan. 2019. Stick to the Facts: Learning towards a Fidelity-oriented E-Commerce Product Description Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 4959–4968. https://doi.org/10.18653/v1/D19-1501
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Dong et al. (2024) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL] https://arxiv.org/abs/2301.00234
  • Fourrier et al. (2024) Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. 2024. Open LLM Leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
  • Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.10256836
  • Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  • Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  • Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
  • Hu et al. (2023) Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. 2023. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5254–5276. https://doi.org/10.18653/v1/2023.emnlp-main.319
  • Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of Experts. arXiv:2401.04088 [cs.LG] https://arxiv.org/abs/2401.04088
  • Koto et al. (2022) Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2022. Can Pretrained Language Models Generate Persuasive, Faithful, and Informative Ad Text for Product Descriptions?. In Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5). Association for Computational Linguistics, Dublin, Ireland, 234–243. https://doi.org/10.18653/v1/2022.ecnlp-1.27
  • Li et al. (2023) Yangning Li, Shirong Ma, Xiaobin Wang, Shen Huang, Chengyue Jiang, Hai-Tao Zheng, Pengjun Xie, Fei Huang, and Yong Jiang. 2023. EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. arXiv preprint arXiv:2308.06966 (2023).
  • Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). 605–612.
  • Malinka et al. (2023) Kamil Malinka, Martin Peresíni, Anton Firc, Ondrej Hujnák, and Filip Janus. 2023. On the educational impact of chatgpt: Is artificial intelligence ready to obtain a university degree?. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1. 47–53.
  • Nakayama (2018) Hiroki Nakayama. 2018. seqeval: A Python framework for sequence labeling evaluation. https://github.com/chakki-works/seqeval Software available from https://github.com/chakki-works/seqeval.
  • OpenAI et al. (2023) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Palen-Michel et al. (2021) Chester Palen-Michel, Nolan Holley, and Constantine Lignos. 2021. SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems. Association for Computational Linguistics, Punta Cana, Dominican Republic, 40–50. https://doi.org/10.18653/v1/2021.eval4nlp-1.5
  • Palen-Michel et al. (2024) Chester Palen-Michel, Lizzie Liang, Zhe Wu, and Constantine Lignos. 2024. QueryNER: Segmentation of E-commerce Queries. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue (Eds.). ELRA and ICCL, Torino, Italia, 13455–13470. https://aclanthology.org/2024.lrec-main.1178
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
  • Reddy et al. (2023) Chandan Reddy, Nurendra Choudhary, Lluis Marquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, and Arnab Biswas. 2023. Shopping queries dataset: A large-scale ESCI benchmark for improving product search. arXiv preprint arXiv:2206.06588. https://arxiv.org/pdf/2206.06588.pdf
  • Singhal et al. (2022) Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. 2022. Large Language Models Encode Clinical Knowledge. arXiv:2212.13138 [cs.CL]
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Trautmann et al. (2022) Dietrich Trautmann, Alina Petrova, and Frank Schilder. 2022. Legal prompt engineering for multilingual legal judgement prediction. arXiv preprint arXiv:2212.02199 (2022).
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Brussels, Belgium, 353–355. https://doi.org/10.18653/v1/W18-5446
  • Wang et al. (2017) Jinpeng Wang, Yutai Hou, Jing Liu, Yunbo Cao, and Chin-Yew Lin. 2017. A Statistical Framework for Product Description Generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, 187–192. https://aclanthology.org/I17-2032
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4791–4800. https://doi.org/10.18653/v1/P19-1472
  • Zhang et al. (2024) Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method. arXiv:2402.17193 [cs.CL]
  • Zhang et al. (2019) Tao Zhang, Jin Zhang, Chengfu Huo, and Weijun Ren. 2019. Automatic generation of pattern-controlled product description in e-commerce. In The World Wide Web Conference. 2355–2365.
  • Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2024).

Appendix

Appendix A Example samples from each Dataset

Task Prompt Input Response
ESCI Classification Query: revent 80 cfm Product: product_title: Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan product_brand: Panasonic product_color: White product_bullet_point: WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air Designed to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace This Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan 0.35 amp Irrelevant
QueryNER yarn swift carrying case (yarn, B-modifier), (swift, I-modifier), (carrying, B-core_product_type), (case, I-core_product_type)
Description Generation Title: Delta Electronics RAD80L BreezRadiance 80 CFM Heater/Fan/Light Combo White (Renewed) Brand: DELTA ELECTRONICS (AMERICAS) LTD. Color: White Bullet points: Quiet operation at 1.5 sones Built-in thermostat regulates temperature. Energy efficiency at 7.6 CFM/Watt Precision engineered with DC brushless motor for extended reliability, this fan will outlast many household appliances Galvanized steel construction resists corrosion Duct: Detachable 4-inch Plastic Duct Adapter This pre-owned or refurbished product has been professionally inspected and tested to work and look like new. How a product becomes part of Amazon Renewed, your destination for pre-owned, refurbished products: A customer buys a new product and returns it or trades it in for a newer or different model. That product is inspected and tested to work and look like new by Amazon-qualified suppliers. Then, the product is sold as an Amazon Renewed product on Amazon. If not satisfied with the purchase, renewed products are eligible for replacement or refund under the Amazon Renewed Guarantee.
Review Summarization I expected a lot more from this product. When something says it will ””thicken”” or give lots of body & fullness, most products deliver. This one does not. I have very fine thin hair & this barely gave me an ounce of body. Very disappointed & not worth the $. =(I purchased at a salon before. This bottle from Amazon, was not as thick. Like it had been watered down. I wouldn’t buy again. I was buying this at a local spa but they stopped stocking it so I did what every desperate person does-went to Amazon to see if they carried it. I’ve been using this product for years and it solves my thinning-hair problem perfectly with no stiffness and no nasty smell.My favorite blow dry gel. Soft and natural looking. Smells fresh and nice. A good budget spray gel option for those who want a quality hold without the higher price tag.
Table 6. Instructions for each dataset to guide the SFT.

In Table 6, an example from each dataset is shown.

Appendix B Prompt

Task Template
ESCI Classification Act as an e-commerce expert. Given a query and a related product retrieved for this query, the goal of this task is to classify the product as being an Exact, Substitute, Complement, or Irrelevant match for the query. Query: {The input query string} Product: {The product title} … Please answer with label “Exact”, “Substitute”, “Complement” or “Irrelevant” only.
QueryNER Act as an e-commerce expert. Identify the entities in the following query. Use BIO tagging format as if this is a named entity recognition or chunking task. The format should be tuples of the token followed by the tag. For example: (air, B-modifier), (tight, I-modifier), (containers, B-core_product_type), (for, B-modifier), (food, I-modifier). The entity types are: UoM, color, condition, content, core_product_type, creator, department, material, modifier, occasion, origin, price, product_name, product_number, quantity, shape, time Query: {The input query string}
Descr. Generation Act as an e-commerce expert. Given the following information about a product, generate a description for the product. Title: {Product Title} Brand: {Brand} Color: {Color} Bullet points: {bullets}
Review Summarization Act as an e-commerce expert. Summarize the following product reviews.
Table 7. Prompt templates used with each e-commerce task.

In Table 7, prompt templates for each task are shown.

Appendix C Improperly Formatted LLM Output

An example of model output for the QueryNER task is shown in Table 8, where the model did output BIO labels, some of them correct, but in a format unrecognized by the scoring script.

Prompt
Act as an e-commerce expert. Identify the entities in the following query. \nUse BIO tagging format as if this is a named entity recognition or chunking task. \nThe format should be tuples of the token followed by the tag. \nFor example: (air, B-modifier), (tight, I-modifier), (containers, B-core_product_type), (for, B-modifier), (food, I-modifier). \nThe entity types are: \nUoM, color, condition, content, core_product_type, creator, department, material, modifier, occasion, origin, price, product_name, product_number, quantity, shape, time \nQuery: sheets for queen size bed sale
Label
(sheets, B-core_product_type), (for, B-modifier), (queen, B-UoM), (size, I-UoM), (bed, B-modifier), (sale, B-modifier)
Output
sheets (B-core_product_type), (for, B-modifier), (queen, I-modifier), (size, I-modifier), (bed, I-modifier), (sale, B-modifier)
Table 8. Example of format discrepancy in the output from the LLM when merging the LoRA adaptor trained on the QueryNER and Description Generation datasets.