Generating Regular Expressions Using Large Pre-trained Language Models
Step-by-step Regex Generation via Chain of Inference
Abstract
Automatically generating regular expressions from natural language descriptions (NL2RE) has emerged as a promising way to help both engineers and end users, yet improving its precision and interpretability remains a major challenge. Prior work treats a regex as a linear sequence of tokens and generates the final expression autoregressively in a single pass, ignoring the step-by-step text-matching process that underlies the result; this significantly limits both the efficacy and the interpretability of regex generation with neural language models. In this paper, we propose NeRGic, an approach based on a large pre-trained language model (PLM) that decomposes regex generation into chains of interpretable, step-by-step inference, using T5 as the backbone model. To enhance robustness, we further introduce a self-consistency decoding mechanism that ensembles multiple outputs sampled from different models. Experimental studies on two public benchmarks show that NeRGic clearly outperforms previous methods and achieves state-of-the-art performance. In particular, the top-5 DFA-EQ accuracy of NeRGic surpasses that of Deep-RegexMML and SoftRegex by 16.3% and 14.7% on NL-RX-Turk and KB13, respectively.
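As a concrete illustration of the self-consistency idea and the semantic (DFA-EQ) comparison mentioned above, the following Python sketch selects a final regex from a set of sampled candidates by grouping them into semantic-equivalence clusters and returning a representative of the largest cluster. The candidate list, the function names, and the bounded-enumeration equivalence check (a cheap approximation of a true DFA-equivalence test) are all illustrative assumptions for this sketch, not part of NeRGic's implementation.

    import itertools
    import re

    def approx_equivalent(r1: str, r2: str, alphabet: str = "abc01", max_len: int = 4) -> bool:
        """Approximate semantic equivalence by comparing match behaviour on all
        strings up to max_len over a small alphabet (a stand-in for a full
        DFA-equivalence check)."""
        p1, p2 = re.compile(r1), re.compile(r2)
        for n in range(max_len + 1):
            for chars in itertools.product(alphabet, repeat=n):
                s = "".join(chars)
                if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                    return False
        return True

    def self_consistency_select(candidates: list[str]) -> str:
        """Group sampled regexes into semantic-equivalence clusters and return a
        representative of the largest cluster (majority vote)."""
        clusters: list[list[str]] = []
        for cand in candidates:
            for cluster in clusters:
                if approx_equivalent(cand, cluster[0]):
                    cluster.append(cand)
                    break
            else:
                clusters.append([cand])
        return max(clusters, key=len)[0]

    # Example: three sampled candidates, two of which are semantically identical,
    # so the majority cluster wins.
    samples = ["(a|b)+", "(b|a)+", "a+"]
    print(self_consistency_select(samples))  # -> "(a|b)+"

Voting over semantic-equivalence clusters rather than surface strings means that syntactically different but behaviourally identical samples reinforce each other, which is the intuition behind ensembling multiple sampled outputs.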
Index Terms:
Regular Expression Generation, Large PLM, Interpretability, Chain of Inference, Self-Consistency Decoding