
Domain Knowledge Graph Construction Via A Simple Checker

Yueling (Jenny) Zeng and Li-C. Wang
University of California, Santa Barbara
Department of Electrical and Computer Engineering, Santa Barbara, CA 93106-9560, USA
[email protected], [email protected]
Abstract

With the availability of large language models, there is growing interest among semiconductor chip design companies in leveraging these technologies. For those companies, deployment of a new methodology must include two important considerations: confidentiality and scalability. In this context, this work tackles the problem of knowledge graph construction from hardware-design domain texts. We propose an oracle-checker scheme to leverage the power of GPT3.5 and demonstrate that the essence of the problem lies in distillation of the domain expert's background knowledge. Using the RISC-V unprivileged ISA specification as an example, we explain the key ideas and discuss the practicality of our proposed oracle-checker approach.

1 Introduction

The release of large language models (LLMs), e.g., Brown et al. (2020); Bommasani et al. (2021); OpenAI (2023), has motivated many semiconductor chip design companies to license the technology for in-house use and explore various applications. One interesting application area is better organizing the tremendous amount of text data accumulated over the years. For a semiconductor company, these data can include specifications, test plans, bug reports, meeting notes, and so on. They are collected across generations of designs and from various stages of a design cycle. There is a constant need to find ways to improve within-the-company knowledge accumulation and sharing over those data.

Figure 1: An example of KG construction using a prompt to GPT3.5. The input is a paragraph from the RISC-V ISA specification and the output is in RDF TTL format.

In this context, we study the feasibility of using an LLM for knowledge graph construction. There are two important considerations. To protect proprietary information, the LLM has to be used in an in-house environment closed off from the outside world. Because many of those companies might not maintain sufficient resources to conduct effective fine-tuning or retraining of an LLM, it is more practical to assume that the LLM is used as it is. Furthermore, it is also desirable that the knowledge graph construction process follow an iterative fashion, where the graph is expanded gradually as more text items are processed.

We therefore consider our knowledge graph construction (KGC) as the following: Given an ordered sequence of text items $T_1, \ldots, T_n$, KGC processes one $T_i$ at a time from $1$ to $n$ and generates an individual knowledge graph (KG): $g_i = KGC(T_i)$. Let $G_{i-1}$ be the KG after merging all $g_1, \ldots, g_{i-1}$. We obtain $G_i = MERGE(G_{i-1}, g_i)$.
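For illustration, this iterative process can be sketched in a few lines of Python. This is a minimal sketch, not our actual implementation; the helper names kgc_via_llm and merge_graphs are hypothetical placeholders for prompting the LLM and for a set-union style merge of triples.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def kgc_via_llm(paragraph: str) -> Set[Triple]:
    """Hypothetical wrapper: prompt the LLM to emit an RDF for one text item
    and parse the output into a set of triples."""
    raise NotImplementedError

def merge_graphs(g_accum: Set[Triple], g_new: Set[Triple]) -> Set[Triple]:
    """A simple MERGE: the union of triples; shared entities connect the graphs."""
    return g_accum | g_new

def build_kg(text_items: List[str]) -> Set[Triple]:
    G: Set[Triple] = set()
    for t_i in text_items:        # process T_1, ..., T_n one at a time
        g_i = kgc_via_llm(t_i)    # g_i = KGC(T_i)
        G = merge_graphs(G, g_i)  # G_i = MERGE(G_{i-1}, g_i)
    return G
```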

In this work, we use the RISC-V unprivileged ISA specification ("the Spec") Waterman and Asanović (2019) as an example. We consider each paragraph as a text item. We choose GPT3.5 OpenAI as our LLM. For implementing the $KGC$ step, we rely only on prompting the LLM.

Figure 1 shows an example of the KGC. The KG is represented in the Turtle Terse RDF Triple language (RDF TTL) Beckett and Berners-Lee (2011). The specific prompt in use is shown in the figure. The input paragraph can be seen in two parts: (1) definition of the SLTI instruction, and (2) definition of the SLTIU instruction.

Notice that the second sentence starts with "SLTIU is similar", referring back to the first sentence. The RDF output is shown as four "Facts". The most interesting aspect of the result is shown in "Fact 2" for SLTIU, where the RDF duplicates all the predicates used in "Fact 1", i.e., compareAgainst, comparesSignedNumbers, comparesUnsignedNumbers, and comparesWith. This indicates that the LLM does understand the phrase "is similar" and reflects this understanding by copying the RDF representation of SLTI.

1.1 Terminology

We will use several terms in this paper to aid the discussion. Refer back to Figure 1. An RDF output is given as multiple rdf blocks. We call each rdf block a Fact. Each Fact starts with a subject entity. For example, SLTI, SLTIU, Immediate, and Register are subject entities. A Fact represents a set of triples, each in the form $(subject, predicate, object)$. An object entity is one that appears as the object of a triple and is not a subject entity. We also differentiate between two types of predicates. For a triple, if both its subject and object are subject entities, we call the predicate a relation. Otherwise, we call it a feature. For example, in Figure 1 compareAgainst is a relation and comparesSignedNumbers is a feature.
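To make these definitions concrete, the following Python sketch partitions the predicates of a triple set into relations and features. The helper name and the example triples are illustrative only (the triples loosely paraphrase Figure 1, not the exact GPT3.5 output).

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def classify_predicates(triples: Iterable[Triple]) -> Tuple[Set[str], Set[str]]:
    """Split predicates into relations (object is also a subject entity)
    and features (object is not a subject entity)."""
    triples = list(triples)
    subject_entities = {s for s, _, _ in triples}
    relations, features = set(), set()
    for _, p, o in triples:
        (relations if o in subject_entities else features).add(p)
    return relations, features

# Illustrative triples loosely based on Figure 1 (not the exact GPT3.5 output):
example = [
    ("SLTI", "compareAgainst", "Immediate"),
    ("SLTI", "comparesSignedNumbers", "true"),
    ("Immediate", "hasBitWidth", "12"),
]
print(classify_predicates(example))  # compareAgainst is a relation; the rest are features
```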

For simplicity, we use the term “RDF” to refer to the RDF output given by GPT3.5.

1.2 Background Facts (BFs)

While Figure 1 shows some encouraging results, we notice that the RDF misses some details in the original paragraph. For example, it does not differentiate between the usages of the two registers "rd" and "rs1".

Figure 2: Improved RDF obtained by repeating the example in Figure 1 and by supplying background facts (BFs).

Figure 2 shows another RDF obtained by using a different prompt. This prompt adds a list of background facts (BFs) we manually created. The RDF shows an improvement. In particular, the RDF shows that the instruction compares register "rs1" with the signExtendedImmediate, and placesValueIn "rd". It even includes the detail that the comparison is "<" (less than).

1.3 Focus of The Work

The example above shows that GPT3.5 alone can be used for KGC and produce a reasonably good result. Adding BFs can help improve the result further. It looks promising. However, it turned out that without adding BFs, the example was one of the few easy cases we encountered. Others were not as easy. Without BFs, an output RDF could be unsatisfactory for two reasons (see examples in Appendix A.1): (1) the RDF failed the syntactic check, or (2) the RDF passed the syntactic check but either got some fact(s) wrong or entirely missed the main fact(s) described in the paragraph. In our experiment, we estimated that at least 70% of the cases were in these two categories.

Figure 2 shows that adding BFs can influence the behavior of GPT3.5 for KGC. It is then interesting to see whether we can rely solely on adding BFs to reach a satisfactory RDF for every paragraph. Ideally, we would like to turn the KGC process for every hard case into an easy case (like Figure 2). This work studies whether this idea is feasible.

1.4 Contributions

In this work we show that the simple approach of calling GPT3.5 with BFs can be sufficient. The contributions can be viewed in three aspects:

(1) We propose an Oracle-Checker (OC) scheme to utilize GPT3.5. In this scheme, we restrict ourselves to a simple checker for practicality reasons. A simple checker means that the checker does not involve sophisticated techniques (e.g., syntactic or semantic analyses, embedding training, etc.). A simple checker promotes practicality because it is easier for field engineers to adopt.
(2) With our OC scheme, we show that GPT3.5 is sufficient for the KGC task. The key is to “program” the GPT3.5’s KGC process with proper BFs. Because these BFs are manually prepared, the essence of the KGC process can be seen as distillation of those BFs, i.e. the required domain knowledge.
(3) Through experiments, we observe and summarize findings regarding the GPT3.5’s KGC process. Our findings can help others who desire to use GPT for a similar domain KGC task.

2 Background and Related Work

As far as we know, we are among the first to tackle the problem of KGC for unstructured text data in the semiconductor chip design domain. Documents in such a company often involve terminologies not known to the outside world. Hence, it is intuitive to think that KGC for those texts requires a domain person to provide some background knowledge to at least cover those terminologies. This thinking motivated us to take the view of supplying BFs.

Knowledge graph construction is a rich field with many proposed techniques Yan et al. (2018); Ye et al. (2022). Conventional methods for constructing knowledge graphs follow a pipeline of NLP sub-tasks Luan et al. (2018) such as entity recognition Tjong Kim Sang and De Meulder (2003), entity linking Milne and Witten (2008), relation extraction Zelenko et al. (2002), and coreference resolution Zelenko et al. (2004). Among the various tasks, named entity recognition (NER) provides a fundamental first step for domain knowledge acquisition. Standard off-the-shelf toolkits for NER Bird (2006); Manning et al. (2014); Finkel et al. (2005); Liao and Veeramachaneni (2009) combine machine learning models and rule-based components to label entities. Recent works solve the problem in an end-to-end fashion using deep learning models Mondal et al. (2021); Harnoune et al. (2021); Ye et al. (2022). However, pre-trained models often incur low accuracy since training data from the general public domain rarely cover our domain-specific patterns. It is hard to iteratively ingest domain experts' knowledge to further improve the accuracy without dedicated retraining. Fine-tuning or retraining is often not a desirable option for many hardware companies because of the tremendous effort required to create curated databases for the training tasks. Another potential route to our KGC problem is to implement customized rule-based extractors with features provided by existing constituency and/or dependency parsers Chen and Manning (2014); Zhu et al. (2013). However, in our experience the set of rules can quickly grow overly complicated, and it is difficult for the approach to scale.

Restricting the output of a generative LM to a formal representation is related to the problem of constrained semantic parsing Shin et al. (2021); Lu et al. (2021). A structured meaning representation is often chosen as the output format Wolfson et al. (2020). However, converting the meaning representation into a KG can be another potential barrier. In summary, prompting an LLM to directly generate an RDF, if practical, can bypass all the complications mentioned above. This would not be feasible without the latest developments in LLMs.

Although automatic knowledge graph construction in a specific domain remains an open challenge Yan et al. (2018), a major difference of this work is that we are not trying to propose another KGC solver. Rather, we focus on verifying the result given by such a solver, in our case GPT3.5. In other words, our work is not about being a technology provider like those surveyed in Yan et al. (2018); Ye et al. (2022). Instead, we take the perspective of a technology consumer. From this perspective, we study what features are required for a practical KGC solver.

3 The Oracle-Checker View

Our Oracle-Checker (OC) view was inspired by the theoretical model of Interactive Proofs (IP) Arora and Barak (2012). An IP system comprises a prover and a verifier. The prover is assumed to be an all-powerful machine. The verifier is a probabilistic machine. The IP approach was developed to characterize computational complexity classes.

In an IP system, the verifier interrogates the prover through a sequence of communications. At the end, the verifier is either convinced that the answer provided by the prover is correct or rejects it. An important aspect of the communications is keeping the prover honest. Because the verifier's computational power is limited, it is mostly the prover's job to make the verification task as easy as possible.

Figure 3: The proposed Oracle-Checker scheme

Our OC view differs from the IP approach, though. This is because what we consider the oracle (corresponding to the prover) has limited power in practice. In addition, our oracle is probabilistic. On the other hand, our checker still needs a way to keep the oracle "honest", i.e., a way to verify the answer provided by the oracle. However, because our oracle is not all-powerful, we can no longer expect the oracle to give a form of answer that is always easily verifiable. This makes the verification harder than in the theoretical IP model.

Figure 3 depicts our OC scheme. In our scheme, the verification side comprises two parts: a simple checker and a human verifier. The ultimate decision to accept or reject an answer (RDF) stays with the human verifier. The job of the simple checker is to analyze the answer and provide feedback to help the human verifier.

Instead of asking the oracle to make the verification task easier for the verifier, if needed we require the verifier to make the task easier for the oracle. In our OC scheme, there are only two ways the verifier can do this: (1) by providing BFs and (2) occasionally by splitting a paragraph into multiple sub-paragraphs. Note that this is the opposite of the theoretical IP model, where the task is made easier for the verifier. Therefore, we can think of our OC scheme as one in which the oracle is powerful but not all-powerful, and some of the power still resides with the human verifier.
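A high-level orchestration of the scheme could look like the sketch below. All names (oracle, checker, human, and the decision fields) are hypothetical stand-ins for the components in Figure 3, not a definitive implementation.

```python
def construct_with_oc(paragraph, oracle, checker, human, max_rounds=3):
    """One OC loop per text item: query the oracle, let the simple checker
    analyze the answer, and let the human verifier accept the RDF or make the
    task easier for the oracle (supply BFs or split the paragraph)."""
    bfs = []
    for _ in range(max_rounds):
        rdf = oracle.generate_rdf(paragraph, bfs)      # prompt GPT3.5
        report = checker.analyze(rdf, paragraph, bfs)  # consistency + entailment
        decision = human.review(rdf, report)
        if decision.accept:
            return rdf
        bfs += decision.new_bfs                        # (1) provide more BFs
        if decision.split:                             # (2) split into sub-paragraphs
            return [construct_with_oc(p, oracle, checker, human, max_rounds)
                    for p in decision.sub_paragraphs]
    return None  # remains unsatisfactory after several trials; escalate
```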

3.1 Basic Requirements for Being An Oracle

We impose two requirements for an LLM to be used as an oracle in our OC scheme. First, the LLM needs to have the ability to perform a validity check on the answer it provides. For each RDF Fact, our simple checker asks the LLM to perform an entailment check, asking whether or not the Fact can be logically entailed from the paragraph (and if BFs are provided, with the BFs as well). If this check passes for every Fact in the RDF, the checker accepts. Otherwise, it rejects. Then, the checker reports the result to the human verifier for review.

The second requirement is that the oracle must be able to demonstrate a systematic behavior in $N$ repeated runs. Because of the probabilistic nature of an LLM, it is possible that in repeated runs no two answers are exactly the same. In this case, we consider the LLM to fail the systematic requirement. In our experiment, if the LLM could produce at least two identical answers in 10 repeated runs, we consider it to satisfy the systematic requirement for the given task.

4 Feasibility Study

We focus our discussion on paragraphs from the first two chapters of the RISC-V Spec. The first chapter provides a general introduction. The second chapter provides the specification of the instructions in the RV32I integer instruction set. The remaining chapters are similar to the second chapter, each providing the specification for a particular instruction set defined in RISC-V. The example in Figure 1 is from chapter 2. Because the descriptions in chapter 1 are more high-level, we expected that KGC would be more difficult for those paragraphs. However, as our analysis will show later, we find no significant difference in GPT3.5's performance between paragraphs from the two chapters.

4.1 Consistency Check

For checking the systematic requirement, our simple checker includes a consistency check. We repeat the same prompt 10 times and check whether at least two RDFs are exactly the same. Before checking consistency, we also apply an RDF syntactic check using a publicly available RDF parser Boettiger (2018). If an RDF fails the syntactic check, it is excluded from the consistency check.
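A minimal sketch of this check is given below. It uses the Python rdflib package for the syntactic check (any Turtle parser, such as the one cited above, serves the same purpose), exact string comparison for consistency, and a hypothetical query_llm call that re-issues the same prompt.

```python
from collections import Counter
import rdflib  # assumed: the Python rdflib package for parsing Turtle

def syntactically_valid(ttl_text: str) -> bool:
    """Syntactic check: try to parse the RDF TTL output."""
    try:
        rdflib.Graph().parse(data=ttl_text, format="turtle")
        return True
    except Exception:
        return False

def consistency_check(query_llm, prompt: str, n_runs: int = 10):
    """Repeat the same prompt n_runs times, keep only syntactically valid RDFs,
    and return the largest group of exactly identical answers with its size.
    The systematic requirement holds if that size is at least 2."""
    answers = [query_llm(prompt) for _ in range(n_runs)]
    valid = [a for a in answers if syntactically_valid(a)]
    if not valid:
        return None, 0  # all runs failed the syntactic check
    best, size = Counter(valid).most_common(1)[0]
    return best, size
```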

Figure 4: Results of consistency check without BFs provided; \mdblksquare: Most Consistent Group; \mdblksquare: Failed
Figure 5: Results of consistency check with BFs provided; \mdblksquare: Most Consistent Group; \mdblksquare: Failed

For paragraphs in the two chapters, Figure 4 shows the results of the consistency check as two bar charts, for chapters 1 and 2 respectively. These results were obtained without BFs. The result of each paragraph may comprise multiple colored bars. Each color represents a group of RDFs that are exactly the same. A dark bar (\mdblksquare) shows those runs failing the syntactic check. Each orange bar (\mdblksquare) denotes the largest group of RDFs that are exactly the same (i.e., the most consistent group).

Below some of the bars, there are text notes. Each note means that for the original paragraph, all 10 KGC runs failed. We then split the paragraph into multiple sub-paragraphs to be processed separately. For example, the first note, "P10(3)", indicates that paragraph 10 was split into 3 sub-paragraphs in the experiment. We will discuss this splitting strategy later. However, notice that some of the sub-paragraphs still fail the syntactic check even after the splitting.

Figure 4 demonstrates that, in general, GPT3.5 does exhibit a systematic behavior for KGC. For the failing cases, we then rely on BFs to resolve them. It is important to note that this consistency check says nothing about the quality of the resulting RDFs. That assessment is done afterwards.

4.2 The Effect of Providing BFs

Figure 5 shows the results of the consistency check after BFs are provided. For the two chapters, we had 204 BFs in total (see Appendix A.8). In Figure 5, no paragraph fails completely anymore. In the worst case, we obtained two RDFs that are exactly the same. Based on these results, we can choose the RDF from the most consistent group (the \mdblksquare group) to perform the entailment check.

4.3 Entailment Check

Figure 6: Two prompts used in the entailment check

Each entailment check is carried out with two prompts. The first prompt (Prompt A in Figure 6) asks GPT3.5 to convert a Fact into a sentence. The second prompt (Prompt B) then asks GPT3.5 whether or not the given paragraph (and background facts if available) logically entails the statement of fact. In Figure 6, Prompt B is combined with query 1 or query 2 to form two different prompts, one without BFs and the other with BFs. It should be noted that given an RDF, the entailment check is applied to each Fact individually. Recall that a Fact is an rdf block that may include multiple triples.
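The sketch below outlines this two-prompt procedure and the resulting entailment score. The prompt strings are paraphrases only (the exact wording is in Figure 6), and query_llm is a hypothetical LLM call.

```python
def fact_to_sentence(query_llm, fact_ttl: str) -> str:
    # Prompt A (paraphrased): convert one RDF Fact into a natural-language sentence.
    return query_llm(f"Convert the following RDF into a sentence of fact:\n{fact_ttl}")

def fact_entailed(query_llm, paragraph: str, sentence: str, bfs=None) -> bool:
    # Prompt B (paraphrased): does the paragraph (plus BFs, if any) logically
    # entail the statement of fact?
    context = paragraph if not bfs else paragraph + "\nBackground facts:\n" + "\n".join(bfs)
    answer = query_llm(
        "Does the following text logically entail the statement? Answer yes or no.\n"
        f"Text:\n{context}\nStatement:\n{sentence}"
    )
    return answer.strip().lower().startswith("yes")

def entailment_score(query_llm, paragraph: str, facts: list, bfs=None) -> float:
    """Entailment score = (# Facts passing the check) / N."""
    if not facts:
        return 0.0
    passing = sum(
        fact_entailed(query_llm, paragraph, fact_to_sentence(query_llm, f), bfs)
        for f in facts
    )
    return passing / len(facts)
```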

Figure 7: Results of entailment check for the two chapters of paragraphs; each chart overlays two results, from the runs without and with BFs provided; \mdblksquare: % of RDF Facts passing the check, where those Facts are obtained with no BFs provided; \mdblksquare: % of RDF Facts passing with BFs provided; \mdblksquare: with BFs provided, some Facts fail the check (mostly because they include auxiliary entities not given in the paragraph) and are bypassed after manual review.

Figure 7 summarizes the entailment check results for paragraphs from the two chapters separately. The vertical axis shows the entailment score, a value between 0 and 1: assuming an RDF contains $N$ Facts, the ratio of the number of passing Facts to $N$ is used as the score.

Each bar in Figure 7 corresponds to the result from one paragraph and can include three colors. The \mdblksquare color bars are based on the RDFs obtained without BFs. If a bar has only this color, it means that adding BFs is not necessary for passing the check. We may consider these the "easy" cases.

For those cases where the \mdblksquare color bars do not reach the score 1.0, we then rely on BFs to bring the entailment check score to 1.0. Those that do not show a \mdblksquare color bar at all are the "hard" cases. Without BFs, no Fact passes the entailment check for them.

The \mdblksquare bars then correspond to the entailment scores based on the RDFs obtained with BFs provided. Note that these bars are shown behind the \mdblksquare bars. Consequently, if an original \mdblksquare color bar already reaches the score 1.0, or if the \mdblksquare bar is shorter than the \mdblksquare bar, then the \mdblksquare bar cannot be seen.

For some paragraphs, we need the \mdblksquare bars to bring the score to 1.0. They represent Facts reported by the simple checker as failing the entailment check after BFs are provided. However, after manual review these failures are bypassed. Those failures can be divided into three categories, where the first happens most frequently and the third happens in only a very few cases. See Appendix A.2 for examples illustrating the three categories.

The first category is the creation and use of auxiliary entities in the RDF. An auxiliary entity is one that does not appear in the paragraph (nor in the BFs) and is created to facilitate describing other entities. In a sense, we can consider those auxiliary entities as BFs automatically supplied by GPT3.5. Because they are not mentioned in the paragraph or the BFs, Facts involving them fail the entailment check. However, GPT3.5's ability to add auxiliary entities (i.e., its own BFs) can be quite desirable, because it can supply BFs that our manual preparation missed.

The second category involves an entity or a predicate that is not wrong in itself. However, in the RDF the entity/predicate is specified within a particular namespace (e.g., "riscv:", "rdf"). Because the original paragraph (and the BFs) do not explicitly state their use in the namespace, this can cause the entailment check to fail.

The third category, happening in only a few cases, involves the use of namespaces other than the rdf or riscv namespaces. The Spec contains some descriptions of other ISA architectures (MIPS, SPARC, etc.). Those descriptions may result in the creation of their respective namespaces. In an RDF, entities from two namespaces might be connected through a predicate. This can cause a problem for the entailment check because the original paragraph (and the BFs) simply consider those terms as entities rather than as different namespaces.

In summary, Figures 4-7 show that it is feasible to use GPT3.5 as an oracle. Further, providing BFs can improve consistency and also help reach a satisfactory RDF for every paragraph.

Figure 8: No correlation between the size of largest group from consistency check and the entailment check score

It should be noted that higher consistency does not imply a higher likelihood of passing the entailment check. Figure 8 illustrates this point. The x-axis is the size of the largest consistent group. The y-axis is the entailment score. Every dot is a paragraph. The RDF is the one obtained without BFs provided (i.e., Figure 4). As seen, the results of the consistency check and of the entailment check show no obvious correlation.

5 Analysis of Results

We let BF$_\phi$ refer to the KGC without BFs, and BF$_A$ refer to the KGC with BFs added. Further, BF$_\phi$-Pass refers to those paragraphs whose RDFs pass the entailment check, and BF$_\phi$-Fail refers to those failing the check. Next we focus on comparing the changes of RDFs from BF$_\phi$ to BF$_A$.

There can be several metrics to measure the RDF changes from BF$_\phi$ to BF$_A$, including: (1) the number of entities (entity coverage), (2) the set of subject entities, (3) the number of triples (fact coverage), and (4) the number of predicates (relations, features, or both). In addition, we can consider those metrics separately for Facts passing and Facts failing the entailment check.
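As an illustrative sketch, these metrics can be computed from an RDF given as a set of triples; the helper name and the exact reading of what counts as an entity are our own assumptions, not a fixed specification.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]

def rdf_metrics(triples: Iterable[Triple]) -> dict:
    """Compute the four comparison metrics for one RDF given as triples."""
    triples = list(triples)
    subjects = {s for s, _, _ in triples}
    entities = subjects | {o for _, _, o in triples}  # subject and object entities
    predicates = {p for _, p, _ in triples}
    return {
        "num_entities": len(entities),      # (1) entity coverage
        "subject_entities": subjects,       # (2) the set of subject entities
        "num_triples": len(triples),        # (3) fact coverage
        "num_predicates": len(predicates),  # (4) relations and/or features
    }
```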

From the KGC process, in addition to consistency, we can consider another metric measuring the variation of subject entities across the 10 RDFs. We use a conformity score defined as the ratio between the sum of the individual numbers of subject entities and the size of the union set of the subject entities, across the 10 RDFs. A score of 10 means every RDF uses the same set of subject entities. A score of 1 means every RDF uses a completely different set. We found no obvious trend in the changes of conformity scores from BF$_\phi$ to BF$_A$. We also found no obvious correlation between the conformity score changes and the changes in the number of entities from BF$_\phi$ to BF$_A$ (see Section A.3).
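A small helper capturing this definition is shown below (a sketch, assuming each RDF is reduced to its set of subject entities).

```python
from typing import Iterable, Set

def conformity_score(subject_entity_sets: Iterable[Set[str]]) -> float:
    """Ratio of the summed per-RDF subject-entity counts to the size of their
    union. With 10 repeated runs, 10 means every RDF uses the same set of
    subject entities; 1 means every RDF uses a completely different set."""
    sets = list(subject_entity_sets)
    union = set().union(*sets) if sets else set()
    total = sum(len(s) for s in sets)
    return total / len(union) if union else 0.0

# Example: three identical runs over {"SLTI", "SLTIU"} give a score of 3.0.
print(conformity_score([{"SLTI", "SLTIU"}] * 3))
```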

In general, we found the number of predicates to be a less effective metric. This is because the same predicate can be used in many triples. Hence, the number of predicates used in an RDF reveals little about each predicate's usage frequency.

Note that for a given metric, separating its numbers based on passing and failing the entailment check can be difficult. This is because the entailment check is applied to a Fact as a whole. As a result, pass/fail applies to the entire Fact and may not reflect the pass/fail of a particular triple, entity, or predicate.

We did find that separating the scenario BF$_\phi$-Pass from the scenario BF$_\phi$-Fail was helpful, enabling us to observe some interesting trends.

5.1 Entity Coverage

Figure 9: Comparing entity coverage from BF$_\phi$ to BF$_A$ for the two scenarios: BF$_\phi$-Pass and BF$_\phi$-Fail

Figure 9 shows an interesting trend. For most of the cases in the BF$_\phi$-Pass category, BF$_A$ produces an RDF that contains more entities than the corresponding BF$_\phi$ RDF, i.e., improving the entity coverage. Intuitively, BF$_A$ keeps most of the entities from BF$_\phi$-Pass because Facts involving them already pass the entailment check (they are "correct" and there is no need to replace them).

In contrast, for those paragraphs in the BF$_\phi$-Fail category, the effect can go either way, i.e., the entity coverage can increase or decrease. The RDFs in the BF$_\phi$-Fail category include Facts that fail. BF$_A$ intends to fix those problems. This fixing can involve finding a different set of subject entities, resulting in more or fewer entities being used.

5.2 Fact Coverage

Although we generally expect that using more entities would result in more triples in the RDF, Figure 10 and Figure 11 show that this is not always the case. These figures plot the number of entities against the number of triples. An arrow shows the change from BF$_\phi$ to BF$_A$. Although we expect most cases to fall into the UU (both numbers increase) and DD (both numbers decrease) categories, we do see cases in the other two categories.

Figure 10: For those paragraphs where BF$_\phi$ passes the entailment check, adding BFs generally results in not only more entities but also more triples included in the RDF (the UU chart); the first U (D) stands for the number of entities going up (down) and the second U (D) stands for the number of triples going up (down)

There are two cases shown on the UD chart in Figure 10. The top arrow is associated with the paragraph discussed earlier in Figure 1 and Figure 2. The case associated with the second arrow is similar (see Appendix A.4 for further discussion).

Figure 11: For those paragraphs where BF$_\phi$ fails the entailment check, adding BFs generally results in RDFs falling into either the UU or the DD category; there are several exceptions in the UD and DU categories

Like the BF$_\phi$-Fail chart shown in Figure 9, where the changes from BF$_\phi$ to BF$_A$ can go both ways, Figure 11 shows that the changes when considering the number of triples can go four ways. Most still follow our expected trends (UU and DD). However, there are quite a few cases in the UD and DU categories. As BF$_\phi$'s RDFs failed the entailment check, the ways in which BF$_A$ fixed them can be diverse, resulting in the four different scenarios seen.

5.3 Root Subject Entity (RSE) Carry-Over

A root subject entity (RSE) is a subject entity that does not appear in the object field of any triple in a given RDF. Root entities can be thought of as the main topics extracted by the oracle. It is interesting to analyze how many root entities from BF$_\phi$ are kept in the RDF of BF$_A$. We use a ratio (0 to 1) to capture the extent of this RSE carry-over from BF$_\phi$.
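The sketch below computes RSEs and the carry-over ratio from triple sets; it reflects one reading of the definition (an RSE of BF$_\phi$ counts as carried over if it remains an RSE in BF$_A$'s RDF), with helper names of our own choosing.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

def root_subject_entities(triples: Iterable[Triple]) -> Set[str]:
    """RSEs: subject entities that never appear in the object field of a triple."""
    triples = list(triples)
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return subjects - objects

def rse_carry_over(triples_bf_phi: Iterable[Triple], triples_bf_a: Iterable[Triple]) -> float:
    """Fraction of BF_phi's RSEs that remain RSEs in BF_A's RDF (0 to 1)."""
    rse_phi = root_subject_entities(triples_bf_phi)
    rse_a = root_subject_entities(triples_bf_a)
    if not rse_phi:
        return 1.0  # nothing to carry over
    return len(rse_phi & rse_a) / len(rse_phi)
```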

Figure 12: % of root subject entities (RSEs) carried over from the RDF of BF$_\phi$ into the RDF of BF$_A$

Figure 12 shows the RSE carry-over, separated again by the two scenarios, BF$_\phi$-Pass and BF$_\phi$-Fail. It can be seen that for the majority of cases in the BF$_\phi$-Pass category, the RSE carry-over is 100%. This means that all RSEs used in BF$_\phi$'s RDF are kept in the RDF of BF$_A$ (though BF$_A$ may contain many other RSEs). For those cases, what BF$_A$ primarily does is expand the fact coverage, i.e., finding additional RSEs and generating more triples.

In addition to 100% carry-over, the other two cases (partial and no carry-over) can be due to various reasons. They generally indicate that BF$_A$ finds a different set of main topics for the KGC.

On the BF$_\phi$-Pass chart in Figure 12, there are four cases where the RSE carry-over is 0. Their reasons vary. Appendix A.5 provides their details for further reference.

As for the cases in BF$_\phi$-Fail, the effects of adding BFs can be diverse. Appendix A.6 provides two examples to illustrate the effects.

5.4 Merging Individual RDFs

With BFs added, the RDFs obtained by BF$_A$ for individual paragraphs share more entities, which enables more connections among the paragraphs. This finding is illustrated with additional analyses in Appendix A.7. From the perspective of merging individual RDFs into a single KG, it is interesting to note that before BFs were provided, the RDFs from BF$_\phi$ already had many shared entities (if we do not consider whether or not they pass the entailment check). BFs further added more sharing to the resulting RDFs. With this observation, we see that the primary objective of adding BFs is to fix the problems seen in the RDFs from BF$_\phi$. Improving entity sharing is secondary.

6 The Tasks of Human Verifier

There are two primary tasks for the human verifier: (1) deciding when to split a paragraph into sub-paragraphs, and (2) deciding what BFs to provide.

Empirically, splitting a paragraph can be due to three reasons: (1) When a paragraph is too lengthy (e.g., a long paragraph followed by a list of bullet points), splitting can help. This is the obvious case. (2) If a paragraph contains a question followed by the answer the author provides, then it is helpful to separate the question from the answer. (3) When a paragraph contains statements regarding other architectural domains (e.g., MIPS, SPARC, etc.), it is helpful to separate those non-RISC-V descriptions from the RISC-V descriptions.

In our experience, the results from BF$_\phi$ can be informative for choosing BFs. Further, preparing BFs can follow several guidelines: (1) For an abbreviation or convention defined earlier or somewhere else, a BF needs to be provided. If not, GPT3.5 may supply its own meaning, which can be wrong. (2) When we want to guide the KGC to focus on an entity as the main topic, we can provide a BF about the entity (see the example in Section A.6.1). (3) When the paragraph involves several concepts described in long phrases, providing BFs that emphasize each long phrase as a single concept can help reduce the KGC complexity (see the example in Section A.6). (4) When we desire to include a missing fact detail involving an entity, we can "remind" GPT3.5 by providing a BF about the entity (e.g., Figure 2). However, this might not always work, as occasionally GPT3.5 can choose to ignore a BF completely.

In general, we found that adding too many unnecessary BFs could degrade the quality of the resulting RDF. Hence, the BFs should not be excessive. We also found that we should avoid adding a BF describing a relation between two entities which are already related by the paragraph. This superimposition of relations could cause the KGC process to fail or even enter a repeated loop.

7 Limitations

As a first step, our work shows the feasibility of the proposed OC scheme. We have not demonstrated how to improve its automation. Manual effort is required to prepare the BFs and review the entailment check results. Although we have identified several guidelines for preparing BFs, the manual process remains ad hoc at this stage.

Note that with more familiarity with the content described in the Spec, we found that preparing the BFs could be done quickly (minutes for a section). Reviewing the results from the entailment check was not a time-consuming task either. The most time-consuming task was dealing with the difficult cases where the RDF remained unsatisfactory after several trials. We could spend over an hour on each of those cases (including the GPT3.5 calling time).

8 Conclusion and Future Work

This work demonstrates the feasibility of using GPT3.5 for KGC on texts in the semiconductor chip design domain. We propose an OC scheme that uses GPT3.5 as an oracle, and the essence of the scheme is using BFs to influence the oracle's behavior. Communications with GPT3.5 go through a simple checker that imposes two checks: the consistency check and the entailment check. We consider them requirements for an LLM to be used as an oracle.

Our current OC scheme requires a human verifier whose primary job is to review the results of the entailment check and prepare BFs. To what extent this manual process can be automated and how to do it are interesting questions for further research.

References

  • Arora and Barak (2012) Sanjeev Arora and Boaz Barak. 2012. Chapter 8 Interactive Proofs in Computational Complexity: A Modern Approach. Cambridge University Press.
  • Beckett and Berners-Lee (2011) David Beckett and Tim Berners-Lee. 2011. Turtle - terse rdf triple language.
  • Bird (2006) Steven Bird. 2006. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69–72, Sydney, Australia. Association for Computational Linguistics.
  • Boettiger (2018) Carl Boettiger. 2018. rdflib: A high level wrapper around the redland package for common rdf applications.
  • Bommasani et al. (2021) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ B. Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, and et al. 2021. On the opportunities and risks of foundation models. CoRR, abs/2108.07258.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
  • Chen and Manning (2014) Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar. Association for Computational Linguistics.
  • Finkel et al. (2005) Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 363–370, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Harnoune et al. (2021) Ayoub Harnoune, Maryem Rhanoui, Mounia Mikram, Siham Yousfi, Zineb Elkaimbillah, and Bouchra El Asri. 2021. BERT based clinical knowledge extraction for biomedical knowledge graph construction and analysis. Computer Methods and Programs in Biomedicine Update, 1:100042.
  • Liao and Veeramachaneni (2009) Wenhui Liao and Sriharsha Veeramachaneni. 2009. A simple semi-supervised algorithm for named entity recognition. In Proceedings of the NAACL HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pages 58–65, Boulder, Colorado. Association for Computational Linguistics.
  • Lu et al. (2021) Yaojie Lu, Hongyu Lin, Jin Xu, Xianpei Han, Jialong Tang, Annan Li, Le Sun, Meng Liao, and Shaoyi Chen. 2021. Text2Event: Controllable sequence-to-structure generation for end-to-end event extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2795–2806, Online. Association for Computational Linguistics.
  • Luan et al. (2018) Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3219–3232, Brussels, Belgium. Association for Computational Linguistics.
  • Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.
  • Milne and Witten (2008) David Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, page 509–518, New York, NY, USA. Association for Computing Machinery.
  • Mondal et al. (2021) Ishani Mondal, Yufang Hou, and Charles Jochim. 2021. End-to-end construction of NLP knowledge graph. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1885–1895, Online. Association for Computational Linguistics.
  • OpenAI. Gpt-3.5-turbo.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
  • Paice (1990) Chris D. Paice. 1990. Another stemmer. SIGIR Forum, 24(3):56–61.
  • Shin et al. (2021) Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  • Waterman and Asanović (2019) Andrew Waterman and Krste Asanović. 2019. The risc-v instruction set manual, volume i: User-level isa, document version 20191213.
  • Wolfson et al. (2020) Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 8:183–198.
  • Yan et al. (2018) Jihong Yan, Chengyu Wang, Wenliang Cheng, Ming Gao, and Aoying Zhou. 2018. A retrospective of knowledge graphs. Frontiers of Computer Science, 12:55–74.
  • Ye et al. (2022) Hongbin Ye, Ningyu Zhang, Hui Chen, and Huajun Chen. 2022. Generative knowledge graph construction: A review. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1–17, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Zelenko et al. (2002) Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2002. Kernel methods for relation extraction. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 71–78. Association for Computational Linguistics.
  • Zelenko et al. (2004) Dmitry Zelenko, Chinatsu Aone, and Jason Tibbetts. 2004. Coreference resolution for information extraction. In Proceedings of the Conference on Reference Resolution and Its Applications, pages 24–31, Barcelona, Spain. Association for Computational Linguistics.
  • Zhu et al. (2013) Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 434–443, Sofia, Bulgaria. Association for Computational Linguistics.

Appendix A Appendix

A.1 Examples of Difficult Cases Without BFs

We use two examples to show that not all paragraphs are as easy to handle by GPT3.5 as the paragraph used in Figure 1 and Figure 2.

A.1.1 The RDF Failing The Syntactic Check

The paragraph of the first example is shown in Figure 13. Similar to the paragraph shown in Figure 1, this paragraph is also about defining several instructions. It turned out that this paragraph caused issues in the RDF syntactic check. Without BFs provided, all RDFs from 10 repeated trials failed the syntactic check (the problem came from the choice to use "XORI rd, rs1, -1" as a single entity).

Figure 13: A problematic example for GPT3.5 to construct the KG using the same prompt in Figure 1 without using any background fact, even though the paragraph is also about defining instructions like the one in Figure 1

We then provided the BFs shown in Figure 2 for "immediate", "rs1" and "rd". We used the same prompt as that shown in Figure 2. The resulting RDF is shown in Figure 14. The paragraph defined four instructions, covered by four Facts in the RDF.

Figure 14: The RDF obtained for the paragraph in Figure 13

A.1.2 Unsatisfactory RDF

Figure 15 shows another paragraph that is difficult for GPT3.5 to handle. In 10 repeated requests using the same prompt as in Figure 1, only one produced an RDF that passed the syntactic check. That result is shown in Figure 16.

Figure 15: A difficult example for GPT3.5 to construct the KG using the same prompt in Figure 1 without supplying any background fact. The main subject entities of the paragraph are the four ISA variants highlighted (RV32I, RV64I, RV128I, RV32E). A good KG constructor is required to uncover their relations and the features to describe their differences.

The RDF was not satisfactory because the paragraph says that the four variants are differentiated by two aspects: the size of the address space and the number of integer registers. The RDF represented the second aspect explicitly but said nothing about the address space. Furthermore, the paragraph did not state that RV64I and RV128I both have 32 integer registers like RV32I.

Figure 16: One RDF output given for the paragraph in Figure 15 using the same prompt as that in Figure 1

Using the same prompt as in Figure 2, we used the BFs in Figure 17 for the paragraph shown in Figure 15. The most consistent RDF is shown in Figure 18 (the consistency check was explained in Section 4.1).

Figure 17: BFs provided to obtain result in Figure 18

Compared to the result in Figure 16, the new RDF captured many more facts (i.e., triples). Notice that the four subject entities were the same between the two RDFs. The new RDF included the triple "(RV32E, subClassOf, RV32I)", which was missing previously. This is a correct relation between the 32E variant and the base 32I.

The new RDF also included the entity addressSpace, which was missing in the previous result. Instead of using the long predicate has_Number_of_Integer_Resgiters as in Figure 16, the new RDF used a more generic entity integerRegister and represented its length with XLEN in bits (as it should). It also included a comment stating that 32E has half the number of integer registers. Overall, the new RDF included many more correct facts from the paragraph.

Figure 18: The RDF result for the Figure 15 example

It was interesting to note that the paragraph in Figure 15 did not explicitly state the number of registers for the four base ISAs. This was why the RDF in Figure 18 did not include this information. In the Spec, this information was actually stated in another paragraph 62 paragraphs later. Figure 19 shows the sentence and the corresponding rdf block.

Figure 19: The number of integer registers is explicitly stated in a sentence 62 paragraphs after the paragraph shown in Figure 15

The two RDFs in Figure 18 and Figure 19 can be connected through the same entity referring to the “integer register.” Because in our KGC each paragraph (i.e. “a text item”) is handled separately, including a more generic entity in the RDF is crucial from the perspective of merging individual RDFs into a whole knowledge graph. From this perspective, Figure 16 is less desirable even if it were correct.

A.2 Examples Failing The Entailment Check

For the two charts shown in Figure 7, we used six examples (three from each chart) to illustrate what happened to the Facts represented by the \mdblksquare bars. On each chart, those \mdblksquare bars appeared in three of the last four groups from the right. For each group, we selected the first paragraph from the right as the example for the group. We started with the first paragraph from the right in the second chart as the first example, and moved left through the other five examples. For each example, we show the paragraph followed by the Facts that failed the entailment check.

A.2.1 First Example

Figure 20: The three Facts fail the entailment check because of the ISA_property entity created

The paragraph of the first example mentioned three types of impacts: on the "code size", on the "performance", and on the "energy consumption". In the RDF, those three impacts were treated as three separate subject entities. However, in order to group them together, a separate entity ISA_property was created. The entailment check fails because the paragraph did not specifically mention that the RISC-V domain involves an entity called ISA_property. In other words, ISA_property can be seen as an auxiliary entity (or concept) created to facilitate the organization of other facts in the RDF. Auxiliary entities are not explicitly specified and hence their use can fail the entailment check.

From a practical point of view, we consider automatic creation of auxiliary entities and their proper use a desirable capability of GPT3.5 for KGC. Because of this capability, we do not need to explicitly state all BFs that might be used to interpret and/or organize the entities (concepts) discussed in a given paragraph. For example, we might provide the BFs explicitly stating that the three different impacts are a type of some property to facilitate their organization. However, we did not need to because GPT3.5 automatically created them.

A.2.2 Second Example

Figure 21: In this example, Facts 1–3 failed the entailment check because of the auxiliary entities JALInstruction, ImmediateType, and PipelineConcept created, respectively. Additionally, Fact 2 failed because of the use of the rdf predicate subClassof.

The reason for the failure in the second example was similar to that of the first example discussed above. In addition, the entailment check complained about the use of the rdf predicate subClassof, i.e., there was no specific definition of the particular class hierarchy in the RISC-V domain. Again, using the rdf predicate subClassof to organize entities (concepts) into a hierarchy can actually be helpful. During our manual review, if an entailment check failed for this reason, we usually ignored it.

A.2.3 Third Example

Figure 22: In this example, Fact 1 failed the entailment check because of the auxiliary entity BranchTarget.

For the third example, the reason for failing the entailment check was the same as in the previous two examples. It was interesting to observe that the auxiliary entity BranchTarget was created to classify the term "overflow". Indeed, "overflow" was the branch target of the "blt" instruction in this case.

A.2.4 Fourth Example

Figure 23: In this example, the Facts fail because of the various namespaces (“mips:”, “sparc:”, and “isa:”) in use

The fourth example shows that the entailment check can fail due to an auxiliary namespace being created. The RDF in Figure 23 shows two namespaces created: mips: and sparc:. In order to organize these two namespaces, an auxiliary namespace isa: was used. The paragraph did not explicitly specify the namespaces or the relations of terms in those namespaces. They were created to facilitate representation of the information. We observed in our experiment that when a fact involved a relation across two namespaces, it could cause a failure in the entailment check because the original paragraph was not concerned with the namespaces created (i.e., those names were simply "concepts", not namespaces).

A.2.5 Fifth Example

Figure 24: This example shows the creation of another type of auxiliary entity, which we call "intermediate" subject entities. These entities are not mentioned in the paragraph and are created to represent the intermediate concepts that can be used to describe other entities

In the fifth example, four auxiliary subject entities Address_1, Address_2, Bits_1, and Bits_2 were created to describe other entities. It was interesting to notice that these four entities corresponded to the "first halfword address" (byte 0), "second halfword address" (byte 2), "first halfword" (bits 15-0), and "second halfword" (bits 31-16), respectively. We call them "intermediate" entities as they are created to represent intermediate concepts used to describe other entities. These auxiliary entities did not appear in the paragraph (nor in the BFs). Hence, entailment checks involving them would fail.

A.2.6 Sixth Example

Figure 25: This example shows that entailment check can fail due to not explicitly specifying a predicate in the given domain. In this case, the refersTo predicate is not explicitly specified as a predicate in the RISC-V domain

In the sixth example, the BFs provided further information about the term "trap". However, they did not explicitly define that "refers to" was a predicate specific to the RISC-V domain. This type of domain restriction can cause the entailment check to fail, i.e., the original statement talked about a general term while the RDF restricted the term to be specific to the RISC-V domain. Our manual review bypassed this type of entailment check failure.

A.3 Variations in Subject Entities

Figure 26: Conformity measuring the variation of subject entities across 10 repeated runs: BF$_\phi$ vs. BF$_A$

Figure 26 shows that there is no obvious trend in the changes of conformity scores from BF$_\phi$ to BF$_A$. Every dot in this figure represents a paragraph.

Figure 27: No correlation between the change of conformity score and the change of the number of entities after BFs added

Figure 27 further shows that there is no correlation between the changes of the conformity score and the changes of the number of entities used in the RDF, as we move from BF$_\phi$ to BF$_A$.

A.4 Case Discussion from Figure 10

Figure 28: The paragraph associated with one of the two arrows shown on the UD chart in Figure 10

For the two arrows shown on the UD chart in Figure 10, the paragraph associated with one of the arrows is the one already explained with Figure 1 and Figure 2. The other paragraph is shown in Figure 28. Figure 29 shows the corresponding RDF from BF$_\phi$ and Figure 30 shows the RDF from BF$_A$.

Figure 29: The BF$_\phi$'s RDF for the paragraph shown in Figure 28, containing 19 triples
Figure 30: The BF$_A$'s RDF for the paragraph shown in Figure 28, containing 18 triples

As seen, BF$_A$'s RDF used more entities (including those represented as strings) but one fewer triple. The two RDFs essentially captured the same facts but represented them differently.

Notice that the provided BFs enabled BF$_A$'s RDF to connect the particular exception concept, InstructionAddressMisalignedException, to the hardware thread, Hart. As the concept "hart" was discussed in other paragraphs, this connection facilitated merging of KGs between this paragraph and those other paragraphs.

A.5 Four Cases With RSE Carry-Over = 0

There are four cases whose RSE carry-over is 0 on the BF$_\phi$-Pass chart in Figure 12. We provide their details in this section. Figure 31 shows the four paragraphs and the changes of entities in use from BF$_\phi$ to BF$_A$ (from left to right, as indicated by the arrow). The BF$_\phi$ RDFs and BF$_A$ RDFs for the four cases are then shown in Figures 32, 33, (34, 35), and 36, respectively.

Figure 31: More details on the four cases whose RSE carry-over is 0 on the BF$_\phi$-Pass chart in Figure 12

The first case was a single sentence. Even without the BFs provided, GPT3.5 understood what the four base ISAs were for RISC-V. To represent the information that "the four base ISAs were distinct base ISAs", BF$_\phi$'s RDF explicitly represented the facts that they were mutually distinct. It used the relation isDistinctFrom for this purpose. The RSE was RISC-V, used to group the four base ISAs together.

Figure 32: Comparison of BF$_\phi$'s RDF and BF$_A$'s RDF for the first case in the table shown in Figure 31
Figure 33: Comparison of BF$_\phi$'s RDF and BF$_A$'s RDF for the second case in the table shown in Figure 31
Figure 34: BF$_\phi$'s RDF for the third case in the table shown in Figure 31
Figure 35: BF$_A$'s RDF for the third case in the table shown in Figure 31
Figure 36: Comparison of BF$_\phi$'s RDF and BF$_A$'s RDF for the fourth case in the table shown in Figure 31

In BF$_A$'s RDF, the four base ISAs each became an RSE. The RDF used a feature called isDistinctBaseISA to mark the four base ISAs. The information was represented implicitly, resulting in the use of different RSEs and a more compact RDF.

For the second paragraph shown in Figure 31, the main subject was the term IALIGN. BF$_\phi$'s RDF captured this main subject correctly. However, the second sentence in the paragraph mentioned that the "compressed ISA extension relaxes IALIGN to 16 bits". This fact was not captured in BF$_\phi$'s RDF. Instead, it used a more generic relation hasISAExtension to relate IALIGN to compressedISAExtension.

BF$_A$'s RDF captured the second fact by creating the subject entity compressedISAExtension and using the relation relaxes to relate the entity to IALIGN. IALIGN was still kept as a subject entity. However, because it appeared in the object field of a triple, IALIGN was no longer counted as a root subject entity. It was also interesting to notice that in BF$_A$'s RDF, the second Fact explicitly enumerated a few example values that IALIGN cannot take on. There was nothing incorrect about adding this extra information to the RDF.

Figures 34 and 35 then show the comparison for the third paragraph in Figure 31. The main subject of this paragraph was the "invisible trap". The paragraph provided three example cases where this type of trap can take place: missing instructions, non-resident page faults, and device interrupts. Comparing BF$_\phi$'s RDF to BF$_A$'s RDF, we observed that both included entities to represent the three examples. However, BF$_\phi$'s RDF attached the prefixes "emulating" and "handling" to those entities, while BF$_A$'s RDF just used the three example names as given. The RSE of BF$_\phi$'s RDF was Trap (we might see it as an instance), which was declared as an InvisibleTrap. In contrast, the three example cases were all treated as RSEs in BF$_A$'s RDF, and were all declared as a subclass of InvisibleTrap. Then, the InvisibleTrap was declared as a Trap (we might see it as a concept), which is more intuitive. Overall, we observed that BF$_A$'s RDF was easier to interpret than BF$_\phi$'s RDF.

With the BFs provided, BF$_A$'s RDF further connected Trap to EEI and provided a Fact on the ExecutionEnvrionment that was connected to EEI and hart. These connections were useful when we merged this paragraph-specific KG with the KGs from other paragraphs that provided more detail on the EEI and hart.

The fourth example is shown in Figure 36. BF$_\phi$'s RDF used RV32E as the RSE while BF$_A$'s RDF used Chapter4 as the RSE. Both RDFs contained the same information from the paragraph. BF$_A$'s RDF had extra information on RV32I due to the BFs provided.

A.6 Two Additional Interesting Cases

In this section, we discuss two more cases in detail to highlight the effect of moving from BFϕ to BFAA. These two cases were among the more challenging ones we encountered. The BFϕ’s RDFs did not pass the entailment check, and we had to carefully select the BFs to obtain a satisfactory RDF with BFAA.

Figure 37: The paragraph for the first example
Figure 38: The main body of the RDF created by BFϕ for the paragraph in Figure 37
Figure 39: The remaining RDF continuing from Figure 38

The paragraph for the first example is shown in Figure 37. This paragraph was challenging because it involved several specific concepts. Take the first phrase “avoid intermediate instruction sizes” as an example. This phrase can give rise to the concept “instruction sizes”, the concept “intermediate instruction sizes”, and the concept “avoid intermediate instruction sizes”. Choosing which concept to start with can affect the remaining KGC process.

The RDF from the BFϕ is shown in Figure 38, continuing in Figure 39. As seen, this RDF defined various verbs (“adopted”, “avoid”, “helps”, and so on) as rdf properties and used them as relations to connect the various entities. These property definitions all failed the entailment check. The RDF even created an entity called wanted to capture the fact that “intermediate instruction sizes” were something the RISC-V ISA developers wanted to avoid.

Figure 40: The main body of the RDF created by BFAA for the paragraph in Figure 37
Figure 41: The remaining RDF continuing from Figure 40

The BFs provided to BFAA included three facts specifying that “avoid intermediate instruction size”, “base hardware implementation”, and “larger number of integer registers” should each be treated as a single concept. The resulting RDF is shown in Figure 40 (continuing in Figure 41), where the three were created as subject entities, in addition to the subject entity 32_bit_instruction_size, whose equivalent form also appeared in the BFϕ RDF before (Figure 38).

In the BFAA’s RDF, instead of treating verbs as rdf properties as in the BFϕ RDF (Figure 39), three verbs (supports, helps_performance, and simplifies) were defined with “rdfs1:label”. They were then used to connect the four subject entities, which more accurately reflected the information conveyed by the paragraph.
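The sketch below contrasts the two treatments of verbs described above: the first fragment mimics the BFϕ style (a verb declared as an rdf property), the second mimics the BFAA style (a relation term carrying a label, written here with the standard rdfs prefix rather than the rdfs1 prefix chosen by GPT3.5). The subject and object entity names are illustrative placeholders.

```python
from rdflib import Graph

bf_phi_style = """
@prefix ex:  <http://example.org/riscv#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

ex:helps a rdf:Property .
ex:fixed_32bit_instruction_size ex:helps ex:performance .
"""

bf_aa_style = """
@prefix ex:   <http://example.org/riscv#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:helps_performance rdfs:label "helps performance" .
ex:fixed_32bit_instruction_size ex:helps_performance ex:base_hardware_implementation .
"""

for name, ttl in [("BFphi style", bf_phi_style), ("BFAA style", bf_aa_style)]:
    g = Graph()
    g.parse(data=ttl, format="turtle")
    print(name, sorted(str(p) for p in g.predicates()))
```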

A.6.1 The Second Example

Figure 42 shows the paragraph of the second example. The BFϕ’s RDF is shown in Figure 43, continuing in Figure 44.

Figure 42: The paragraph for the second example

Similar to the previous example, this paragraph contained phrases for which multiple choices of entities might be extracted. For example, the phrase “increasing address space size” might be represented by the phrase itself, by “address space”, or by “address space size”. As seen in Figure 43, BFϕ chose the long names increasing_address_space_size and supporting_running_existing_binaries and treated them as rdf properties (Figure 44).

The RDF in Figure 43 contained two Facts, one for mips and the other for sparc. The two were essentially represented in the same way. It was interesting to see that each Fact contained duplicated comments and labels, one with the rdf namespace and the other with the rdfs1 namespace. The text in these comments and labels was additional information provided by GPT3.5 itself. Again, this shows that GPT3.5 has the ability to provide its own background facts.

Figure 43: The main body of the RDF created by BFϕ for the paragraph in Figure 42
Figure 44: The remaining RDF continuing from Figure 43

The three property-based Facts in Figure 44 did not pass the entailment check. While the BFϕ’s RDF treated mips and sparc as the two main subjects, it did not capture what “strict superset policy” meant, other than that it was a property used to relate the ISA to supporting_running_existing_binaries.

After analyzing the paragraph more carefully, we found that its main topic was the “strict superset policy”. MIPS and SPARC were just examples of ISAs that adopt this policy. The BFϕ did not capture this point. Consequently, in BFAA we added a background fact stating that “strict superset policy” referred to a choice of ISA design. In addition, we also added a background fact about the “address space”. The resulting RDF is shown in Figure 45. This RDF is more compact than the BFϕ’s RDF shown before.

Figure 45: The BFAA’s RDF for the paragraph in Figure 42

It was interesting to notice that instead of turning the full phrase “strict superset policy” into an entity, BFAA used the entity superset_policy, which was declared as a type of DesignPolicy based on the background fact we provided. Instead of using the longer entity increasing_address_space_size as before, it now used the simpler (and more generic) entity address_space provided in our BFs. Note that “address space” was a concept frequently mentioned in other parts of the Spec. Hence, using this entity can help make connections to the KGs of other paragraphs later.

The two main facts for mips and sparc essentially said that each ISA had a design policy, called the “superset policy”, that involved using a 64-bit address space to support 32-bit binaries. This representation reflected the information conveyed by the paragraph better than the BFϕ’s RDF shown before.
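A hypothetical sketch of this more compact, policy-centered representation is shown below. The superset_policy and address_space entities and the DesignPolicy type follow the discussion above, while the relation names (adoptsPolicy, involves) are illustrative guesses.

```python
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/riscv#> .

ex:superset_policy a ex:DesignPolicy ;
                   ex:involves ex:address_space .
ex:mips  ex:adoptsPolicy ex:superset_policy .
ex:sparc ex:adoptsPolicy ex:superset_policy .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(len(g))  # 4 triples: a compact graph centered on the policy itself
```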

The two examples discussed in this section can be seen as two representative cases that are opposite to each other. The first example shows that BFAA moves from the shorter names used in BFϕ to longer names, resulting in a more complex RDF that uncovers more relations (i.e., more coverage). In the second example, BFAA moves from longer names to shorter names, resulting in a more compact RDF that focuses on the right topic and uses more generic entities (i.e., less coverage but more accuracy).

In general, the changes from BFϕ to BFAA can be diverse. These changes depend not only on how the BFs influence GPT3.5’s KGC behavior, but also on the paragraph being processed.

Figure 46: Top (BFϕ): 67 subject concepts shared by the RDFs of at least two paragraphs, with a total of 320 edges shown in the bipartite graph; note that this result includes all RDFs, even those that failed the entailment check. Bottom (BFAA): 75 concepts shared by at least two paragraphs, with the total number of edges increasing to 454. In the bipartite graphs, each bottom dot represents a paragraph from chapter 1 and each upper square represents a subject concept. A more transparent color indicates that more edges are connected between the concepts and paragraphs.

A.7 Cross-Paragraph Connections

A total of 433 and 452 subject entities were extracted by BFϕ for chapters 1 and 2, respectively (disregarding the entailment check results). In contrast, a total of 597 and 577 subject entities were extracted by BFAA for chapters 1 and 2, respectively.

To show how many paragraphs can be connected through the subject entities, we first used a simple method to group subject entities into high-level subject concepts: if subject entities shared the same suffix word, we grouped them together. For example, “CSRInstruction” and “StoreInstruction” were grouped under the high-level concept “Instruction”. The suffix can easily be split from the entity phrase since the RDF already formatted the phrases in camel or snake case. We further used a stemming tool Paice (1990) to remove morphological affixes from the suffix words so that words with the same stem would be grouped together. For example, “encodings” and “encodes” belonged to the same group.
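The sketch below illustrates this grouping step under two assumptions: names are split on camel or snake case with a simple regular expression, and NLTK’s LancasterStemmer (an implementation of the Paice/Husk stemmer) stands in for the stemming tool cited above.

```python
import re
from collections import defaultdict

from nltk.stem import LancasterStemmer  # Paice/Husk stemmer

def suffix_word(entity: str) -> str:
    """Return the last word of a camel- or snake-case entity name, lowercased."""
    spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", entity).replace("_", " ")
    return spaced.split()[-1].lower()

def group_by_concept(entities):
    """Group entities whose suffix words share the same stem."""
    stem = LancasterStemmer().stem
    groups = defaultdict(list)
    for entity in entities:
        groups[stem(suffix_word(entity))].append(entity)
    return groups

entities = ["CSRInstruction", "StoreInstruction", "Opcode_Traps", "InvisibleTrap"]
for concept_stem, members in group_by_concept(entities).items():
    # The two *Instruction entities share one group; the two trap-related
    # entities share another, assuming the stemmer maps "traps" and "trap"
    # to the same stem.
    print(concept_stem, members)
```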

For chapter 1, we collected a total of 168 and 178 subject concepts from BFϕ and BFAA, respectively. These subject concepts provided a total of 421 connections to the paragraphs for BFϕ and 557 connections for BFAA.

Figure 46 shows two bipartite graphs between subject concepts and paragraphs, one for BFϕ and the other for BFAA. An edge means that the subject concept appears in the paragraph. Only those subject concepts that connect at least two paragraphs are shown in the graphs. For BFϕ, the graph has 67 subject concepts with 320 edges. For BFAA, the graph has 75 subject concepts with 454 edges. In other words, BFAA’s RDFs have more cross-paragraph connections through the subject concepts.
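Given a mapping from subject concepts to the paragraphs whose RDFs mention them, the counts reported for these bipartite graphs can be reproduced as sketched below; the mapping shown is a tiny made-up example, not the paper’s data.

```python
# concept -> set of paragraph ids whose RDFs contain an entity grouped
# under that concept (a tiny made-up example, not the paper's data)
concept_to_paragraphs = {
    "Instruction": {1, 4, 7, 12},
    "Trap": {3, 7},
    "Hart": {5},  # appears in only one paragraph, so it is not drawn
}

# Keep only concepts shared by at least two paragraphs, as in Figures 46 and 47.
shared = {c: ps for c, ps in concept_to_paragraphs.items() if len(ps) >= 2}
num_concepts = len(shared)                          # squares in the bipartite graph
num_edges = sum(len(ps) for ps in shared.values())  # concept-paragraph edges
print(num_concepts, num_edges)                      # prints: 2 6
```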

For chapter 2, we collected a total of 211 and 191 subject concepts from BFϕ and BFAA, respectively. These subject concepts provided a total of 446 connections to the paragraphs for BFϕ and 497 connections for BFAA. Figure 47 shows similar bipartite graphs. For BFϕ, the graph has 72 subject concepts with 307 edges. For BFAA, the graph has 82 subject concepts with 388 edges. Again, BFAA’s RDFs have more connections.

Table 1 shows the top five subject concepts that have the highest connectivity (the highest number of occurrences) in Figure 46 (top). The subject entities grouped under the same concept are also listed. Table 2 shows the concepts and the corresponding entities for Figure 46 (bottom). Table 3 shows the concepts and the corresponding entities for Figure 47 (top). Table 4 shows the concepts and the corresponding entities for Figure 47 (bottom).

Figure 47: Top (BFϕ): 72 subject concepts shared by the RDFs of at least two paragraphs, with a total of 307 edges shown in the bipartite graph; note that this result includes all RDFs, even those that failed the entailment check. Bottom (BFAA): 82 concepts shared by at least two paragraphs, with the total number of edges increasing to 388. In the bipartite graphs, each bottom dot represents a paragraph from chapter 2 and each upper square represents a subject concept.
Subject Concept BFϕ Subject Entities
Instruction (23) New_Instructions, cacheControlInstruction,
fenceInstruction, StoreInstruction, IllegalInstructions
specificInstructions, Variable_Length_Instructions,
32_bit_instruction, Optional_Longer_Instructions,
LittleEndianInstruction, ErrorfulInstruction,
unprivilegedInstructions, Instruction, MachineInstruction,
optional_compressed_instruction, LoadInstruction
Trap (19) Opcode_Traps, Trap, InvisibleTrap, RequestedTrap,
FatalTrap, ContainedTrap
ISA (19) RISC-V_ISA, Base_Integer_ISA, Base_ISA, singleISA, ISA,
whyNotSingleISA, Unprivileged_ISA
Extension (16) Additional_Instruction_Set_Extension,
Specialized_Instruction_Set_Extension,
Optional_Extension, nonConformingExtension,
InstructionSetExtensions, Subsequent_Extensions,
GC_Extensions, Standard_Compressed_ISA_Extension,
StandardExtensions, StandardInstructionSetExtension,
compressed_instruction_set_extensions,
OtherExtensions, nonStandardExtension, Extensions
Execution Environment (15) Software_Execution_Environment,
Hardware_Execution_Environment,
RISC-V_Execution_Environment,
Hardware_and_Software_Execution_Environment,
BareMetalEnvironment, Execution_Environment
OperatingSystemEnvironment
Table 1: BFϕ Entities from Chapter 1 with top occurrences. The first column shows the domain suffix term (subject concept) that is shared by the entities. The number in parentheses indicates the number of paragraphs where the term occurs.
Subject Concept BFAA Subject Entities
Extension (38) Instruction_Set_Extension, Extension, nonStandardExtension,
vendorSpecificNonStandardExtension, compressed_extension,
standard_GC_extensions, Compressed_ISA_Extension,
customExtension, StandardCompressedInstructionExtension,
StandardLargerThan32BitInstructionSetExtension,
NonConformingExtension, StandardFloatingPointExtension,
openToExtensions, ISA_Extension,
OtherExtensions, has_extension
ISA (33) ISA, RISC-V_ISA, has_base_ISA, Base_Integer_ISA, unifiedISA,
not_treating_design_as_single_ISA, Unprivileged_ISA,
standard_IMAFD_ISA, singleISA, notSingleISA
Trap (33) Trap, differences_in_addressing_and_illegal_instruction_traps,
illegal_instruction_traps, NoTrap, FatalTrap, ContainedTrap,
RequestedTrap, DefinedSolelyToCauseRequestedTraps,
InvisibleTrap, OpcodeTrap
Instruction (31) RISCVBaseInstructions, MultiplicationInstruction,
DoublePrecisionInstruction, ControlFlowInstruction,
IntegerComputationalInstruction, SinglePrecisionInstruction,
DivisionInstruction, AtomicInstruction, CompressedInstruction,
LoadInstruction, Instruction, MachineInstruction
Compressed_16_Bit_Instruction, 32_Bit_Instruction,
illegal_30_bit_instructions, illegalInstruction, 16BitInstruction,
AllZeroBitsInstruction, Variable-Length_Instruction,
AllOnesInstruction, KnownErrorfulInstructions, StoreInstruction,
MissingInstruction, UnprivilegedInstructions, CommonInstruction
Execution Environment (15) HardwareAndSoftwareExecutionEnvironment,
SoftwareExecutionEnvironment,
SupervisorLevelExecutionEnvironment,
UserLevelExecutionEnvironment, OuterExecutionEnvironment,
BareMetalEnvironment, ExecutionEnvironment
Hart (15) Hart, HostHart, GuestHart, EachHart, AllHarts
Table 2: BFAA Entities from Chapter 1 with top occurrences.
Subject Concept BFϕ Subject Entities
Instruction (23) CSRInstructions, systemInstructions, CurrentInstruction, 16BitInstructions,
Jump_Instruction, Branch_Instruction, Conditional_Branch_Instruction,
Instruction, LoadUpperImmediateInstruction, RegularInstruction,
hasInstruction, IntegerComputationalInstructions,
SingleAdditionalBranchInstruction, AdditionalInstructions,
ControlTransferInstructions, unconditionalJumpInstructions,
ConditionalBranchInstruction, 16BitAlignedInstructions,
AddressOfBranchInstruction, PredicatedInstructions
Format (19) InstructionFormat, 3AddressFormat, 2AddressFormat, S_Format, R_Format,
U_Format, I_Format, B_InstructionFormat, J_InstructionFormat, B_format,
Base_Instruction_Formats, usesFormat, JTypeFormat, BTypeInstructionFormat
Register (12) numberOfRegisters, 32IntegerRegisters, IntegerRegisters, hasDestinationRegister,
placesResultInRegister, AlternateLinkRegister, RegularReturnAddressRegister,
GeneralPurposeRegister, linkRegister, TwoRegisters
ISA (9) baseIntegerISA, integerISA, completeISA, RV32I_ISA, BaseISA, ISA
Extension (9) OtherISAExtensions, InstructionSetExtension, NonConformingExtension,
CompressedInstructionSetExtension, AExtension, Extension
Table 3: BFϕ Entities from Chapter 2 with top occurrences.
Subject Concept BFAA Subject Entities
Instruction (23) CSRInstructions, ConditionalBranchInstruction, BranchInstruction,
reservedInstruction, regular_instruction, load_upper_immediate_instruction,
IntegerComputationalInstruction, JumpInstruction, AdditionalInstructions,
LoadStoreInstructions, ControlTransferInstructions,
unconditionalJumpInstructions, PredicatedInstructions,
Immediate (17) 5-bit_Immediate, Immediate, J_immediate, J_shifted_immediate,
20_bit_immediate, U_immediate, U_shifted_immediate
ISA (16) SubsetOfBaseIntegerISA, ISA, BaseIntegerISA, RISC-V_ISA,
Unprivileged_State_for_Base_Integer_ISA, Complete_Set_of_Base_Integer_ISA,
integer_ISA, baseISA
Register (15) CSRRegisters, hasNoStackPointerOrSubroutineReturnAddressLinkRegister,
usesRegisterX5AsAlternateLinkRegister, 16_registers, number_of_registers,
larger_number_of_integer_registers, integer_register,
frequently_accessed_registers, returnAddressRegister
Format (12) InstructionFormat, 2-address_format, OptionalCompressed16BiatInstructionFormat,
CoreInstructionFormat, B_format, S_format, U_format, J_format,
BaseInstructionFormats, BTypeInstructionFormat, alternateLinkRegister,
ReturnAddressRegister
Table 4: BFAA Entities from Chapter 2 with top occurrences.

A.8 Background Facts

There were 204 background facts accumulated in our experiment. They are shown in Figures 48-50 at the end. Note that some of them were collected after their first appearance (recall that we processed paragraphs one by one in the order in which they appeared in the Spec). For example, for some instruction names, their BFs were not required for processing the paragraphs that defined them. The BFs were collected after these paragraphs were processed and were used to process later paragraphs referencing the instruction names.

Figure 48: Background facts
Figure 49: Background facts Cont’d
Figure 50: Background facts Cont’d