Discovering Technology Gaps using the IntSight Knowledge Navigator

Aurpon Gupta1, Subhasis Dasgupta1,2, Snehasis Sinha3 and Amarnath Gupta1,2
1 Integrative Insights, San Diego, California 92122. Email: [email protected]
2 University of California San Diego, La Jolla, CA 92093, USA. Email: sudasgupta,[email protected]
3 Walmart India, Bangalore, India 560103

Abstract

Knowledge analysis is an important application of knowledge graphs. In this paper, we present a complex knowledge analysis problem that discovers the gaps in the technology areas of interest to an organization. Our knowledge graph is developed on a heterogeneous data management platform. The analysis combines semantic search, graph analytics, and polystore query optimization.

1 Introduction

Creation, exploration and management of domain-specific knowledge have become an important research issue in both academia and industry [1, 2, 3, 4]. The goal of this paper is to present a real-life knowledge analysis problem and to introduce IntSight Knowledge Navigator (IKN), a tool designed to address this class of problems. Consistent with today’s methodology, the “knowledge” in our setting is modeled as a graph whose nodes represent entity classes and instances, where instances carry entity properties, and whose edges represent class-level, instance-level, and class-membership relationships. We use the term “entity” to represent items of interest in an application domain. An item can be a concrete object like a commercially available product (e.g., a GPU), or a more conceptual entity like a technology domain (e.g., microelectronics). Figure 1 shows an example knowledge graph in our domain.

The User-Level Problem. Consider a customer that performs technology research and produces technology products in several different areas. It works with a large number of research and industrial partners who also produce research and/or products in the same space. The company is trying to create a future growth plan in its current areas of expertise as well as some related areas where it is trying to expand. As part of its competitive landscape analysis, it would like to perform a “technology gap analysis”, which is a discovery process to identify technology areas in which its competitors have made more progress compared to the company itself and its ecosystem of partners. The outcome of the discovery process is to understand the gap areas, the players involved, and the nature of their advancement.

In this paper, we present a methodology using a combination of semantic search, graph analysis, and polystore optimization.

Figure 1: A toy knowledge graph

2 The Knowledge Graph Data Architecture

The knowledge graph underlying the IntSight Knowledge Network consists of the ontology and the data graph portions, as well as a mapping structure that bridges the two. The data is stored in a commercial version of the polystore [5], which uses a relational DBMS (PostgreSQL), a graph DBMS (Neo4J), and text indexes (Apache Solr) to store different portions of the knowledge graph.

Ontology. The ontology, modeled in OWL-DL [6], is transformed into a property graph that preserves all ontology properties [7]. The subproperty axioms in OWL-DL are maintained in a separate tree in the same property graph. Currently, the model does not have any chain rules. Each transitive ontological relationship like subclassOf and componentOf is a directed acyclic graph. Every node maintains a list of the root-to-node tree-path in Apache Solr to avoid explicit graph traversal in the graph database for transitive closure and reachability queries. For nodes with multiple root-to-node paths, we maintain additional paths from the parent of the nearest join-node (of a DAG) to the current node. Further, a term-to-path index is maintained to find relevant paths for a term.
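The path-labeling idea can be illustrated with a minimal sketch. It assumes a tree-shaped hierarchy held in a hypothetical `PARENT` dictionary and hypothetical helpers `root_path`, `is_ancestor` and `descendants`; in the system described here the paths are stored in Apache Solr and DAG nodes carry additional paths, neither of which is modeled below.

```python
# Hypothetical toy hierarchy: child -> parent (a tree, so one path per node).
PARENT = {
    "hardware": None,
    "processor": "hardware",
    "GPU": "processor",
    "FPGA": "processor",
    "memory": "hardware",
}


def root_path(node):
    """Return the root-to-node path as a tuple of labels."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return tuple(reversed(path))


# Precompute the label once per node, as an index would.
PATH_INDEX = {n: root_path(n) for n in PARENT}


def is_ancestor(a, b):
    """True if a is a proper ancestor of b, decided by a prefix test on stored paths."""
    pa, pb = PATH_INDEX[a], PATH_INDEX[b]
    return len(pa) < len(pb) and pb[:len(pa)] == pa


def descendants(a):
    """Transitive closure below a, computed by scanning path labels only."""
    return {n for n in PATH_INDEX if is_ancestor(a, n)}
```

Reachability and closure queries in this sketch never walk the graph; they compare precomputed labels, which is the property the index is meant to provide.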

Data Source	Data Model	Polystore Placement
Patents	Relational	PostgreSQL, Text in Solr
News articles	Structured Text	Solr, Entity Network in Neo4J
Federal Spending	Relational	PostgreSQL
Company Networks	Graph	Neo4J

TABLE I: Data from a source are processed and placed into different stores.

Data Graph. The data graph is constructed from heterogeneous data sources and distributed across all three stores. The knowledge graph construction process is beyond the scope of this document. Table I shows a few of the data sources, their structure, and their placement in the polystore. The knowledge graph is designed as a materialized polystore view over these component data sets. Unlike a database view that is defined as a single (potentially complex) query against a set of base tables or views, a polystore view is specified as a query script that specifies the relational, graph and text-index components of the view. For example, consider the patent data source which is partitioned into the patent metadata component residing in a table, and a patent description component, which is stored in Solr. The entities derived from patents include the patents themselves, organizations, individuals, the technology concepts pertaining to the patent; the relationships include co-ownership, IP-transfer, significant co-occurrences between technologies in selected sections of the patent, etc. As the view is materialized, the entities (together with their properties) are stored in separate relational tables, the relationships together with their end entities are stored in the graph database, and additional indices (e.g., one for time and technology terms) are stored in separate index structures. Figure 2 shows a pictorial view of this materialized structure.

Mapping Structures. The forward mapping structure is a fast lookup structure that captures the relationship between concepts and their instances in the materialized polystore view. This structure takes the form of a key list and three compressed posting lists that contain the IDs of these instances in the three stores. A reverse mapping structure is also maintained for every store, such that a string occurring in a record of that store is mapped to the corresponding concept node in the ontology. For example, strings “GPU” and “Graphics Processing Unit” map to the same concept node. Note that this mapping is partial, and an algorithmically detected entity occurring in the data may not have a matching concept in the ontology.
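A minimal sketch of the two mapping structures follows. The class names `ForwardMapping` and `ReverseMapping`, the store labels, the concept identifiers, and the case-insensitive string match are all assumptions made for illustration; the posting lists are kept as plain sorted lists rather than compressed ones.

```python
from collections import defaultdict

# Assumed labels for the three stores of the polystore.
STORES = ("relational", "graph", "text")


class ForwardMapping:
    """Concept -> instance IDs, one sorted posting list per store (uncompressed here)."""

    def __init__(self):
        self._postings = defaultdict(lambda: {s: [] for s in STORES})

    def add(self, concept, store, instance_id):
        plist = self._postings[concept][store]
        if instance_id not in plist:
            plist.append(instance_id)
            plist.sort()

    def lookup(self, concept):
        # .get() does not trigger the default factory, so unknown concepts stay absent.
        entry = self._postings.get(concept)
        if entry is None:
            return {s: [] for s in STORES}
        return {s: list(entry[s]) for s in STORES}


class ReverseMapping:
    """Surface string found in one store -> concept node; partial by design."""

    def __init__(self):
        self._concept_of = {}

    def add(self, surface, concept):
        self._concept_of[surface.casefold()] = concept

    def resolve(self, surface):
        # Returns None when the string has no matching concept in the ontology.
        return self._concept_of.get(surface.casefold())
```

Returning `None` from `resolve` mirrors the partial nature of the mapping: a detected entity may simply have no concept node.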

The IntSight Knowledge Navigator tool sits on top of the materialized polystore view and performs knowledge operations that are provided through a set of API calls.

3 The Technology Gap Discovery Problem

Given a knowledge graph where technologies are treated as entities, the gap discovery problem involves two major operations: (i) technology landscape analysis and (ii) landscape-based gap discovery.

Technology Landscape Analysis. Informally, a technology landscape captures the “players” in a set of specified technology areas and their activities. Hence, we define a technology landscape as the triple $L=(P,T,C)$ with
- Performance Relation $P=(Org,Int,Tech,M_{1},M_{2},\ldots)$, where $M_{i}$ is a key performance indicator like the number of patent applications of organization $Org$ on the specific technology $Tech$ in time interval $Int$,
- Technology Correlation Graph $T$, whose nodes represent technologies $t\in dom(Tech)$ and whose edges represent an ontological relationship or a co-occurrence relationship, and
- Organizational Partnership Graph $C$, whose nodes are organizations $o\in dom(Org)$ and whose edges indicate if they have a cooperative relationship (e.g., joint patent holders, coauthors) in a technology area $t\in dom(Tech)$.

Algorithm 1 Technology Landscape Analysis

1: procedure landscape($Pos[], Neg[]$)
2:     $OntoList \leftarrow QExpand(Pos, Neg, maxD=8)$
3:     $ROIs[] \leftarrow densifyingGraph(OntoList, minNodes=100, minClust=0.7, history=$ ‘5 years’$)$
4:     $materialize(G)$
5:     $lScape \leftarrow ConstructLandscape(ROIs)$
6:     return $lScape$
7: end procedure

Algorithm 1 presents the steps for the Technology Landscape Analysis. The analysis starts with a user specifying two lists: $Pos$, consisting of terms of interest, and $Neg$, consisting of terms not of interest. Thus, if $Pos=$ [“query processing”, “accelerator”] and $Neg=$ [“FPGA”], the user is interested in hardware accelerator technologies (excluding FPGAs) related to query processing tasks.

Query Expansion. The first step is to use the ontology for query expansion [8, 9] to collect a larger set of positive terms that cover the desired technology space from the ontology. The nominal algorithm computes the union of the transitive closure (along specific ontological relationships) of all terms in the $Pos$ list and subtracts from it the union of the transitive closure of terms in the $Neg$ list. However, for an ontology with over a million nodes, multiple transitive closure operations are very expensive. We therefore use the node-to-path index and the root-to-node labels to perform query expansion efficiently.
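The nominal expansion (closure of $Pos$ minus closure of $Neg$) can be sketched over path labels as follows. The `PATHS` table, the function names, and the reading of `maxD` as a maximum number of levels below a seed term are assumptions for illustration; the sketch scans every label, whereas an index would look up only the relevant paths.

```python
# Hypothetical root-to-node path labels (one tuple per path; DAG nodes may have several).
PATHS = {
    "computing": [("computing",)],
    "accelerator": [("computing", "accelerator")],
    "GPU": [("computing", "accelerator", "GPU")],
    "FPGA": [("computing", "accelerator", "FPGA")],
    "FPGA overlay": [("computing", "accelerator", "FPGA", "FPGA overlay")],
    "query processing": [("computing", "query processing")],
    "join algorithm": [("computing", "query processing", "join algorithm")],
}


def closure(seeds, max_depth):
    """Seeds plus every term lying at most max_depth levels below a seed on some path."""
    out = set()
    for term, paths in PATHS.items():
        for path in paths:
            for seed in seeds:
                # The term is the last element of its own path, so the distance
                # from seed to term is the number of steps after the seed.
                if seed in path and len(path) - 1 - path.index(seed) <= max_depth:
                    out.add(term)
    return out


def q_expand(pos, neg, max_d=8):
    """Closure of the positive terms minus closure of the negative terms."""
    return closure(pos, max_d) - closure(neg, max_d)
```

With the example lists from the text, `q_expand(["query processing", "accelerator"], ["FPGA"])` keeps the accelerator and query processing subtrees while dropping `FPGA` and everything below it.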

Densifying Subgraph Detection. With the expanded set of ontological terms $T$, we identify nodes $N$ in the data graph $D$ that correspond to these concept terms using the mapping structures. Using each node in $N$, we identify subgraphs (regions of interest) around these nodes that satisfy the following conditions: (a) the number of nodes in the subgraph exceeds $minNodes$, (b) the average clustering coefficient of the subgraph exceeds $minClust$, and (c) the period over which there is a monotonic increase in density matches or exceeds the value of the $history$ parameter. The densifying regions, collectively called the ROI graph $G$, represent parts of the original data graph that have seen significant recent activity. To facilitate the computation of densifying subgraphs, we maintain a temporal index of communities in the graph, obtained through truss-based clustering run on an HPC cluster. The index is refreshed whenever the materialized view is updated, which happens at fixed time intervals. Further, each node in our knowledge graph maintains basic network properties about itself, including its indegree, outdegree and clustering coefficient.
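The three conditions can be checked on a candidate region as in the sketch below. It assumes the region is available as a chronological list of snapshots, each a pair of a node set and a set of undirected edges without duplicates or self-loops, and it measures `history` in snapshot periods rather than calendar time. All function names are hypothetical, and the truss-based community index is not modeled.

```python
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map from a set of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    return 0.0 if n_nodes < 2 else 2.0 * n_edges / (n_nodes * (n_nodes - 1))


def avg_clustering(nodes, edges):
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    adj = adjacency(edges)
    total = 0.0
    for v in nodes:
        nbrs = adj.get(v, set())
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0


def increasing_run(values):
    """Number of consecutive strict increases ending at the last value."""
    run = 0
    for prev, cur in zip(reversed(values[:-1]), reversed(values[1:])):
        if cur <= prev:
            break
        run += 1
    return run


def is_densifying(snapshots, min_nodes, min_clust, history):
    """Apply conditions (a)-(c) to the latest snapshot and the density series."""
    nodes, edges = snapshots[-1]
    dens = [density(len(n), len(e)) for n, e in snapshots]
    return (len(nodes) > min_nodes
            and avg_clustering(nodes, edges) > min_clust
            and increasing_run(dens) >= history)
```

Evaluating size and clustering on the latest snapshot only is a design choice of this sketch; a stricter variant could require them to hold in every period of the run.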

Landscape Construction. The technology nodes in the regions of interest are connected to organization names, which have additional information about them in the entity tables of the knowledge graph. Using the ROI graph $G$, we extract the companies inside $G$ or within the first neighborhood of $G$ (since each technology term is connected to any organization working on it). For each of these companies, we collect the technologies it produces that are included in $G$ and a set of KPI metrics (e.g., number of patents/publications, $\ldots$) associated with it. Secondly, we extract the subgraph induced by the technology nodes from $G$ and merge it with subgraphs from the ontology containing these nodes to create the Technology Correlation Graph $T$. Similarly, we extract the subgraph induced by the companies in $G$ and its first neighbors to create the Organizational Partnership Graph $C$. The technology landscape thus created is returned to the user, while the densifying graph $G$ is materialized for later reuse.
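A simplified sketch of assembling the landscape triple from an ROI is given below. The input shapes, the single KPI column, and the name `construct_landscape` are assumptions; merging with ontology subgraphs and the first-neighborhood expansion are left out, so this shows only the induced-subgraph filtering.

```python
from collections import namedtuple

# The landscape triple L = (P, T, C).
Landscape = namedtuple("Landscape", ["P", "T", "C"])


def construct_landscape(roi_techs, activity, tech_edges, partnerships):
    """
    roi_techs:    set of technology nodes in the ROI graph G
    activity:     iterable of (org, interval, tech, n_patents) rows (one assumed KPI)
    tech_edges:   iterable of (tech, tech) ontological or co-occurrence edges
    partnerships: iterable of (org, org, tech) cooperative relationships
    """
    # P: keep only performance rows about technologies inside G.
    P = [row for row in activity if row[2] in roi_techs]
    orgs = {row[0] for row in P}
    # T: subgraph induced by the ROI technology nodes.
    T = {(a, b) for a, b in tech_edges if a in roi_techs and b in roi_techs}
    # C: partnerships among the extracted organizations within an ROI technology.
    C = {(a, b, t) for a, b, t in partnerships
         if a in orgs and b in orgs and t in roi_techs}
    return Landscape(P, T, C)
```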

Algorithm 2 Technology Gap Analysis

1: procedure Gap($lScape.C, C, cond[], me, \theta$)
2:     $s[] \leftarrow \emptyset$
3:     $\gamma \leftarrow (G - Build_{EGO}(lScape.C, me))$
4:     $\alpha \leftarrow GetQClique(G)$
5:     $comp \leftarrow lScape.orgs \cap \gamma.orgs$
6:     $\alpha_{org} \leftarrow getOrg(\alpha)$
7:     for $n \in comp$
8:         if $n \in \alpha_{org}$ then
9:             if $ORule(n, cond.org)$ then
10:                 if $TRule(n.qCliques, cond.clique)$ then
11:                     if $KPIDist(n.KPI, me.KPI) > \theta$ then
12:                         $s.append(n, n.qCliques)$
13:                     end if
14:                 end if
15:             end if
16:         end if
17:     end for
18:     return $s$
19: end procedure

Gap Discovery. The gap discovery method presented in Algorithm 2 accepts the landscape and the materialized ROI graph from Algorithm 1, a reference organization $me$ against which the gap will be computed, a set of predicates $cond[]$ explained below, and a KPI distance threshold $\theta$. The algorithm starts by computing the ego network of the $me$ organization to identify all partner institutions. All other organizations are considered competitors, labeled $comp$. Independently, the ROI graph $G$ is mined for quasi-cliques $\alpha$ to identify technology combinations with dominant activity in which the members of $comp$ participate. These participating organizations, labeled $\alpha_{org}$, are filtered based on three sets of conditions: 1) the organizations must satisfy a set of organization-specific properties (e.g., amount of investment), 2) the technology areas they are working on must satisfy certain properties (e.g., major activities in the area must not be less than a year old), and 3) the KPI distance between $me$ and the organization must exceed the threshold $\theta$. The KPIs are computed from the $M_{i}$ properties of the organization in the Performance Relation $P$ of the technology landscape. The result of the algorithm is each qualifying organization’s information from $P$, together with the quasi-cliques in $\alpha$ that the organization is associated with.
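The filtering cascade of Algorithm 2 can be sketched as follows. The sketch assumes that the ego network and the quasi-cliques have already been computed and are passed in as plain Python collections; the Euclidean `kpi_dist`, the input shapes, and the predicate arguments `org_rule` and `tech_rule` (standing in for $cond.org$ and $cond.clique$) are illustrative assumptions, not the system's API.

```python
import math


def kpi_dist(a, b):
    """Euclidean distance between two KPI vectors (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gap(orgs, partners_of_me, q_cliques, org_rule, tech_rule, me, theta):
    """
    orgs:            dict org -> {"kpi": tuple, ...} for every organization in the landscape
    partners_of_me:  set of organizations in the ego network of `me`
    q_cliques:       dict org -> list of quasi-cliques (sets of technologies) it takes part in
    org_rule:        predicate on an organization record
    tech_rule:       predicate on an organization's list of quasi-cliques
    """
    # Everyone outside the ego network of `me` is treated as a competitor.
    competitors = set(orgs) - partners_of_me - {me}
    result = []
    for n in sorted(competitors):
        cliques = q_cliques.get(n, [])
        if not cliques:
            continue  # not active in any dominant technology combination
        if not org_rule(orgs[n]):
            continue
        if not tech_rule(cliques):
            continue
        if kpi_dist(orgs[n]["kpi"], orgs[me]["kpi"]) > theta:
            result.append((n, cliques))
    return result
```

Because the assumed distance is symmetric, this sketch also reports competitors that trail `me` by more than $\theta$; a signed difference would restrict the result to organizations that are ahead.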

4 The IntSight Knowledge Navigator

The IntSight Knowledge Navigator is a web-based exploration and analysis tool that enables users to perform the two analyses through a dashboard-like interface. The tool uses the platform API and the Knowledge Graph API. For example, the gap analysis returns the organization ID and its high-activity technology areas. The API invokes a database join operation with the Performance Relation and the Entity Tables to construct the result objects needed for visualization.

The dashboard shows the results for every step in the process to facilitate human interaction. Figure 3 shows partial results for the query $Pos=[radar]$. For result elements like the Performance Relation, the API returns the result of a data cube operation, and the dashboard displays several different charts for different group combinations that the system finds informative based on factors used in visualization recommender systems [10]. For the same query, Figure 4 shows the timeline of innovations for different technologies related to the query.

The user can choose to perform comparative gap analysis between organizations and technologies. Figure 5 shows the results of a gap comparison of FPGAs vs. GPUs for a specific organization in the context of target tracking.

References

[1] X. Zhou, A. Eibeck, M. Q. Lim, N. B. Krdzavac, and M. Kraft, “An agent composition framework for the J-Park Simulator: a knowledge graph for the process industry,” Computers & Chemical Engineering, vol. 130, p. 106577, 2019.
[2] F. Leijie, B. Yv, and Z. Zhenyuan, “Constructing a vertical knowledge graph for non-traditional machining industry,” in 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC). IEEE, 2018, pp. 1–5.
[3] Y. Wang and J. Liu, “Product prediction based on average mutual information and knowledge graph,” Electronic Technology, p. 01, 2017.
[4] L. Zhang and S. Huang, “New technology foresight method based on intelligent knowledge management,” Frontiers of Engineering Management, vol. 7, no. 2, pp. 238–247, 2020.
[5] S. Dasgupta, K. Coakley, and A. Gupta, “Analytics-driven data ingestion and derivation in the AWESOME polystore,” in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 2555–2564.
[6] B. Motik, U. Sattler, and R. Studer, “Query answering for OWL-DL with rules,” in International Semantic Web Conference. Springer, 2004, pp. 549–563.
[7] F. Gong, Y. Ma, W. Gong, X. Li, C. Li, and X. Yuan, “Neo4j graph database realizes efficient storage performance of oilfield ontology,” PLoS ONE, vol. 13, no. 11, p. e0207595, 2018.
[8] J. Wu, I. Ilyas, and G. Weddell, “A study of ontology-based query expansion,” Dept. of Comp. Sc., Univ. of Waterloo, Tech. Rep. CS-2011–04, 2011.
[9] H. K. Azad and A. Deepak, “A new approach for query expansion using Wikipedia and WordNet,” Information Sciences, vol. 492, pp. 147–163, 2019.
[10] D. J.-L. Lee, V. Setlur, M. Tory, K. G. Karahalios, and A. Parameswaran, “Deconstructing categorization in visualization recommendation: A taxonomy and comparative study,” IEEE Transactions on Visualization and Computer Graphics, 2021.