Discovering Technology Gaps using the IntSight Knowledge Navigator

Aurpon Gupta1, Subhasis Dasgupta1,2, Snehasis Sinha3 and Amarnath Gupta1,2
1Integrative Insights, San Diego, California 92122
Email: [email protected]
2University of California San Diego, La Jolla, CA 92093, USA
Email: sudasgupta,[email protected]
3Walmart India, Bangalore, India 560103
Abstract

Knowledge analysis is an important application of knowledge graphs. In this paper, we present a complex knowledge analysis problem that discovers the gaps in the technology areas of interest to an organization. Our knowledge graph is developed on a heterogeneous data management platform. The analysis combines semantic search, graph analytics, and polystore query optimization.

1 Introduction

Creation, exploration and management of domain-specific knowledge has become an important research issue both in academia and in the industry [1, 2, 3, 4]. The goal of this paper is to present a real-life knowledge analysis problem and to introduce IntSight Knowledge Navigator (IKN), a tool designed to address this class of problems. Consistent with today’s methodology, the “knowledge” in our setting is modeled as a graph whose nodes represent entity classes, instances contain entity properties, and the edges represent class-level, instance-level, and class-membership relationships. We use the term “entity” to represent items of interest in an application domain. An item can be a concrete object like a commercially available product (e.g., a GPU), or a more conceptual entity like a technology domain (e.g., microelectronics). Figure 1 shows an example knowledge graph in our domain.

The User-Level Problem. Consider a customer that performs technology research and produces technology products in several different areas. It works with a large number of research and industrial partners who also produce research and/or products in the same space. The company is trying to create a future growth plan in its current areas of expertise as well as some related areas where it is trying to expand. As part of its competitive landscape analysis, it would like to perform a “technology gap analysis”, which is a discovery process to identify technology areas in which its competitors have made more progress compared to the company itself and its ecosystem of partners. The outcome of the discovery process is to understand the gap areas, the players involved, and the nature of their advancement.

In this paper, we present a methodology using a combination of semantic search, graph analysis, and polystore optimization.

Figure 1: A toy knowledge graph

2 The Knowledge Graph Data Architecture

The knowledge graph underlying the IntSight Knowledge Network consists of the ontology and the data graph portions, as well as a mapping structure that bridges the two. The data is stored in a commercial version of the polystore [5] built on top of a relational DBMS (PostgreSQL), a graph DBMS (Neo4J), and text indexes (Apache Solr), each of which stores a different portion of the knowledge graph.

Ontology. The ontology, modeled in OWL-DL [6], is transformed into a property graph that preserves all ontology properties [7]. The subproperty axioms in OWL-DL are maintained in a separate tree in the same property graph. Currently, the model does not have any chain rules. Each transitive ontological relationship, like subclassOf and componentOf, forms a directed acyclic graph (DAG). For every node, the root-to-node tree path is stored in Apache Solr, which avoids explicit traversal of the graph database for transitive closure and reachability queries. For nodes with multiple root-to-node paths, we maintain additional paths from the parent of the nearest join-node of the DAG to the current node. Further, a term-to-path index is maintained to find the relevant paths for a term.
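
The idea of answering reachability and closure queries from stored paths, rather than by graph traversal, can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: plain Python dictionaries stand in for the Solr indexes, every full root-to-node path is stored (instead of the shorter join-node paths described above), and the toy ontology and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy ontology: child -> list of parents. It is a DAG because
# "gpu" has two parents. All names are illustrative only.
PARENTS = {
    "technology": [],
    "microelectronics": ["technology"],
    "computing": ["technology"],
    "processor": ["microelectronics"],
    "accelerator": ["computing"],
    "gpu": ["processor", "accelerator"],
}


def root_paths(node):
    """Enumerate every root-to-node path as a tuple of node names."""
    parents = PARENTS[node]
    if not parents:
        return [(node,)]
    return [path + (node,) for p in parents for path in root_paths(p)]


# Stand-in for the per-node path list kept in the text index.
PATH_INDEX = {node: root_paths(node) for node in PARENTS}

# Stand-in for the term-to-path index: term -> stored paths that mention it.
TERM_TO_PATHS = defaultdict(list)
for paths in PATH_INDEX.values():
    for path in paths:
        for term in path:
            TERM_TO_PATHS[term].append(path)


def is_ancestor(ancestor, node):
    """Reachability by lookup: the ancestor lies on a stored path of node."""
    return ancestor != node and any(ancestor in p for p in PATH_INDEX[node])


def descendants(term):
    """Transitive closure below term, read off the paths that contain it."""
    found = set()
    for path in TERM_TO_PATHS[term]:
        found.update(path[path.index(term) + 1:])
    return found
```

Under these assumptions, `is_ancestor("technology", "gpu")` is a membership test on two stored paths, and `descendants("computing")` is assembled from path suffixes without visiting the graph store.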

Data Source | Data Model | Polystore Placement
Patents | Relational | PostgreSQL, Text in Solr
News articles | Structured Text | Solr, Entity Network in Neo4J
Federal Spending | Relational | PostgreSQL
Company Networks | Graph | Neo4J

TABLE I: Data from a source are processed and placed into different stores.

Data Graph. The data graph is constructed from heterogeneous data sources and distributed across all three stores. The knowledge graph construction process is beyond the scope of this document. Table I shows a few of the data sources, their structure, and their placement in the polystore. The knowledge graph is designed as a materialized polystore view over these component data sets. Unlike a database view, which is defined as a single (potentially complex) query against a set of base tables or views, a polystore view is specified as a query script that defines the relational, graph, and text-index components of the view. For example, consider the patent data source, which is partitioned into a patent metadata component residing in a table and a patent description component stored in Solr. The entities derived from patents include the patents themselves, organizations, individuals, and the technology concepts pertaining to the patent; the relationships include co-ownership, IP-transfer, significant co-occurrences between technologies in selected sections of the patent, etc. As the view is materialized, the entities (together with their properties) are stored in separate relational tables, the relationships together with their end entities are stored in the graph database, and additional indices (e.g., one for time and technology terms) are stored in separate index structures. Figure 2 shows a pictorial view of this materialized structure.

Figure 2: The information processing architecture of the System

Mapping Structures. The forward mapping structure is a fast lookup structure that captures the relationship between concepts and their instances in the materialized polystore view. This structure takes the form of a key list and three compressed posting lists that contain the IDs of these instances in the three stores. A reverse mapping structure is also maintained for every store, such that a string occurring in a record of that store is mapped to the corresponding concept node in the ontology. For example, the strings “GPU” and “Graphics Processing Unit” map to the same concept node. Note that this mapping is partial: an algorithmically detected entity occurring in the data may not have a matching concept in the ontology.
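
A minimal sketch of the two mapping directions follows. It is an illustration under assumptions, not the platform's implementation: posting lists are left uncompressed, string normalization is reduced to lowercasing, and the concept key, instance IDs, and function names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Postings:
    """Instance IDs of one concept in each store (uncompressed in this sketch)."""
    relational: List[int] = field(default_factory=list)
    graph: List[int] = field(default_factory=list)
    text: List[int] = field(default_factory=list)


# Forward mapping: concept key -> posting lists in the three stores.
# The key and the IDs are made up for illustration.
forward: Dict[str, Postings] = {
    "concept:gpu": Postings(relational=[101, 205], graph=[7], text=[3001, 3002]),
}

# Reverse mapping, one per store: normalized surface string -> concept key.
reverse: Dict[str, Dict[str, str]] = {
    "text": {"gpu": "concept:gpu", "graphics processing unit": "concept:gpu"},
    "relational": {"gpu": "concept:gpu"},
    "graph": {},
}


def concept_for(store: str, surface: str) -> Optional[str]:
    """Return the concept for a string, or None because the mapping is partial."""
    return reverse[store].get(surface.strip().lower())


def instances_for(store: str, surface: str) -> List[int]:
    """Resolve a string to its concept, then to its instance IDs in that store."""
    concept = concept_for(store, surface)
    if concept is None:
        return []
    return getattr(forward[concept], store)
```

Returning `None` for an unmapped string keeps the partiality of the mapping explicit, so a caller can distinguish "no concept" from "concept with no instances in this store".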

The IntSight Knowledge Navigator tool sits on top of the materialized polystore view and performs knowledge operations that are provided through a set of API calls.

3 The Technology Gap Discovery Problem

Given a knowledge graph where technologies are treated as entities, the gap discovery problem involves two major operations: (i) technology landscape analysis and (ii) landscape-based gap discovery.

Technology Landscape Analysis. Informally, a technology landscape captures the “players” in a set of specified technology areas and their activities. Hence, we define a technology landscape as the triple $L=(P,T,C)$ with
- Performance Relation $P=(Org,Int,Tech,M_{1},M_{2},\ldots)$, where $M_{i}$ is a key performance indicator like the number of patent applications of organization $Org$ on the specific technology $Tech$ in time interval $Int$,
- Technology Correlation Graph $T$, whose nodes represent technologies $t\in dom(Tech)$ and edges represent an ontological relationship or a co-occurrence relationship, and
- Organizational Partnership Graph $C$, whose nodes are organizations $o\in dom(Org)$ and edges indicate if they have a cooperative relationship (e.g., joint patent holders, coauthors) in a technology area $t\in dom(Tech)$.

Algorithm 1 Technology Landscape Analysis
procedure landscape($Pos[], Neg[]$)
    $OntoList \leftarrow QExpand(Pos, Neg, maxD=8)$
    $ROIs[] \leftarrow densifyingGraph(OntoList, minNodes=100, minClust=0.7, history=$ ‘5 years’$)$
    $materialize(G)$
    $lScape \leftarrow ConstructLandscape(ROIs)$
    return $lScape$
end procedure

Algorithm 1 presents the steps for the Technology Landscape Analysis. The analysis starts with a user specifying two lists: $Pos$, consisting of terms of interest, and $Neg$, consisting of terms not of interest. Thus, if $Pos=$ [“query processing”, “accelerator”] and $Neg=$ [“FPGA”], the user is interested in hardware accelerator technologies (excluding FPGAs) related to query processing tasks.

Query Expansion. The first step is to use the ontology for query expansion [8, 9] to collect a larger set of positive terms that cover the desired technology space. The nominal algorithm computes the union of the transitive closures (along specific ontological relationships) of all terms in the $Pos$ list and subtracts from it the union of the transitive closures of the terms in the $Neg$ list. However, for an ontology of over a million nodes, performing multiple transitive closure operations is very expensive. We therefore use the node-to-path index and the root-to-node labels to perform query expansion efficiently.
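
The nominal expansion (union of positive closures minus union of negative closures) can be sketched as follows. This is the traversal-based baseline, not the index-based optimization used by the system. The ontology fragment is hypothetical, and reading `maxD` as a bound on traversal depth is an assumption made for this sketch.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical ontology fragment: term -> narrower terms along one
# transitive relationship. The entries are illustrative only.
CHILDREN: Dict[str, List[str]] = {
    "accelerator": ["gpu", "fpga", "tpu"],
    "fpga": ["soft-core fpga"],
    "query processing": ["join processing", "query optimization"],
}


def closure(term: str, max_depth: int) -> Set[str]:
    """Return the term plus everything reachable within max_depth hops."""
    seen = {term}
    frontier = [term]
    for _ in range(max_depth):
        frontier = [
            child
            for t in frontier
            for child in CHILDREN.get(t, [])
            if child not in seen
        ]
        if not frontier:
            break
        seen.update(frontier)
    return seen


def q_expand(pos: Iterable[str], neg: Iterable[str], max_depth: int = 8) -> Set[str]:
    """Union of the positive closures minus the union of the negative closures."""
    positive = set().union(*(closure(t, max_depth) for t in pos))
    negative = set().union(*(closure(t, max_depth) for t in neg))
    return positive - negative


expanded = q_expand(["query processing", "accelerator"], ["fpga"])
```

With this toy fragment, `expanded` keeps the GPU and TPU terms and the query processing subtopics, and drops "fpga" together with everything below it.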

Densifying Subgraph Detection. With the expanded set of ontological terms $T$, we identify nodes $N$ in the data graph $D$ that correspond to these concept terms using the mapping structures. Using each node in $N$, we identify subgraphs (regions of interest) around these nodes that satisfy the following conditions: (a) the number of nodes in the subgraph exceeds $minNodes$, (b) the average clustering coefficient of the subgraph exceeds $minClust$, and (c) the period over which there is a monotonic increase in density matches or exceeds the value of the $history$ parameter. The densifying regions, collectively called the ROI graph $G$, represent parts of the original data graph that have seen significant recent activity. To facilitate the computation of densifying subgraphs, we maintain a temporal index of communities in the graph, obtained through truss-based clustering on an HPC cluster. The index is refreshed whenever the materialized view is updated, which happens at fixed time intervals. Further, each node in our knowledge graph maintains basic network properties about itself, including its indegree, outdegree, and clustering coefficient.
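
Conditions (a) to (c) can be checked on a candidate region as in the sketch below. It rests on assumptions that are not from the paper: the region is given as a list of undirected edge lists (one snapshot per time step, oldest first, no self-loops), density is the standard $2m/(n(n-1))$, $history$ is counted in snapshot steps, and all function names are hypothetical. Candidate generation via the truss-based community index is not shown.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def adjacency(edges: List[Edge]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from an edge list."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(edges: List[Edge]) -> float:
    """Fraction of possible undirected edges that are present."""
    adj = adjacency(edges)
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def avg_clustering(edges: List[Edge]) -> float:
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    adj = adjacency(edges)
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def is_densifying(snapshots: List[List[Edge]], min_nodes: int,
                  min_clust: float, history: int) -> bool:
    """Check conditions (a)-(c) on the newest snapshot and the density trend."""
    latest = snapshots[-1]
    if len(adjacency(latest)) <= min_nodes:      # (a) size must exceed minNodes
        return False
    if avg_clustering(latest) <= min_clust:      # (b) clustering must exceed minClust
        return False
    dens = [density(s) for s in snapshots]
    run = 0                                      # (c) trailing run of increases
    for i in range(len(dens) - 1, 0, -1):
        if dens[i] <= dens[i - 1]:
            break
        run += 1
    return run >= history
```

Counting only the trailing run of increases means a region that densified in the past but has since flattened out is not reported, which matches the emphasis on recent activity.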

Landscape Construction. The technology nodes in the regions of interest are connected to organization nodes, for which additional information is available in the entity tables of the knowledge graph. Using the ROI graph $G$, we extract the companies inside $G$ or within the first neighborhood of $G$ (since each technology term is connected to any organization working on it). For each of these companies, we collect the technologies it produces that are included in $G$ and a set of KPI metrics (e.g., number of patents/publications, $\ldots$) associated with it; these populate the Performance Relation $P$. Secondly, we extract the subgraph induced by the technology nodes from $G$ and merge it with subgraphs from the ontology containing these nodes to create the Technology Correlation Graph $T$. Similarly, we extract the subgraph induced by the companies in $G$ and its first neighbors to create the Organizational Partnership Graph $C$. The technology landscape thus created is returned to the user, while the densifying graph $G$ is materialized for later reuse.
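
The assembly of the landscape triple can be sketched as three filters over typed edge sets. This is a simplified illustration under assumptions: the ROI graph is reduced to a set of technology names, a single KPI (a patent count) stands in for $M_1, M_2, \ldots$, the merge with ontology subgraphs is omitted, and every name and value in the sketch is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Landscape:
    P: List[Tuple[str, str, str, int]]  # rows of (Org, Int, Tech, M1)
    T: Set[Tuple[str, str]]             # technology-technology edges
    C: Set[Tuple[str, str]]             # organization-organization edges


def construct_landscape(
    roi_techs: Set[str],
    works_on: Set[Tuple[str, str]],        # (organization, technology) edges
    tech_edges: Set[Tuple[str, str]],
    org_edges: Set[Tuple[str, str]],
    patent_counts: Dict[Tuple[str, str], int],
    interval: str,
) -> Landscape:
    # Organizations in the first neighborhood of the ROI technologies.
    orgs = {org for org, tech in works_on if tech in roi_techs}
    # One performance row per (organization, ROI technology) pair.
    P = [
        (org, interval, tech, patent_counts.get((org, tech), 0))
        for org, tech in sorted(works_on)
        if tech in roi_techs
    ]
    # Subgraphs induced by the ROI technologies and by the extracted organizations.
    T = {(a, b) for a, b in tech_edges if a in roi_techs and b in roi_techs}
    C = {(a, b) for a, b in org_edges if a in orgs and b in orgs}
    return Landscape(P, T, C)
```

Keeping `P` as flat rows mirrors its definition as a relation, so the rows could be grouped by organization, technology, or interval without reshaping.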

Algorithm 2 Technology Gap Analysis
procedure Gap($lScape.C, C, cond[], me, \theta$)
    $s[] \leftarrow \emptyset$
    $\gamma \leftarrow (G - Build_{EGO}(lScape.C, me))$
    $\alpha \leftarrow GetQClique(G)$
    $comp \leftarrow lScape.orgs \cap \gamma.orgs$
    $\alpha_{org} \leftarrow getOrg(\alpha)$
    for $n \in comp$ do
        if $n \in \alpha_{org}$ then
            if $ORule(n, cond.org)$ then
                if $TRule(n.qCliques, cond.clique)$ then
                    if $KPIDist(n.KPI, me.KPI) > \theta$ then
                        $s.append(n, n.qCliques)$
                    end if
                end if
            end if
        end if
    end for
    return $s$
end procedure

Gap Discovery. The gap discovery method presented in Algorithm 2 accepts the landscape and the materialized ROI graph from Algorithm 1, a reference organization $me$ against which the gap will be computed, and a set of predicates $cond[]$ explained below. The algorithm starts by computing the ego network of the $me$ organization to identify all partner institutions. All other organizations are considered competitors, labeled $comp$. Independently, the ROI graph $G$ is mined for quasi-cliques $\alpha$ to identify technology combinations that have dominant activity in which the members of $comp$ participate. These participating organizations, labeled $\alpha_{org}$, are filtered based on three sets of conditions: 1) the organizations must satisfy a set of organization-specific properties (e.g., amount of investment), 2) the technology areas they are working on must satisfy certain properties (e.g., major activities in the area must not be less than a year old), and 3) the KPI distance between $me$ and the organization must exceed a threshold $\theta$. The KPIs are computed from the $M_{i}$ properties of the organization in the Performance Relation $P$ of the technology landscape. The result of the algorithm is the organization’s information from $P$ and the quasi-cliques in $\alpha$ that the organization is associated with.
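
The filtering loop of Algorithm 2 can be sketched as follows. The sketch makes assumptions beyond the paper: quasi-clique mining is taken as already done and supplied as a map from organization to the quasi-cliques it participates in, the ego network is reduced to a direct-partner map, the organization and technology rules are passed in as plain callables, and Euclidean distance is an assumed choice for the KPI distance. All names are hypothetical.

```python
import math
from typing import Callable, Dict, FrozenSet, List, Sequence, Set, Tuple

Clique = FrozenSet[str]


def kpi_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two KPI vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gap(
    orgs: Set[str],
    partners_of: Dict[str, Set[str]],
    org_cliques: Dict[str, List[Clique]],
    kpi: Dict[str, Sequence[float]],
    o_rule: Callable[[str], bool],
    t_rule: Callable[[List[Clique]], bool],
    me: str,
    theta: float,
) -> List[Tuple[str, List[Clique]]]:
    # Ego network of `me`: the organization itself plus its direct partners.
    ego = {me} | partners_of.get(me, set())
    comp = orgs - ego  # every other organization is treated as a competitor
    # Organizations that take part in at least one quasi-clique.
    alpha_org = {org for org, cliques in org_cliques.items() if cliques}

    result = []
    for n in sorted(comp):
        if n not in alpha_org:
            continue
        if not o_rule(n):                  # organization-level conditions
            continue
        if not t_rule(org_cliques[n]):     # technology-level conditions
            continue
        if kpi_dist(kpi[n], kpi[me]) > theta:
            result.append((n, org_cliques[n]))
    return result
```

The early `continue` statements are equivalent to the nested conditionals of Algorithm 2 and keep the cheaper membership tests ahead of the distance computation.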

4 The IntSight Knowledge Navigator

The IntSight Knowledge Navigator is a web-based exploration and analysis tool that enables users to perform the two analyses through a dashboard-like interface. The tool uses the platform API and the Knowledge Graph API. For example, the gap analysis returns the organization ID and its high-activity technology areas. The API invokes a database join operation with the Performance Relation and the Entity Tables to construct the result objects needed for visualization.

Figure 3: A spider chart showing the result of query expansion and the relative volumes of data from multiple data sources.

The dashboard shows the results for every step in the process to facilitate human interaction. Figure 3 shows partial results for the query $Pos=[radar]$. For result elements like the Performance Relation, the API returns the result of a data cube operation, and the dashboard displays several different charts for the group combinations that the system finds informative, based on factors used in visualization recommender systems [10]. For the same query, Figure 4 shows the timeline of innovations for different technologies related to the query.

Figure 4: Comparative timeline of innovations in the technology area in the specified time interval.

The user can choose to perform comparative gap analysis between organizations and technologies. Figure 5 shows the results of a gap comparison of FPGAs vs. GPUs for a specific organization in the context of target tracking.

Figure 5: A comparative analysis of gaps for two technologies. The bar charts on the left show the leading organizations in the technology area.

References

  • [1] X. Zhou, A. Eibeck, M. Q. Lim, N. B. Krdzavac, and M. Kraft, “An agent composition framework for the j-park simulator-a knowledge graph for the process industry,” Computers & Chemical Engineering, vol. 130, p. 106577, 2019.
  • [2] F. Leijie, B. Yv, and Z. Zhenyuan, “Constructing a vertical knowledge graph for non-traditional machining industry,” in 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC). IEEE, 2018, pp. 1–5.
  • [3] Y. Wang and J. Liu, “Product prediction based on average mutual information and knowledge graph,” Electronic Technology, p. 01, 2017.
  • [4] L. Zhang and S. Huang, “New technology foresight method based on intelligent knowledge management,” Frontiers of Engineering Management, vol. 7, no. 2, pp. 238–247, 2020.
  • [5] S. Dasgupta, K. Coakley, and A. Gupta, “Analytics-driven data ingestion and derivation in the awesome polystore,” in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 2555–2564.
  • [6] B. Motik, U. Sattler, and R. Studer, “Query answering for owl-dl with rules,” in International Semantic Web Conference. Springer, 2004, pp. 549–563.
  • [7] F. Gong, Y. Ma, W. Gong, X. Li, C. Li, and X. Yuan, “Neo4j graph database realizes efficient storage performance of oilfield ontology,” PloS one, vol. 13, no. 11, p. e0207595, 2018.
  • [8] J. Wu, I. Ilyas, and G. Weddell, “A study of ontology-based query expansion,” Dept. of Comp. Sc., Univ. of Waterloo, Tech. Rep. CS-2011–04, 2011.
  • [9] H. K. Azad and A. Deepak, “A new approach for query expansion using wikipedia and wordnet,” Information sciences, vol. 492, pp. 147–163, 2019.
  • [10] D. J.-L. Lee, V. Setlur, M. Tory, K. G. Karahalios, and A. Parameswaran, “Deconstructing categorization in visualization recommendation: A taxonomy and comparative study,” IEEE Transactions on Visualization and Computer Graphics, 2021.