For readability, this series is split into three articles. This first article covers the technical background, the unified data format design and the extraction rule design; the second covers the extraction program design; the third covers the interaction design for rule configuration.
1. Background
Knowledge extraction is the process of extracting knowledge from data. Depending on how structured the data is, it is divided into knowledge extraction from structured, semi-structured and unstructured data.
The most common structured data is tabular data. In traditional information systems, the large volume of data held in relational database tables is tabular. Although some fields may contain text or binary data (i.e. unstructured content), the tables are generally regarded as structured data. Common Excel sheets and tables in Word/PDF files can also be regarded as structured data, except for those whose layout is particularly free and arbitrary. People familiar with Excel know that when using spreadsheet software they should keep column formats uniform, avoid merged cells and use numbers wherever possible. This is because structured data is easier to compute with and easier for a computer to process automatically, whereas tables edited carelessly at the start are difficult to use later.
So if structured data already has a good structure, why is knowledge extraction needed at all? It is not simply because building a knowledge graph is supposed to include this step. The main motivation is that the structure of the data is inconsistent with the structure of the knowledge. The main value of a knowledge graph system lies in aggregating, associating and integrating massive data that comes from multiple sources (files, databases, networks, IO devices, etc.), in multiple formats (txt, word, pdf, etc.), in multiple modalities (text, picture, voice, video) and with multiple structures (different data element definitions, compositions, etc.). Knowledge extraction is the first step in solving this key problem, followed by knowledge fusion.
For example, suppose we need to fuse data from two systems. System A contains a "field table" (fields: id, name, position) and system B contains a "personnel table" (fields: id number, name, gender, age), and the data in the two tables overlaps. Obviously, we need to aggregate and fuse the data of the two tables to form instances of "people". However, the two tables have different structures, and the database system cannot solve this problem by itself; it has to be solved at a higher level.
Because structured data is already well organized and its content is relatively standardized (if it is not, it can be preprocessed by a data governance system), the semantics of the content are clear, so the most common approach to knowledge extraction is rule-based. A set of rules defines how to map or transform the input structured data to the target structure of the knowledge graph (i.e. the ontology or knowledge Schema). This process is essentially a shallow data conversion, the "T" in ETL. Knowledge extraction for unstructured data can be considered "E" + "T", and will be covered later.
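As a rough illustration of rule-based mapping for the two-table example, the sketch below renames source columns to properties of a common "person" structure. The target property names (`a_id`, `id_number`), the mapping dictionaries and the sample rows are all assumptions made up for this example; they are not part of the design described in this article.

```python
# Hypothetical mappings: source column -> property of the target "person" type.
SYSTEM_A_MAPPING = {"id": "a_id", "name": "name", "position": "position"}
SYSTEM_B_MAPPING = {
    "id number": "id_number",
    "name": "name",
    "gender": "gender",
    "age": "age",
}


def to_person(row, mapping):
    """Rename mapped source columns to target properties; drop unmapped ones."""
    person = {"type": "person"}
    for source_field, target_prop in mapping.items():
        if source_field in row:
            person[target_prop] = row[source_field]
    return person


# Illustrative sample rows, not real data.
row_a = {"id": 7, "name": "Chen Bo", "position": "engineer"}
row_b = {"id number": "X123", "name": "Chen Bo", "gender": "male", "age": 30}

person_a = to_person(row_a, SYSTEM_A_MAPPING)
person_b = to_person(row_b, SYSTEM_B_MAPPING)
print(person_a)
print(person_b)
```

Both rows now share one shape, which is what makes the later knowledge fusion step (deciding that the two records describe the same person) possible.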
2. Unified data format design
First, a unified format is designed for structured data to simplify subsequent processing. A Table represents a two-dimensional table. Each Table has a header and table rows. The header holds the name, type, remarks and other information for each field; the table rows hold the actual data, and each cell in a row matches the corresponding field definition.
In order to meet the application needs, the data format includes three aspects:
- Business item: actual business data
- Identification item: an identifier designed for business or purely technical purposes. Usually there is one, such as id or _id; in some cases several are used, for example a business id and a physical id. The physical id makes records easy to distinguish during data processing. It is often an auto-increment id or a UUID, and random generation or hashing can be used to give ids a uniform shape.
- Metadata item: meta information other than identification, such as the data source (type, database address, etc.), data type, file name, etc.
Take a look at the document structure of Elasticsearch:
{
  "_id": "xxx",            // Document ID
  "_index": "index-name",  // Index name
  "_type": "type-name",    // Type name; unified to _doc after ES 6
  "_source": { }           // Business data
}
ES uses _source to hold the business data, while the top-level _id, _index and _type identify the document. Extending this structure with a meta field for meta information gives the following design:
{
  "_id": "xxx",            // Document ID
  "data": { },             // Business data
  "meta": {
    "source": {"type": "file", "path": "/home/chenbo/data/1.csv"}
  }
}
meta can be extended as needed. In this example it holds data source information: the source type is a file and the file path is /home/chenbo/data/1.csv.
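A minimal sketch of how rows from a CSV file could be wrapped into this record format, assuming the physical `_id` is a random UUID and the first CSV line is the header. The function name `wrap_rows` and the sample CSV content are hypothetical.

```python
import csv
import io
import uuid


def wrap_rows(csv_text, path):
    """Wrap each CSV row into a record with _id, data and meta."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append({
            "_id": uuid.uuid4().hex,  # assumption: random UUID as physical id
            "data": dict(row),        # business data, column name -> value
            "meta": {"source": {"type": "file", "path": path}},
        })
    return records


# Illustrative content; in practice the text would be read from the file.
sample = "name,position\nChen Bo,engineer\n"
records = wrap_rows(sample, "/home/chenbo/data/1.csv")
print(records[0]["data"])
```

Note that `csv.DictReader` yields every value as a string, so any type conversion would have to happen in a later step.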
The format above is logically complete, but in practice a large number of business records share the same metadata, which causes a certain amount of redundancy. A more compact format is therefore designed:
{
  "meta": {
    "source": {
      "type": "db", "dbtype": "mysql", "host": "10.0.0.1",
      "port": 3306, "database": "buz1", "table": "person"
    }
  },
  "rows": [
    {"_id": "xxx", "data": {}}   // One record
  ]
}
The table rows are held in rows. Each item has the same format as the single-record format above, except that meta is stored once at the top level.
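The step from the per-record format to the compact format can be sketched as follows. This is only an illustration that assumes all records in a batch share identical `meta`; the function name `compact` and the choice to raise an error on mismatched metadata are assumptions.

```python
def compact(records):
    """Hoist the shared meta to the top level and keep _id/data per row."""
    if not records:
        return {"meta": {}, "rows": []}
    meta = records[0]["meta"]
    # Assumption: a batch with differing meta is rejected rather than split.
    if any(record["meta"] != meta for record in records):
        raise ValueError("records do not share the same meta")
    rows = [{"_id": r["_id"], "data": r["data"]} for r in records]
    return {"meta": meta, "rows": rows}


source = {"source": {"type": "db", "dbtype": "mysql", "table": "person"}}
batch = [
    {"_id": "1", "data": {"name": "Chen Bo"}, "meta": source},
    {"_id": "2", "data": {"name": "Li Lei"}, "meta": source},  # sample value
]
table = compact(batch)
print(table["rows"])
```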
For ease of use, the business data field data supports two formats:
- Ordinary JSON objects, i.e. K-V format
- Cell list format: {"_id": "xxx", "header": ["name"], "values": ["Chen Bo"]}
The header can be promoted to the top level. The format is as follows:
{
  "meta": {
    "source": {
      "type": "db", "dbtype": "mysql", "host": "10.0.0.1",
      "port": 3306, "database": "buz1", "table": "person"
    }
  },
  "header": [],                    // Header data
  "rows": [
    {"_id": "xxx", "values": []}   // One record
  ]
}
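The two business-data formats carry the same information, so one can be converted into the other. The sketch below shows one possible conversion in each direction; the function names are hypothetical, header items are assumed to be plain strings, and a key missing from a row is filled with None.

```python
def to_cell_lists(table):
    """Convert K-V rows into the header + values format."""
    header = []
    for row in table["rows"]:
        for key in row["data"]:
            if key not in header:  # keep first-seen column order
                header.append(key)
    rows = [
        {"_id": row["_id"], "values": [row["data"].get(k) for k in header]}
        for row in table["rows"]
    ]
    return {"meta": table["meta"], "header": header, "rows": rows}


def to_key_values(table):
    """Convert header + values rows back into K-V rows."""
    rows = [
        {"_id": row["_id"], "data": dict(zip(table["header"], row["values"]))}
        for row in table["rows"]
    ]
    return {"meta": table["meta"], "rows": rows}


kv_table = {
    "meta": {"source": {"type": "db", "table": "person"}},
    "rows": [
        {"_id": "1", "data": {"name": "Chen Bo", "age": 30}},  # sample values
        {"_id": "2", "data": {"name": "Li Lei", "age": 28}},
    ],
}
cells = to_cell_lists(kv_table)
print(cells["header"], cells["rows"][0]["values"])
```

The cell list format avoids repeating field names in every row, which matters for large tables; the K-V format is easier to read and to process record by record.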
The header can also be extended. Each item in the header describes one field. If the item is a string, it is the field name and the field type is inferred automatically by the program; if it is a JSON object, it can explicitly declare the field's name, type, default value, description, etc., as shown below:
{
  "header": [
    {"name": "name", "type": "string", "default": null, "comment": "Personnel name"}
  ]
}
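One way to handle both header forms is to normalize every item into the object form before processing. The following is a sketch only: the inference rules in `infer_type`, the type names other than "string", and the fallback values for missing keys are assumptions, since the article does not specify how the program infers types.

```python
def infer_type(values):
    """Guess a field type from sample values (hypothetical rules)."""
    present = [v for v in values if v is not None]
    if not present:
        return "string"
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "number"
    return "string"


def normalize_header(header, rows):
    """Turn string items into objects; fill missing keys in object items."""
    normalized = []
    for index, item in enumerate(header):
        if isinstance(item, str):
            column = [row["values"][index] for row in rows]
            item = {"name": item, "type": infer_type(column)}
        normalized.append({
            "name": item["name"],
            "type": item.get("type", "string"),
            "default": item.get("default"),
            "comment": item.get("comment", ""),
        })
    return normalized


header = ["name", {"name": "age", "type": "integer", "comment": "Age in years"}]
rows = [{"_id": "1", "values": ["Chen Bo", 30]}]
fields = normalize_header(header, rows)
print(fields)
```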
This data format is expressive, simple and efficient, and can be trimmed or extended to fit business needs.
3. Extraction rule design
Rule design based on JSON format:
{
  "rules": {                               // Extraction rules
    "idPrefix": "taskABC",                 // Uniform ID prefix; identifies a batch of data
    "nodes": [                             // Node rules: extract entities, events and documents
      {
        "_nodeId": "0",                    // Sequence number, referenced by edge rules
        "tableId": "table1",               // Source table; a node supports only one source table
        "id": "_",                         // Node ID; two configuration modes, may be blank
        "keyFields": ["@name"],            // Identification fields, used for conflict detection
        "name": "@name",                   // Node name; two configuration modes
        "mustFields": ["@Entity_name"],    // Mandatory fields
        "type": {"id": "/entity/human", "name": "human"},   // Entity type; three configuration modes
        "mappings": [                      // Mapping of columns to attributes
          {
            "name": {"id": "/p/name", "name": "Entity_name"},   // name supports three modes
            "value": "@Name of Transferor"                      // value supports two modes
          }
        ]
      }
    ],
    "edges": [                             // Edge rules: extract relationships between any objects
      {
        "tableId": "table1",               // Source table; an edge corresponds to only one table
        "id": "_",                         // Edge ID; two configuration modes, may be blank
        "type": {"id": "/relation/work", "name": "work"},   // Relationship type; three configuration modes
        "fromId": "@personnel ID",         // Head node ID; arrays are supported
        "toId": "@company ID",             // Tail node ID; arrays are supported
        "mappings": [],                    // Same as node mappings
        "fromNode": "0",                   // Head node, references nodes._nodeId
        "toNode": "1",                     // Tail node, references nodes._nodeId
        "directed": true                   // Whether the edge is directed
      }
    ]
  }
}
Notes
- Mapping rules are divided into node rules and edge rules
- Both nodes and edges support ID configuration, type configuration and any number of attribute configurations, of which ID and type are required; for edges, fromId and toId must also be configured
- ID configuration supports two modes: automatic generation; data field value (multiple fields can be concatenated)
- Entity name configuration supports two modes: data field value (in theory, concatenating multiple fields could also be supported); literal
- Type configuration supports three modes: Schema selection; data field value; literal
- Attribute name configuration supports two modes: Schema selection; literal (usually the field name)
- Attribute value configuration supports two modes: data field value; literal
- fromId and toId are the head and tail node IDs of the edge. They generally use data field values, correspond to node IDs, and support concatenating multiple fields
- keyFields are the identification fields, used for conflict detection; when the ID is empty, keyFields is automatically used as the ID
- mustFields are the required fields. If a corresponding field is empty, a mandatory-field conflict is raised
- idPrefix identifies a batch of data, which makes data maintenance and management easier; the front end can use the graph ID or an identifier entered by the user.
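To make the node rule semantics above more concrete, here is a minimal sketch of applying one node rule to one table row. It is an assumed interpretation, not the extraction program described in the next article: the function names, the "_" and "-" separators, the UUID used for automatic IDs and the shape of the returned conflict are all hypothetical.

```python
import uuid


def field_value(row, ref):
    """Resolve '@field' against the row; anything else is a literal."""
    if isinstance(ref, str) and ref.startswith("@"):
        return row.get(ref[1:])
    return ref


def join_fields(row, refs, sep="-"):
    """Concatenate one or several field references into a single string."""
    refs = [refs] if isinstance(refs, str) else refs
    return sep.join(str(field_value(row, ref)) for ref in refs)


def extract_node(row, rule, id_prefix):
    """Return (node, None) on success or (None, conflict) on failure."""
    missing = [f for f in rule.get("mustFields", [])
               if field_value(row, f) in (None, "")]
    if missing:  # mandatory-field conflict
        return None, {"conflict": "mandatory", "fields": missing}

    id_rule = rule.get("id", "")
    if id_rule == "_":
        local_id = uuid.uuid4().hex                   # automatic generation
    elif id_rule:
        local_id = join_fields(row, id_rule)          # data field value(s)
    else:
        local_id = join_fields(row, rule["keyFields"])  # empty ID: use keyFields

    node = {
        "id": f"{id_prefix}_{local_id}",  # assumption: prefix joined with "_"
        "name": field_value(row, rule["name"]),
        "type": rule["type"],
        "properties": {},
    }
    for mapping in rule.get("mappings", []):
        prop = mapping["name"]
        key = prop["name"] if isinstance(prop, dict) else prop
        node["properties"][key] = field_value(row, mapping["value"])
    return node, None


rule = {
    "id": "",
    "keyFields": ["@name"],
    "name": "@name",
    "mustFields": ["@name"],
    "type": {"id": "/entity/human", "name": "human"},
    "mappings": [{"name": {"id": "/p/name", "name": "Entity_name"}, "value": "@name"}],
}
node, conflict = extract_node({"name": "Chen Bo"}, rule, "taskABC")
print(node)
```

An edge rule could be handled along the same lines, resolving fromId and toId with `join_fields` so that they match the IDs produced for the corresponding nodes.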
Rule configuration syntax
- _: automatic generation
- @field name: data field value
- xxx: literal value; xxx is used directly as the data
- {id, name}: Schema selection; both id and name are passed for ease of use
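The four configuration forms can be told apart by inspecting the configured value itself. The sketch below shows one possible dispatcher; the function name `resolve`, the use of a random UUID for automatic generation and the returned shapes are assumptions for illustration.

```python
import uuid


def resolve(config, row):
    """Resolve one rule configuration value against a table row."""
    if isinstance(config, dict):
        # {id, name}: Schema selection, passed through with both keys
        return {"id": config["id"], "name": config["name"]}
    if config == "_":
        # Automatic generation (assumption: random UUID)
        return uuid.uuid4().hex
    if isinstance(config, str) and config.startswith("@"):
        # Data field value: strip '@' and look the field up in the row
        return row.get(config[1:])
    # Anything else is a literal and is used as-is
    return config


row = {"name": "Chen Bo"}
print(resolve("@name", row))
print(resolve("human", row))
print(resolve({"id": "/entity/human", "name": "human"}, row))
```

With this convention a literal cannot start with "@" or be exactly "_"; if such values were needed, an escape mechanism would have to be added.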