Application of Artificial Intelligence in Data Quality Management

Keywords: SQL, AI, NLP, monitoring and control

To be honest, the concept of artificial intelligence is too broad. It includes deep learning, machine learning, reinforcement learning and so on. Deep learning covers applications such as image recognition, speech recognition, natural language processing and predictive analysis; machine learning includes supervised, unsupervised and semi-supervised learning. Supervised learning is further divided into regression, classification (decision trees, for example) and so on. In theory, AI can do anything and serve every purpose.
Data quality management, by contrast, is very down to earth, and everyone understands its pieces: the definition of data quality inspection rules, the rule scripts, the rule execution engine, the monitoring of rule execution, and the data quality reports. Data quality inspection rules are just a set of checks for consistency, accuracy, uniqueness, authenticity, timeliness, relevance and integrity. In plain words, they are SQL statements.
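To make "rules are just SQL statements" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables (customer, orders), their columns and the three rules are all made up for illustration; they are not taken from any real system.

```python
import sqlite3

# Hypothetical rule set: each rule is a name plus a SQL query
# that returns the number of violations it found.
RULES = {
    "uniqueness: customer_id is unique": """
        SELECT COUNT(*) FROM (
            SELECT customer_id FROM customer
            GROUP BY customer_id HAVING COUNT(*) > 1
        )
    """,
    "integrity: phone is filled in": """
        SELECT COUNT(*) FROM customer
        WHERE phone IS NULL OR TRIM(phone) = ''
    """,
    "consistency: every order has a customer": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customer c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL
    """,
}

def build_demo_db():
    """Create an in-memory database with a few deliberately bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER, phone TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
        INSERT INTO customer VALUES
            (1, '13800000000'), (2, NULL), (2, ''), (3, '13900000000');
        INSERT INTO orders VALUES (10, 1), (11, 3), (12, 99);
    """)
    return conn

def run_rules(conn, rules):
    """Run every rule and return {rule name: violation count}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in rules.items()}

if __name__ == "__main__":
    for name, violations in run_rules(build_demo_db(), RULES).items():
        status = "PASS" if violations == 0 else "FAIL"
        print(f"{status}  {name}: {violations} violation(s)")
```

A rule execution engine is, at its core, a loop like run_rules plus scheduling, logging and reporting around it.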
However, what I am facing is not data quality management built on a data warehouse. A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical change and is used to support management decisions. Because a warehouse is integrated and subject-oriented, its data is naturally related, and the existence of data lineage (literally "blood relationship") is the precondition for tracing data quality problems back to their origin. In short, the purpose is to ensure end-to-end data accuracy.
When I first joined a telecommunications operator, I was mainly responsible for the traffic and data services part of the analysis system. A preliminary count showed 7,800 tables and thousands of stored procedures in that subsystem, running from the interface layer to the storage layer, the intermediate tables, the summary layer and the report layer. The design was reasonable. I did not know the business and wanted to get up to speed quickly, so I spent three months reading hundreds of stored procedures covering more than a hundred intermediate tables and wrote up several lineage documents by hand. After a year of this, you can basically design a data model yourself. It is a dumb method, but it is also the fastest.
After I joined a new company, a lot of new concepts came out each year, but few of them were ever put into practice: data centers, digital platforms, data platforms, data lakes, Internet of Things, artificial intelligence platforms, cloud reconstruction. Yesterday, I read an article saying that a data lake is a lazy data warehouse. The company is very interested in mining existing and historical business data, but it has no detailed plan for how to analyze it. So it saves the data first and considers the analysis later, because for many companies, data is a huge asset. A data lake is trendy even if it does not solve practical problems, and Hadoop and MPP storage is both cheap and large. The key point is that it lets you sidestep data quality and the enterprise data model. I think that is probably the reason.
What do I have to face? A large number of business systems, some semi-centralized reporting systems, ETL tools, DBLink extracts, OGG synchronization, and data exchange between business systems based on master data and SOA. As with any early system, the original design was good, but the actual implementation fell well short of it, and data quality problems arose as a result. These problems break down into technical problems and data problems. Technical problems would mean holding the people who built the systems accountable, and of course that cannot be pursued; after all, the systems have been live for so many years. Data problems would mean holding the business departments' data entry staff and users accountable, which involves data acceptance, and data acceptance touches the interests of all parties. And so on and so forth.
But the work still needs to be done, so do what can be done. Personally, I think that in the short term, data quality work can solve the problem of data homology (the same data coming from one source), while the long-term goal is to solve the enterprise data model problem. After all, verifying data quality is a process of understanding the data, and understanding the data is a process of steadily getting familiar with the business. Once you are familiar with the business, you naturally start to think about the enterprise data model from the perspective of business people.
So far I have talked about the goals of data quality work. Inevitably, data quality also involves policies and mechanisms, platform tools, and an operations team.

    1. Based on the maturity of the company's data management capabilities and the problems in its data management process, establish a set of practical and workable data management policies.

    2. For data accountability, start from specific business scenarios instead of trying to tackle everything at once.

    3. Set up a comprehensive data quality platform that provides automatic, self-service and intelligent support for data quality. The data still has to be centralized; nothing can be done without centralization.

    4. Set up a closed-loop data management process that collects and handles problems from the bottom up.

    5. Set up a data governance organization structure. This is usually much talk and little action, so I start with myself: a data operations team and a clear division of work are still necessary.

In the end, it still comes down to money. Money is not everything, but without money nothing can be done. Policies, platforms and teams all depend on it.
I have talked a lot about data quality, so how can artificial intelligence help solve it? Some of what follows I picked up from the Internet, and some comes from my own repeated thinking. Work has been keeping me up at night.
Data quality problems mainly occur at the source, in transit and at the destination, which is easy to say and hard to act on. Checking millions of fields in tens of thousands of tables by hand would probably take a lifetime. What can we do? We rely on artificial intelligence and expert experience.
1. Determining the Data Quality Range

 1,Select some important tables and data items for data quality verification based on table heat analysis, reference object analysis and lineage analysis, and then confirm them with the business department. Generally speaking, this works very well.
 2,Tables that have little or no data can be ignored.
 3,I wanted to compute the similarity between database tables using NLP text similarity, to judge whether tables are largely duplicates, so that highly similar tables can also be excluded. In the end, SQL got it done. It had nothing to do with artificial intelligence.
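
The table similarity idea in item 3 can be sketched without any NLP at all. The example below uses Jaccard similarity over column names as a simple stand-in for text similarity; the schemas and the 0.8 threshold are invented for illustration, and in practice the column lists would come from the database catalog.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similar_tables(schemas, threshold=0.8):
    """Return (table_a, table_b, score) for table pairs at or above threshold."""
    pairs = []
    for (t1, cols1), (t2, cols2) in combinations(sorted(schemas.items()), 2):
        # Compare column names case-insensitively.
        score = jaccard((c.lower() for c in cols1), (c.lower() for c in cols2))
        if score >= threshold:
            pairs.append((t1, t2, round(score, 2)))
    # Highest similarity first.
    return sorted(pairs, key=lambda pair: -pair[2])

# Hypothetical schemas: table name -> column names.
SCHEMAS = {
    "user_info": ["user_id", "name", "phone", "city", "create_time"],
    "user_info_bak": ["USER_ID", "NAME", "PHONE", "CITY", "CREATE_TIME"],
    "user_info_2019": ["user_id", "name", "phone", "city", "create_time", "batch_no"],
    "order_detail": ["order_id", "user_id", "amount", "create_time"],
}

if __name__ == "__main__":
    for t1, t2, score in similar_tables(SCHEMAS):
        print(f"{t1} ~ {t2}: {score}")
```

Backup and yearly copies of the same table score high and can be dropped from the verification scope after a quick manual look.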

2. Rules for Data Quality Inspection

 1,For a small number of core checking rules, select training samples from big data, analyze them in depth with machine learning algorithms, and extract common features and models. These can be used to locate the causes of data quality problems, predict data quality problems, and build up a knowledge base that strengthens data quality management. To be honest, this is a bit hollow. Without rules to start from, you will not dig out anything at all.
 2,Based on the normal or long-tail distribution of the monitored data quality metrics, train a model to determine the data thresholds and judge whether new data is abnormal. This supports both early warning and after-the-fact monitoring of data quality. Of course, the range of application is extremely limited.
 3,More often, automatically generating the SQL scripts is enough, at least for the simplest rules. The last resort is to read the code by hand and write the checking rules from it.
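
The thresholding idea in item 2 can be sketched with plain statistics rather than a trained model. The example below assumes the metric is roughly normally distributed and flags anything outside mean ± 3 standard deviations; the daily row counts are made-up numbers.

```python
import statistics

def fit_thresholds(history, k=3.0):
    """Derive (low, high) bounds as mean -/+ k sample standard deviations."""
    mean = statistics.fmean(history)
    std = statistics.stdev(history)
    return mean - k * std, mean + k * std

def is_abnormal(value, thresholds):
    """True when the value falls outside the learned bounds."""
    low, high = thresholds
    return value < low or value > high

# Hypothetical history: rows loaded per day into one table.
DAILY_ROW_COUNTS = [10210, 9980, 10105, 10340, 9875,
                    10050, 10190, 9930, 10275, 10020]

if __name__ == "__main__":
    bounds = fit_thresholds(DAILY_ROW_COUNTS)
    print(f"normal range: {bounds[0]:.0f} to {bounds[1]:.0f}")
    for today in (10100, 4200):
        label = "ABNORMAL" if is_abnormal(today, bounds) else "ok"
        print(f"{today}: {label}")
```

For a long-tailed metric, the mean and standard deviation are pulled around by the tail, so bounds based on quantiles or on log-transformed values are probably a better fit there.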

3. Data Model Management

 1,Use machine learning to analyze how often data entities in the databases are referenced, automatically identify the intrinsic relationships between data models through clustering algorithms, and detect and evaluate the quality of the data models. I copied this one, but it is close to what I wrote above.
 2,Metadata management based on a knowledge graph: data link analysis, lineage analysis, and aggregation by application scenario.
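
Lineage analysis in item 2 boils down to walking a graph of "table A is loaded from table B" edges. Here is a minimal sketch with a plain dictionary instead of a real knowledge graph; the table names are hypothetical and only mimic the interface, intermediate, summary and report layers mentioned earlier.

```python
from collections import deque

# Hypothetical lineage edges: target table -> tables it is loaded from.
LINEAGE = {
    "rpt_traffic_monthly": ["sum_traffic_daily"],
    "sum_traffic_daily": ["mid_traffic_detail", "dim_user"],
    "mid_traffic_detail": ["stg_traffic_cdr"],
    "dim_user": ["stg_crm_user"],
    "stg_traffic_cdr": [],
    "stg_crm_user": [],
}

def upstream(table, lineage):
    """Return all direct and indirect sources of a table, nearest first (BFS)."""
    seen, order, queue = {table}, [], deque([table])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

def downstream(table, lineage):
    """Return all tables that directly or indirectly depend on a table."""
    reverse = {}
    for target, sources in lineage.items():
        for source in sources:
            reverse.setdefault(source, []).append(target)
    # Walking the reversed graph "upstream" gives the impact set.
    return upstream(table, reverse)

if __name__ == "__main__":
    print("sources of the report:", upstream("rpt_traffic_monthly", LINEAGE))
    print("impacted by stg_crm_user:", downstream("stg_crm_user", LINEAGE))
```

upstream answers "where could this bad number have come from", and downstream answers "which reports are affected if this source is wrong".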

4. Data Transfer Monitoring

 1,Use machine learning to analyze the history of data arrival and predict when data will arrive, which helps guarantee timely data processing and respond to the impact of late data. This fits well in a data warehouse.
 2,The situation I am facing is more complicated, and I have not thought it through yet.
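
As a rough illustration of item 1, the sketch below predicts the arrival time of a daily feed from its history using the median, and sets an alert time at the median plus two standard deviations. This is simple statistics standing in for machine learning; the arrival times are invented, and the code assumes arrivals never cross midnight.

```python
import statistics

def to_minutes(text):
    """Convert 'HH:MM' to minutes after midnight."""
    hours, mins = text.split(":")
    return int(hours) * 60 + int(mins)

def to_hhmm(total):
    """Convert minutes after midnight back to 'HH:MM'."""
    total = int(round(total))
    return f"{total // 60:02d}:{total % 60:02d}"

def predict_arrival(history, k=2.0):
    """Return (expected arrival, alert time) as 'HH:MM' strings.

    The expected time is the median of past arrivals, which is robust to
    the occasional very late day; the alert time adds k standard deviations.
    """
    values = [to_minutes(t) for t in history]
    expected = statistics.median(values)
    alert_after = expected + k * statistics.stdev(values)
    return to_hhmm(expected), to_hhmm(alert_after)

def is_late(now, history, k=2.0):
    """True when the feed has not arrived by the alert time."""
    _, alert_after = predict_arrival(history, k)
    return to_minutes(now) > to_minutes(alert_after)

# Hypothetical arrival times of one daily feed over the past week.
ARRIVALS = ["02:10", "02:05", "02:20", "02:15", "03:40", "02:08", "02:12"]

if __name__ == "__main__":
    expected, alert_after = predict_arrival(ARRIVALS)
    print(f"expected around {expected}, alert if not arrived by {alert_after}")
```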

5. Data Problem Discovery

 This means locating specific data items that have no fixed value range, such as ID card numbers, addresses, organization names and regularly formatted business numbers, and applying word-level, sentence-level and semantic analysis to them, in order to strengthen data quality and data security management.
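
For the most regular of these items, pattern matching already goes a long way before any semantic analysis is needed. The sketch below assumes that "identity cards" means 18-character mainland China resident ID numbers, and checks the format plus the check digit defined in GB 11643-1999. The function names and sample values are made up for illustration.

```python
import re

# 17 digits followed by a check character (digit or X).
ID_PATTERN = re.compile(r"^[0-9]{17}[0-9X]$")
# Weights and check codes from the GB 11643-1999 check digit scheme.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = "10X98765432"

def looks_like_id_number(value):
    """True when the value has the ID number format and a valid check digit."""
    text = str(value).strip().upper()
    if not ID_PATTERN.match(text):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(text[:17], WEIGHTS))
    return CHECK_CODES[total % 11] == text[17]

def id_number_ratio(values):
    """Share of non-empty values in a column that look like ID numbers."""
    values = [v for v in values if v not in (None, "")]
    if not values:
        return 0.0
    return sum(looks_like_id_number(v) for v in values) / len(values)

if __name__ == "__main__":
    # Hypothetical sample of one unlabeled column.
    column = ["11010519491231002X", "n/a", None, "11010519491231002x"]
    print(f"share of ID-like values: {id_number_ratio(column):.2f}")
```

A column with a high share is a candidate for both a format check rule and a data security classification. Addresses and organization names have no such checksum, which is where word and semantic analysis would have to take over.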

That's all for the time being. Artificial intelligence can play a role in a small part of data quality management, but most of it still depends on people and hand-written code.

Posted by dilum on Tue, 02 Nov 2021 12:33:53 -0700