AI and Databases

Although it is probably true to say that "AI is reshaping the landscape of database technology", it's also an over-simplification. The history of relational databases is more complex than this, because the teams that created the first relational database systems pioneered many of the ideas and technologies that underpin machine learning and artificial intelligence: they needed self-adapting systems to provide efficient query plans. However, the current generation of general-purpose AI has much to offer in assisting with the running, monitoring and maintenance of relational databases.

The story so far

While AI has overtaken our collective vernacular in the past few years, it isn’t a completely new idea. We have been trying to make computers do more and more of our tasks for many years. In this section I will cover some of the ways this has affected how we access and process data.

SQL: asking questions rather than specifying how to get an answer.

SQL served as a direct precursor to today's Natural Language Interfaces (NLIs) by allowing users to interact with databases more naturally. As a language, SQL was originally envisaged as a plain-English way of asking questions of a database, indicating what data was required rather than how to retrieve it, without needing to know a computer language. Keywords such as 'select', 'insert', 'delete' and 'create' aren't gobbledygook; they have cultural meaning.

The idea was to democratize access to data, allowing users such as business analysts, managers, and other non-technical stakeholders to get specific information without needing to rely on database administrators or data scientists. SQL bridged the gap between human language and machine operations.

This good intention was there from the start, but it is no longer obvious that SQL uses a human-oriented language; the geeks eventually muddied the waters with additions such as CTEs and windowing functions, best hidden behind views. Besides which, early users may have tapped SQL queries directly into a terminal, but database applications are nowadays much more insulated from what was once an easily intelligible, open-ended way of querying data.

Strategies for accessing the right data.

Relational databases were faced with the SQL language, which merely expresses what data is required. It is left to the relational database to work out the most economical and fastest strategy to get the data that meets those requirements. This is done by the query optimizer, often ranked among the most complex production software in existence. To create a database system is easy; to create one at enterprise scale, with query planning, query optimization and parallelism, is very difficult.

Faced with a query, the query optimizer must come up with a query plan. To do this it must explore an enormous search space in a database that could be rapidly changing in size or data distribution. It must come up with just the likeliest candidates from all the possible plans, the number of which can grow exponentially with the complexity of the query and of the database. It must decide how to best distribute work across multiple processors or threads by estimating the overhead of parallel execution and of synchronizing tasks.

How can it select the likeliest plans? The query optimizer relies on heuristics and rules to narrow down the likely strategies and avoid exhaustive searches of query plans. It then uses cost-based algorithms: it attaches a 'cost' to each candidate execution plan using a cost model that factors in I/O operations, CPU usage, memory and, in distributed environments, even network latency. To achieve this, it must also understand the underlying hardware and the data distribution.
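To make the idea of a cost model concrete, here is a deliberately naive sketch in Python. The constants, formulas and plan shapes are invented purely for illustration; a real optimizer's cost model is far more elaborate.

```python
# A toy cost model: compare a full table scan with an index seek plus lookups.
# The constants and formulas below are invented for illustration only.

from dataclasses import dataclass

@dataclass
class TableStats:
    rows: int          # total rows in the table
    pages: int         # data pages on disk
    selectivity: float # estimated fraction of rows the predicate returns

IO_COST_PER_PAGE = 1.0     # relative cost of reading one page sequentially
CPU_COST_PER_ROW = 0.001   # relative cost of evaluating the predicate on one row
LOOKUP_COST = 1.2          # relative cost of one random key lookup

def cost_of_table_scan(t: TableStats) -> float:
    # Read every page, test the predicate against every row.
    return t.pages * IO_COST_PER_PAGE + t.rows * CPU_COST_PER_ROW

def cost_of_index_seek(t: TableStats) -> float:
    # Seek into the index, then do one random lookup per qualifying row.
    qualifying_rows = t.rows * t.selectivity
    return qualifying_rows * LOOKUP_COST + qualifying_rows * CPU_COST_PER_ROW

stats = TableStats(rows=1_000_000, pages=20_000, selectivity=0.002)
plans = {
    "table scan": cost_of_table_scan(stats),
    "index seek": cost_of_index_seek(stats),
}
print(min(plans, key=plans.get), plans)
```

The point is simply that the optimizer never runs the plans to compare them; it ranks them by these estimated costs, which is why the quality of its estimates matters so much.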

It must also estimate the number of rows handled by each stage of a query, to avoid the consequences of having to handle too many rows at any stage. To do this, it needs statistics about the distribution of data and, when faced with skewed distributions or multi-column dependencies, some 'guesswork' or probabilistic modelling.
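To give a flavour of how statistics feed those row estimates, the sketch below builds a crude equi-width histogram over a column and uses it to estimate how many rows a range predicate will return, without scanning the data. It is only a caricature: real systems maintain richer histograms, density information and multi-column statistics.

```python
# Estimate how many rows match "value <= threshold" from a histogram rather
# than by scanning the data, mimicking (crudely) optimizer cardinality estimation.

import random

def build_histogram(values, bucket_count=10):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bucket_count or 1
    buckets = [0] * bucket_count
    for v in values:
        idx = min(int((v - lo) / width), bucket_count - 1)
        buckets[idx] += 1
    return lo, width, buckets

def estimate_rows_below(threshold, lo, width, buckets):
    estimate = 0.0
    for i, count in enumerate(buckets):
        bucket_lo, bucket_hi = lo + i * width, lo + (i + 1) * width
        if threshold >= bucket_hi:
            estimate += count                 # whole bucket qualifies
        elif threshold > bucket_lo:
            fraction = (threshold - bucket_lo) / width
            estimate += count * fraction      # assume uniform spread within the bucket
    return round(estimate)

data = [random.gauss(500, 150) for _ in range(100_000)]  # values cluster around 500
lo, width, buckets = build_histogram(data)
print("estimated:", estimate_rows_below(400, lo, width, buckets))
print("actual:   ", sum(1 for v in data if v <= 400))
```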

The design of the query optimizer has had to respond to increasingly complex queries due to the introduction of new SQL constructs, including complex analytical functions, recursive queries, and windowing functions, while maintaining compatibility with legacy syntax.

Although the database system can adapt in the short term, and even inform administrators via reports of missing indexes, it cannot yet adapt a database permanently without assistance.

Growth areas in exploiting AI for databases

There is a lot of work required to develop and maintain a production database. Data must be checked for correctness; it must be possible to generate fake data for tests. We must check for anomalies that might indicate intrusion or criminal activity. It is also important to check performance and predict stress or potential failure before it occurs.

We need to optimise the database design in the face of increasing use and the growth of the data. We must make it as easy as possible to add new or enhanced features to a database in a way that is robust and easy to maintain.

All these tasks can, to a greater or lesser extent, already be helped by AI.

AI in Database Test Data Generation

There are already several tools that use AI for generating database test data, such as Tonic.ai, Syntho, Mostly AI, Datomize, Gretel.ai and SDV. However, there are many possibilities that haven't yet been exploited. The general aim is to create synthetic data that mimics the characteristics, patterns, and distributions of real-world data, which is crucial for testing, development, and the training of users. Some tools anonymize existing data to generate test data, whereas others can generate it from scratch, drawing on the general 'cultural milieu' of what such data looks like.

The reason for using AI to do this is to better reflect production environments, leading to more accurate testing outcomes. It helps to test the scalability and efficiency of the database design under various load conditions by generating large volumes of test data quickly. It also tests the robustness of the database: its ability to cope with 'edge' cases that test the assumptions behind the constraints and data-entry routines, ensuring that bad data cannot enter the system. In addition, it can be used to test for security issues such as SQL injection and other avenues of penetration.

The main difficulty with automatic data generation is that it is far harder than it might appear to generate realistic data, because we find it much easier to see that data is incorrect than to explain how to get it right. There are relationships within rows. They can be simple, such as the relationship between names and addresses, but they can also be more subtle.
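As a trivial illustration of that difficulty, the sketch below uses the Python Faker library to generate customer rows in which the columns agree with one another. The column names and the rule linking name to email address are invented for illustration; the point is that plausible values generated independently would contradict each other, and the subtler cross-column and cross-table relationships are exactly what AI-driven generators try to learn from real data.

```python
# Generate fake customer rows whose columns are mutually consistent: the email
# address is derived from the generated name rather than generated independently.

from faker import Faker

fake = Faker("en_GB")

def fake_customer():
    first, last = fake.first_name(), fake.last_name()
    return {
        "full_name": f"{first} {last}",
        # Derive the email from the name so the two columns agree with each other.
        "email": f"{first}.{last}@{fake.free_email_domain()}".lower(),
        "city": fake.city(),
        "postcode": fake.postcode(),
    }

for _ in range(3):
    print(fake_customer())
```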

AI could be invaluable in defining these complex relationships in a systematic way. If we can find ways of discovering and representing these relationships, it will help both data generation and data checking.

AI in Database Administration, Monitoring, Management and Optimization

The monitoring and maintenance work of a Database Administrator today is broadly similar to what it was twenty years ago. There has been progress in some areas, such as processing alerts to decide which of them warrant a notification, but the idea of a self-healing relational database is still some way from reality. In the meantime, there are AI algorithms that can monitor database performance and automatically adjust configurations, for indexing and caching for example, to optimize performance. This reduces the need for manual tuning and allows databases to self-optimize in response to changing workloads.

Predictive analytics is potentially more useful for maintenance: AI should be better at predicting potential issues such as storage bottlenecks, query slowdowns, or hardware failures before they become serious, allowing the database administrator to plan ahead for storage, do proactive maintenance and take all necessary steps to minimize downtime.
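As a trivial example of the kind of prediction involved, the sketch below fits a straight line to recent database file sizes and projects when the volume will fill. The figures are invented, and a real predictive-maintenance tool would use far richer models and telemetry, but the principle is the same: turn monitoring data into an early warning.

```python
# Fit a linear trend to recent database file sizes (GB per day) and estimate how
# many days remain before the volume is full. The numbers are invented.

import numpy as np

daily_size_gb = np.array([410, 414, 419, 421, 427, 432, 438, 441, 447, 453])
days = np.arange(len(daily_size_gb))
volume_capacity_gb = 600

slope, intercept = np.polyfit(days, daily_size_gb, 1)   # GB of growth per day
days_until_full = (volume_capacity_gb - daily_size_gb[-1]) / slope

print(f"growing by about {slope:.1f} GB/day; "
      f"roughly {days_until_full:.0f} days until the volume is full")
```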

An obvious benefit would come from data security and anomaly detection: it's not just about alerting on repeated failed passwords; we need to know about all sorts of unexpected events. AI models can identify a range of unusual patterns in database access that would warrant investigation by an admin, helping to detect and respond to potential security threats or data breaches around the clock.
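A minimal sketch of that idea, assuming access events have already been reduced to a few numeric features (hour of day, rows read, distinct tables touched, all invented here for illustration) and using scikit-learn's IsolationForest. A production system would derive many more features from audit logs and manage the model's lifecycle properly.

```python
# Flag unusual database access events with an Isolation Forest.
# The feature set and figures are invented for illustration.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Simulated 'normal' working-hours activity: modest reads over a few tables.
normal = np.column_stack([
    rng.integers(8, 18, size=500),          # hour of day
    rng.normal(2_000, 500, size=500),       # rows read
    rng.integers(1, 5, size=500),           # distinct tables touched
])

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# A 3 a.m. session reading half a million rows across 40 tables.
suspicious = np.array([[3, 500_000, 40]])
print(model.predict(suspicious))   # -1 means 'anomaly', 1 means 'normal'
```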

AI for the more complex ETL processes

In large enterprises with complex data ecosystems, importing and updating data entails a complex process of integration and transformation. This process is greatly simplified by techniques such as Automated Data Mapping and Schema Matching. If data is coming from several disparate sources, and sometimes changing in its metadata, AI can automatically map the data into a unified schema, simplifying the process of integrating data from disparate systems. AI can also optimize ETL workflows by dynamically adjusting the data pipelines based on current conditions, such as workload.
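To show the flavour of automated data mapping, here is a deliberately simple sketch that matches incoming column names to a target schema by string similarity alone. The column names are invented, and real AI-driven schema-matching tools also compare data types, value distributions and learned embeddings rather than just names.

```python
# Map columns from an incoming feed onto a target schema by name similarity.
# The column names are invented; real tools use much richer evidence.

from difflib import SequenceMatcher

target_schema = ["customer_id", "full_name", "email_address", "order_total"]
incoming_columns = ["CustID", "FullName", "E-mail", "TotalValue"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

mapping = {
    src: max(target_schema, key=lambda tgt: similarity(src, tgt))
    for src in incoming_columns
}
print(mapping)
```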

For extracting data from a variety of legacy sources, such as paper documents, it is now possible to use natural language processing to identify the fields from which a business needs to extract data. This has been used for complex searches in criminal investigations or audits, but it is also useful if a business wants to extract customer information from purchase orders or invoices that are paper-based or in a different format. Essentially, you just specify the fields you need, and the tool will extract them regardless of the format.
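A minimal sketch of that kind of extraction, using spaCy's general-purpose named-entity recogniser. It assumes the small English model has been installed, and the invoice text is invented; a purpose-built document-extraction service would go much further, tying entities to the specific fields a business asks for.

```python
# Pull likely field values (organisations, dates, money amounts) out of free-form
# invoice text with spaCy's named-entity recogniser. Assumes the model has been
# installed with:  python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

invoice_text = (
    "Invoice from Acme Widgets Ltd dated 3 March 2024. "
    "Total due: $1,499.50 payable within 30 days."
)

doc = nlp(invoice_text)
for ent in doc.ents:
    print(f"{ent.label_:>10}: {ent.text}")   # labels such as ORG, DATE, MONEY
```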

AI can also provide a different approach to detecting bad data and inconsistencies before they are caught by database check constraints: AI algorithms can scan datasets for errors, inconsistencies, and anomalies before import.

AI and Developer support

Automatic code generators have been with us since around 1980, when a program generator was introduced that was optimistically called 'The Last One'. It has taken a long time to develop a way of generating code effectively. In the meantime, AI has already proved to be quite useful in supporting the developer and is excellent for all those sticking points that would otherwise result in searching 'Stack Exchange'.

There are products that aim to provide direct support for coders, such as GitHub Copilot, DataLab's AI Assistant, Anaconda Assistant or Jupyter AI. These tools are best suited to developers looking for efficient, in-line coding assistance directly within their IDE or notebook environment. They are most helpful with languages that are unnecessarily verbose or arcane, and they work best when providing real-time, context-aware code suggestions. Because they are deeply integrated with the chosen development environment, and can improve as they learn from the developer's code, they can benefit the quality of the developer's work.

More general-purpose AI works best as a general-purpose assistant for coding and beyond. It’s particularly valuable for its versatility, ability to explain complex concepts, and help across a broad range of topics. However, because it has no direct integration with development environments and cannot access your codebase, it is less useful for continuous, context-aware coding assistance.

AI and the vision of the self-tuning database.

What if relational databases could inform us of database design issues? Metadata queries can already inform us of missing indexes. Some systems can also suggest schema modifications to improve performance and efficiency.
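For example, SQL Server exposes the indexes that the optimizer wished it had, while compiling recent plans, through its missing-index DMVs. A minimal sketch of reading them from Python with pyodbc might look like this (the connection string is a placeholder; such suggestions still need a human eye before they become schema changes):

```python
# Query SQL Server's missing-index DMV to see which indexes the optimizer has
# wanted while compiling recent query plans. The connection string is a placeholder.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=your-server;"
    "DATABASE=your-database;Trusted_Connection=yes;"
)

sql = """
SELECT statement AS table_name,
       equality_columns,
       inequality_columns,
       included_columns
FROM   sys.dm_db_missing_index_details;
"""

for row in conn.cursor().execute(sql):
    print(row.table_name, row.equality_columns, row.included_columns)
```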

From my own experience, I'd suggest that most performance problems in relational databases come from a poor fundamental design, which in turn arises from a misunderstanding of the relational model. Whereas it is possible to improve the performance of a poor design by adroit indexing and other techniques, that just buys time. AI that can help with normalization would be a great asset, but this would require a strong understanding of the business or commercial processes that the database supports.

What if relational systems were able to correct and improve the design of databases themselves? In my experience, I’d say that this could be a misguided ambition unless there was supervision. AI isn’t infallible: it would be better to run the idea past an experienced database developer or admin first and put the modifications through a deployment process. Having said that, we already have relational database systems that are self-tuning in the sense that they can optimize the database schema for performance, indexing and storage, based on usage patterns.

Automated Schema Design: although the entire domain of a substantial database application is beyond AI's reach, AI tools can assist in designing some aspects of database schemas by analyzing the data and recommending optimal structures. This is especially useful for ETL systems dealing with more complex datasets, where the best schema may not be immediately obvious. These tools can assist in generating schemas, optimizing performance, ensuring consistency, and even predicting future needs based on current design choices.

For the simpler databases, AI can help to create adaptive data models that evolve as the underlying data changes. This can be useful in fast-changing industries where the nature of the data can change rapidly.

Query Execution in Hybrid and Distributed architectures

Databases are increasingly adopting hybrid architectures that combine on-premise and cloud storage, and use distributed systems. These require a pre-processor and orchestrator that works on top of traditional databases and their query optimizers. Its job is to decide how to best distribute and execute queries across several environments, while the actual execution in each environment still relies on the native optimizers. Such federated query engines or middleware solutions already exist (e.g. Presto/Trino, Apache Drill, Apache Calcite or Dremio). These federated query engines can handle distributed queries across several data sources and provide a degree of optimization. These engines can be enhanced with AI capabilities that help them to adapt to changing conditions by optimizing queries dynamically based on data location, network latency, and resource availability.
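The sketch below gives a toy illustration of the routing decision such an orchestrator makes: given where copies of the data live and current estimates of latency and load, pick the cheapest place to run a query fragment. The sources, metrics and weighting are all invented for illustration; engines such as Trino use far more sophisticated, adaptive cost models, and the AI enhancement lies in learning and updating those estimates as conditions change.

```python
# A toy 'federated query router': choose where to execute a query fragment based
# on estimated data scanned, network latency and current load. All figures and
# the weighting formula are invented for illustration.

from dataclasses import dataclass

@dataclass
class Source:
    name: str
    scan_gb: float        # data this fragment would scan at that source
    latency_ms: float     # network latency to ship the result back
    load_factor: float    # 1.0 = idle, higher = busier

def estimated_cost(s: Source) -> float:
    # Invented weighting: scanning dominates, adjusted for load, plus latency.
    return s.scan_gb * s.load_factor * 100 + s.latency_ms

candidates = [
    Source("on-prem warehouse", scan_gb=1.2, latency_ms=2, load_factor=3.0),
    Source("cloud replica",     scan_gb=1.2, latency_ms=45, load_factor=1.1),
    Source("data lake extract", scan_gb=9.0, latency_ms=60, load_factor=1.0),
]

best = min(candidates, key=estimated_cost)
print(f"run fragment on: {best.name} (estimated cost {estimated_cost(best):.0f})")
```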

Conclusion

AI is profoundly influencing both general and relational database use, offering significant benefits in terms of efficiency, accessibility, and security. The shift towards more intelligent, self-managing, and user-friendly databases is likely to continue as AI technologies advance. As a result, database professionals will increasingly need to incorporate AI tools into their workflows to stay competitive and meet the evolving demands of the industry.

About the author

Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 40 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.
