In previous blog posts, I've discussed the use of Machine Learning for Capacity Management when monitoring your PostgreSQL infrastructure. I’ve also covered some experiments I undertook that investigated the possibility of using Machine Learning for Text Prediction to improve the user experience when searching on the pgAdmin website.
In this post, I'll explore some of the options available to us to integrate machine learning techniques directly into our PostgreSQL database servers.
Why?
In one word, flexibility.
Our data is already in the database, and we have the powerful capabilities of PostgreSQL and SQL to select, aggregate, summarise and otherwise manipulate it as we need for analysis.
This isn't dissimilar to what we might do if writing code to analyse or predict data outside of the database; we need to feed numeric data into a neural network, often normalised to similar ranges of values (to prevent a wide range of values in one input from overshadowing the others). This work can be made much easier with SQL, as can other tasks such as selecting the appropriate data in the first place and performing any other calculations required on the input data. We can use common table expressions (CTEs), window functions, mathematical functions such as log() and sqrt(), and anything else we can think of, wrapped up in stored functions if we wish.
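As a purely illustrative sketch, with made-up table and column names, the kind of preparation query I have in mind might look like this: one feature scaled into the 0-1 range, another log-transformed, and the value we want the network to learn kept as the target.

```sql
-- Illustrative only: the marketing table and its columns are invented.
WITH raw AS (
    SELECT spend::double precision       AS spend,
           impressions::double precision AS impressions,
           sales::double precision       AS sales
    FROM marketing
)
SELECT (spend - min(spend) OVER ()) /
       (max(spend) OVER () - min(spend) OVER ()) AS spend_scaled,  -- 0-1 range
       log(impressions + 1)                      AS impressions_log,
       sales                                     AS target
FROM raw;
```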
Additionally, data is fed into a neural network as vectors or matrices. These data structures map neatly onto those we already use in a database; a vector is essentially what we would call a record or tuple, and a matrix is a data or result set.
These properties give us significant advantages; we can easily separate the code for the neural network from the data preparation code, allowing us to easily modify or change what we're analysing.
What?
In theory, we can use neural networks within PostgreSQL to undertake many of the tasks they are commonly applied to:
Regression Analysis
This is perhaps the most common and obvious use case. Given a set of input values we can predict an output value. For example, we might train a network to predict the likelihood of a customer not renewing their contract based on their interaction with the support team, given data or features such as the number of tickets logged, the time to resolution, the severity and so on. Another example might be to predict the effect on sales based on the amount of money spent on different marketing channels such as Google Ads vs. newspaper adverts vs. conference sponsorship and so on.
Text Classification
A website collecting comments or reviews from users may want to try to classify those comments automatically. Was the review positive or negative? Is the user angry or happy?
Image Classification/Recognition
Whilst it's certainly possible to perform image classification and recognition in the database, it's probably not a great idea, simply because for performance reasons you generally shouldn't be storing images directly in the database unless you have transactional requirements that absolutely cannot be avoided.
Time Series Prediction
As discussed in my earlier blog, it may be advantageous to analyse and predict monitoring data, to aid in capacity management or alerting.
How?
So how can we do this? There are a couple of options available to us, the second of which is infinitely flexible.
In the examples below I make use of the Boston Housing Dataset. This is a popular dataset used in regression experiments, typically to predict house prices based on various features of the location, such as proximity to the river, taxes, average age of properties and so on.
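For the examples that follow I'll assume the data has been loaded into a table along these lines (the column types are a matter of taste, and the comments are my own summary of the dataset's features):

```sql
-- An illustrative schema for the Boston Housing data.
CREATE TABLE housing (
    crim    double precision,   -- per-capita crime rate by town
    zn      double precision,   -- proportion of residential land zoned for large lots
    indus   double precision,   -- proportion of non-retail business acres per town
    chas    integer,            -- 1 if the tract bounds the Charles River, else 0
    nox     double precision,   -- nitric oxides concentration
    rm      double precision,   -- average number of rooms per dwelling
    age     double precision,   -- proportion of owner-occupied units built before 1940
    dis     double precision,   -- weighted distances to employment centres
    rad     integer,            -- index of accessibility to radial highways
    tax     integer,            -- property tax rate per $10,000
    ptratio double precision,   -- pupil-teacher ratio by town
    b       double precision,   -- a derived demographic measure
    lstat   double precision,   -- % lower status of the population
    medv    double precision    -- median home value, in $1000s
);
```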
Apache MADlib
Apache MADlib is an extension to PostgreSQL (or Greenplum) that provides "Powerful machine learning, graph, statistics and analytics for data scientists". Whilst the installation is a little cumbersome, once installed it provides an array (no pun intended) of additional functions that can be used via SQL to perform various data analysis tasks, including regression and classification.
Once installed, it's very easy to perform a regression analysis.
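Something along the following lines does the job; the choice of features is illustrative, and I'm assuming the data has been loaded into the housing table sketched earlier. The model is trained in a CTE using MADlib's linregr aggregate, then used to predict the price of every property in the table:

```sql
WITH model AS (
    -- Train a linear regression model; the leading 1 is the intercept term.
    SELECT (madlib.linregr(medv,
                           ARRAY[1, crim, rm, tax, ptratio, lstat])).coef
    FROM housing
),
predictions AS (
    -- Apply the trained coefficients to every row.
    SELECT h.medv,
           madlib.linregr_predict(m.coef,
               ARRAY[1, h.crim, h.rm, h.tax, h.ptratio, h.lstat]) AS predict
    FROM housing h CROSS JOIN model m
)
SELECT medv,
       predict,
       medv - predict AS residual,
       sqrt(avg(power(medv - predict, 2)) OVER ()) AS rmse
FROM predictions;
```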
The medv column shows the actual house price (in thousands of dollars), whilst the predict column shows the house prices that were predicted, with the difference in the residual column. The rmse column shows the root-mean-squared-error between the actual and predicted values over the entire dataset (which is why the same value is seen in each row).
This query, which includes the training step, is extremely fast; running on my laptop, it consistently executes in 70-100ms.
It is worth noting that no data preparation has been performed in this test; in practice we would analyse the data ahead of time to remove outliers and limit training to highly correlated features.
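PostgreSQL itself gives us a head start here; for example, the corr() aggregate makes it easy to see which features are worth training on (again assuming the housing table above):

```sql
-- Check how strongly each candidate feature correlates with the house price;
-- weakly correlated features are candidates for removal before training.
SELECT corr(medv, rm)      AS rm_corr,
       corr(medv, lstat)   AS lstat_corr,
       corr(medv, tax)     AS tax_corr,
       corr(medv, ptratio) AS ptratio_corr
FROM housing;
```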
The code I used can be found on GitHub.
Procedural Languages
Procedural languages in PostgreSQL allow us to write functions and stored procedures in a wide variety of programming languages, in addition to C, SQL and PostgreSQL's own pl/pgsql. PostgreSQL itself ships with Python, Perl and Tcl support, though these may have to be explicitly installed depending on the packaging. There are also Open Source projects that add support for Java, JavaScript (V8), Lua, R and more.
pl/python3 is a very common choice these days; Python is well supported by machine learning libraries such as TensorFlow and PyTorch, and has rich libraries such as NumPy and Pandas for numerical data and analytics, both of which are extremely useful in machine learning. Using pl/python3 (or other procedural languages) offers us the ultimate in flexibility; pretty much anything we can do in the underlying language, we can do in a SQL function or stored procedure.
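As a trivial, purely illustrative example of the plumbing, a pl/python3 function can call straight into NumPy; the function below simply scales an array of values into the 0-1 range:

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION normalise(vals double precision[])
RETURNS double precision[]
AS $$
    import numpy as np

    a = np.array(vals, dtype=float)
    rng = a.max() - a.min()
    if rng == 0:
        return [0.0] * len(a)
    # Scale the values into the 0-1 range, as we might before training.
    return ((a - a.min()) / rng).tolist()
$$ LANGUAGE plpython3u;

SELECT normalise(ARRAY[10, 20, 40]::double precision[]);  -- {0, 0.333..., 1}
```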
Originally I wrote a function to analyse the Boston Housing dataset much as can be seen in the MADlib example, the code for which can be found on GitHub. This includes a single function, housing_v, that takes as input the same features as are found in the original dataset. It then trains a model based on the dataset, and uses that model to predict the house price from the feature values passed to the function.
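A call looks something like the following; the feature values are made up, and I'm assuming the arguments are passed in the same order as the dataset's columns:

```sql
-- Predict the price of a property from its features. The assumed argument
-- order is: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat.
SELECT housing_v(0.02, 25.0, 5.0, 0, 0.45, 6.5, 45.0,
                 5.0, 4, 300, 16.0, 390.0, 8.0);
```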
In my testing the result was pretty accurate, though this is in part explained by the fact that the function automatically removes outliers from the dataset prior to training. It is, however, quite slow; building, training and predicting took around 13 seconds. That said, I haven't spent much time optimising the code, and there is a lot of debugging code in there, as well as code to analyse the dataset and produce a number of graphs showing data outliers, distribution and training/validation loss.
I later modified that code to make it generic, so it can be used with any dataset. This can also be found on GitHub and includes a number of functions:
tf_analyse(
data_source_sql text,
output_name text,
output_path text
)
Parameters:
- data_source_sql: An SQL query returning at least 2 rows and 2 columns of numeric data to analyse.
- output_name: The name of the output to use in titles etc.
- output_path: The path of a directory under which to save generated graphs. Must be writable by the database server's service account (usually postgres).
tf_model(
data_source_sql text,
structure integer[],
output_name text,
output_path text,
epochs integer DEFAULT 5000,
validation_pct integer DEFAULT 10,
test_pct integer DEFAULT 10
) RETURNS double precision
Parameters:
- data_source_sql: An SQL query returning at least 5 rows and 3 columns of numeric data to analyse.
- structure: An array of integers indicating the number of neurons in each of an arbitrary number of layers. A final output layer will be added with a single neuron.
- output_name: The name of the output to use in titles etc.
- output_path: The path of a directory under which to save generated graphs and the model. Must be writable by the database server's service account (usually postgres).
- epochs: The maximum number of training epochs to run (default: 5000).
- validation_pct: The percentage of the rows returned by the query specified in data_source_sql to use for model validation (default: 10).
- test_pct: The percentage of the rows returned by the query specified in data_source_sql to use for model testing (default: 10).
Returns: The Root Mean Square Percentage Error calculated from the evaluation of the test data set.
tf_predict(
input_values double precision[],
model_path text
) RETURNS double precision[]
Parameters:
- input_values: An array of input values, or an array of arrays of input values, e.g. '{2, 3}' or '{{2, 3}, {3, 4}}'.
- model_path: The full path to a TensorFlow model saved in .h5 format.
Returns: An array of predicted values.
As with the previous example, these functions include a lot of debugging code. A typical workflow might look like this:
1. Run tf_analyse, passing it the dataset to analyse. Review the resulting text output and graphs, and adjust the input data as required to optimise training.
2. Run tf_model to train and save the model.
3. Run tf_predict as desired, passing it input features to predict the output (a sketch of these calls follows below).
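Against the Boston Housing data, those calls might look something like the following. The feature list, network structure, output paths and saved model file name are all illustrative choices, and I'm assuming the last column returned by data_source_sql is treated as the value to predict:

```sql
-- Step 1: analyse the candidate training data; text output and graphs are
-- written under the output path.
SELECT tf_analyse(
    'SELECT crim, rm, tax, ptratio, lstat, medv FROM housing',
    'Boston Housing',
    '/tmp/housing'
);

-- Step 2: train and save a model with two hidden layers of 32 and 16 neurons;
-- the return value is the Root Mean Square Percentage Error on the test set.
SELECT tf_model(
    'SELECT crim, rm, tax, ptratio, lstat, medv FROM housing',
    ARRAY[32, 16],
    'Boston Housing',
    '/tmp/housing'
);

-- Step 3: predict a price from a set of feature values (made-up numbers, in
-- the same order as the query's input columns). The model file name under
-- the output path is an assumption.
SELECT tf_predict(
    ARRAY[0.02, 6.5, 300, 16.0, 8.0]::double precision[],
    '/tmp/housing/boston_housing.h5'
);
```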
Conclusion
Performing machine learning tasks such as regression analysis within PostgreSQL offers us the power of SQL and PostgreSQL to access and prepare our data, coupled with the ability to choose the technology we prefer for the analysis.
Apache MADlib gives us a simple and very fast method of performing regression and other types of analysis, whilst utilising PostgreSQL's pluggability through procedural languages offers us the flexibility to use the machine learning technology of our choice, albeit at the cost of requiring code to be written.
Look out for my next blog post in which I'll go into much deeper technical detail on how regression analysis can be performed in PostgreSQL using TensorFlow.