What is data mining?


An explanation of data mining and its benefits. This page covers the history of data mining, methods and techniques, and the technology challenges it presents. It also includes examples of how data mining can be used in different industry verticals.


Data mining definition

Data mining describes the process of discovering valuable insights by collecting and comparing data from disparate and often unconnected sources. Computational processes identify patterns and correlations within large data sets, which organisations can use for a diverse range of tasks: better understanding their customers, improving efficiency, finding bottlenecks in their distribution systems and even predicting behaviour.


Data mining is also used to detect anomalies: unpredictable errors in a process that are only revealed by digging into a dataset. This applies to a broad range of use cases, such as detecting bugs in software, problems in supply chains or production processes, abuse of systems, or system failures.

A century in the making

Even before computers were ubiquitous, data was analysed in this way, but the process was manual, slow and required skilled analysts to collate, interpret and present the data in a meaningful form. The term ‘data mining’ was coined in the 1990s; the practice was previously referred to as knowledge discovery, using databases that were basic by today’s standards.


Technology was first used to mine data more than 100 years ago, when the US Census Bureau reduced the time it took to analyse census results from 10 years to just a few months, using punch cards and a tabulation machine.


These days data mining software adds artificial intelligence and machine learning to the original data science discipline of statistics, with cloud computing bringing extra processing power and data storage capabilities.


These advances in technology have resulted in an explosion of data mining, with ever-more complex data sets being analysed to uncover relevant insights. The intelligence gained is used in a diverse range of verticals, including retail, banking, manufacturing, telecommunications, agriculture and insurance. Use cases include selling products online, risk analysis, discovering financial fraud, or even optimising the growth of vegetables on farms.

Characteristics of data mining

Before any data is involved, organisations need to set their business objectives, with stakeholders and data scientists working together to define a business problem and its context, which in turn inform the questions and parameters the data mining project will encompass.


Next, data scientists identify the data that will help answer these questions. The process of mining data for valuable information relies on accurate, reliable data collected from relevant sources, so choosing the right data is key.


Once the data has been identified, it needs to be cleaned up and structured into a format that the available data mining tools can work with, which includes removing duplicate records and outliers. Next comes the process of building models and mining the data for patterns and correlations. Depending on the complexity of the data, deep learning algorithms can also be applied to classify or cluster a data set.
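As a rough illustration, here is how that clean-up step might look in Python with pandas; the file name and the ‘amount’ column are assumptions invented for the example.

```python
# A minimal data-preparation sketch using pandas.
# "transactions.csv" and the "amount" column are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")

# Remove exact duplicate records introduced during collection
df = df.drop_duplicates()

# Drop outliers in a numeric column using the interquartile range (IQR)
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```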


Once the data has been analysed and processed, the resulting insights can be passed on to the individuals who will use them to inform their decision making.


The challenges of data mining

Locating and gathering data

One of the major challenges organisations face when undertaking a data mining project is discovering and then connecting all of their different repositories of data.


In a modern enterprise, data is stored in spreadsheets, databases, ERP and accounting software, and on social media. This data comes in a variety of structured and unstructured formats, increasingly including data generated by IoT sensors and cameras.


In addition, data is often siloed in different parts of the business, meaning it can be a challenge to source all of the relevant and associated information to get a full picture of what the data represents. It can also be located in different types of infrastructure, including on-premises, private cloud and public cloud.


The raw data therefore needs to be located, then gathered in all its various formats. It then needs to be ingested into a central repository, or data lake, where it can be cleaned up and formatted before analytics tools can be put to work.

Removing errors and inconsistencies

Errors in the raw data, including duplications and mistakes introduced during the collection process, will generate unreliable results that could lead to poor decisions for the organisation. Preparing the raw data is therefore key, with all anomalies removed.


Another issue is the variety of formats in which data will be presented. As well as data from internal sources, there will be external data to deal with, including news feeds, stock and commodity prices, and currency exchange rates. These can all affect decisions a company makes when setting product prices, making investments or choosing a target market.


The fields in which the data is entered therefore need to be standardised to ensure the information can be effectively read by analytics and visualisation tools once ingested into the data lake.
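A hedged sketch of what that standardisation might look like in Python with pandas; the column names, date formats and currency values are invented for illustration.

```python
# Standardising mixed-format fields before ingestion (illustrative data).
import pandas as pd
from dateutil import parser

df = pd.DataFrame({
    "order_date": ["2023-01-05", "05/01/2023", "Jan 5 2023"],
    "price": ["£1,200", "1200.00", "£950"],
})

# Parse heterogeneous date strings into one datetime representation
df["order_date"] = df["order_date"].apply(lambda s: parser.parse(s, dayfirst=True))

# Strip currency symbols and thousands separators, then cast to float
df["price"] = df["price"].str.replace(r"[£,]", "", regex=True).astype(float)
```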

Manual processing

The data to be mined first needs to be transported, transformed and visualised. If any of these processes are manual, they are not only time consuming but also carry the risk of introducing new errors into the data.


Automating these processes reduces the chance of new errors and speeds things up, making it possible to generate insights more quickly, in some cases in real time.

Scalability

With the amount of data now available to organisations, scaling to process it all effectively can be another challenge. With on-premises data centres, it has historically been difficult for organisations, particularly small and medium-sized businesses, to expand their compute capacity easily. Often, this requires new hardware to be purchased, installed and maintained, something many organisations can’t justify.


Now, with cloud-based data storage and processing, organisations can pay to scale up compute capacity to deal with larger and more complex data sets. Once the data mining has been carried out, organisations can move the data to lower-cost storage and stop paying for the data processing.

Data security

Often data contains intellectual property, personally identifiable information, sales figures, accounts, and other confidential information. Data security is therefore vital — both while the data is at rest and while it is in use.


Data in use is located in active memory, where it is most vulnerable. One protection for data in this state is to use security tools that allow regions of memory, or enclaves, to be protected and only accessible to processes running inside the assigned enclave.


Another approach is federated learning, where organisations apply machine learning and AI algorithms to create and improve models without sharing datasets that include confidential information.
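A minimal sketch of the federated idea, using simple federated averaging on a linear model; the setup and data are assumptions, not a production recipe. Each party trains locally, and only model weights, never raw records, leave its premises.

```python
# Federated averaging sketch: clients share weights, not data.
import numpy as np

def local_update(weights, X, y, lr=0.01, epochs=10):
    """One client's gradient-descent steps on its own private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# Four organisations, each holding private data that never leaves its site
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]

global_w = np.zeros(3)
for _ in range(20):  # communication rounds
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)  # average the weights, not the data
print(global_w)
```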

Data mining techniques

There are various approaches to data mining, suited to different types of insights. For example, association rules are a rule-based method for determining relationships between data variables. This approach is often used in shopping basket analysis, helping companies to understand which products consumers buy together, which in turn drives cross-selling and recommendations.
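As a toy illustration of the association-rule idea, the sketch below computes the support and confidence of a hypothetical rule ‘bread → butter’ from a handful of invented baskets.

```python
# Support and confidence for one association rule (illustrative baskets).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(baskets)
support_bread = sum("bread" in b for b in baskets) / n             # 3/4
support_pair = sum({"bread", "butter"} <= b for b in baskets) / n  # 2/4
confidence = support_pair / support_bread                          # 2/3

print(f"support(bread & butter) = {support_pair:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")
```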


Neural networks are deep learning algorithms that process training data by mimicking the connections in the human brain using layers of nodes. Each node combines its inputs with a set of weights and a bias to produce an output. If the output value exceeds a given threshold, the node is activated and passes data to the next layer in the network.
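A toy sketch of that node behaviour: inputs weighted and summed with a bias, then passed through a threshold activation. The values are invented, not a trained model.

```python
# One artificial neuron with a simple threshold activation.
import numpy as np

def node(inputs, weights, bias, threshold=0.0):
    output = np.dot(inputs, weights) + bias        # weighted sum plus bias
    return output if output > threshold else 0.0   # fire, or stay silent

x = np.array([0.5, 0.8, 0.2])   # inputs from the previous layer
w = np.array([0.4, -0.1, 0.9])  # learned weights (invented here)
print(node(x, w, bias=0.1))     # value passed on to the next layer
```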


Decision trees use a visualisation that resembles the branches of a tree to show the potential outcomes of decisions, and can classify data or predict potential outcomes using classification or regression methods.
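For illustration, a minimal decision-tree classifier using scikit-learn; the features (age, income) and labels are invented.

```python
# Fitting and inspecting a small decision tree with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30_000], [40, 80_000], [35, 52_000], [50, 95_000]]
y = [0, 1, 0, 1]  # e.g. 1 = likely to buy, 0 = unlikely (invented)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the branches as text
print(tree.predict([[30, 60_000]]))                        # outcome for a new case
```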


Finally, the K-nearest neighbour (KNN) algorithm classifies data points based on their proximity and association to other data points. It assumes that similar data points lie close to each other, then calculates the distance between points to identify patterns in the data.
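A minimal KNN sketch in plain Python and NumPy, classifying a new point by majority vote among its k closest neighbours; the points and labels are invented.

```python
# K-nearest neighbour by hand: distance, sort, majority vote.
import numpy as np
from collections import Counter

points = np.array([[1.0, 1.1], [1.2, 0.9], [4.0, 4.2], [3.9, 3.8]])
labels = np.array(["A", "A", "B", "B"])

def knn_predict(query, k=3):
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    nearest = labels[np.argsort(dists)[:k]]         # labels of the k closest
    return Counter(nearest).most_common(1)[0][0]    # majority vote

print(knn_predict(np.array([1.1, 1.0])))  # -> "A"
```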


Data mining examples

Retail: Combining and analysing data from a customer’s browsing patterns and spending habits can help the retailer gain a deeper understanding of the types of customers visiting their sites and provide a more personal experience.


The company may want to provide different experiences for customers who spend a lot but visit infrequently, compared to those who spend a little but visit the website frequently.


Data mining techniques can help retailers cross-sell their products and increase revenue. For example, if a customer buys product A, they may be interested in a complementary or related product B. The same insight can also be used to offer that customer a similar alternative with a higher profit margin.


Data mining can also reveal a customer’s price elasticity: whether they will continue to buy a product or service if its price increases, and how likely they are to buy more if it costs less. Companies can use this to understand how their profits would be affected if they changed the price of a product.
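A worked toy example of the elasticity calculation: the percentage change in quantity sold divided by the percentage change in price, with invented figures.

```python
# Price elasticity of demand and the revenue impact of a price rise.
old_price, new_price = 10.0, 11.0   # price raised 10%
old_qty, new_qty = 1000, 940        # demand fell 6% (invented)

pct_price = (new_price - old_price) / old_price
pct_qty = (new_qty - old_qty) / old_qty
elasticity = pct_qty / pct_price    # -0.6: relatively inelastic demand

old_revenue = old_price * old_qty   # 10,000
new_revenue = new_price * new_qty   # 10,340: the rise paid off
print(elasticity, new_revenue - old_revenue)
```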


Insurance and finance: An insurance company might analyse data from customers applying for policies. If a customer fills in the form several times with different information to get the cheapest quote, that behaviour may be completely innocent. However, if the customer chooses options that contradict information already stored about them from a previous purchase, it can raise a red flag for further investigation.


The banking sector has for years been using AI to monitor customer transactional data to track spending habits such as amounts usually withdrawn at ATMs, or types of products bought using their credit card. If the AI sees a customer withdraw an unusual amount from an unexpected location, or tracks a credit card purchase that doesn’t fit their usual habits, it could indicate fraud.
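A hedged sketch of that kind of check: a withdrawal far from a customer’s historical mean, measured in standard deviations, is flagged for review. The transaction history and the threshold are assumptions.

```python
# Flagging unusual withdrawals with a simple z-score rule.
import numpy as np

history = np.array([40, 60, 50, 45, 55, 50, 60])  # usual ATM amounts (invented)
mean, std = history.mean(), history.std()

def is_suspicious(amount, threshold=3.0):
    z = abs(amount - mean) / std
    return z > threshold  # more than 3 standard deviations from normal

print(is_suspicious(55))   # False: within normal habits
print(is_suspicious(400))  # True: flag for investigation
```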


Financial institutions also commonly use data analysis to assess loan applicants. A potential customer’s payment history, payment-to-income ratio and credit history can be used to determine the risk of granting the loan and to help set loan terms and interest rates.
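As a simplified sketch, a logistic-regression risk score could be fitted on such features; the two features here (a payment-history score and a debt-to-income ratio) and the tiny dataset are invented.

```python
# A toy loan-risk score using logistic regression.
from sklearn.linear_model import LogisticRegression

X = [[0.9, 0.20], [0.4, 0.55], [0.8, 0.30], [0.3, 0.60], [0.7, 0.25]]
y = [0, 1, 0, 1, 0]  # 1 = defaulted, 0 = repaid (invented labels)

model = LogisticRegression().fit(X, y)
applicant = [[0.6, 0.40]]
print(model.predict_proba(applicant)[0][1])  # estimated default risk
```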


The more data that is collected, the easier it becomes to distinguish between ‘normal’ behaviour and suspicious activities that might warrant investigation.


Agriculture: Data mining tools can also be used by farming businesses growing crops or other produce. By gathering and analysing data such as irrigation levels, hours of sunshine, exposure to wind, nutrients (both naturally occurring in the soil and added) and the risk of crops being eaten or damaged by wildlife, farmers should be able to predict the yield of whatever they are growing. They can also identify areas where changes would let them produce more crops, more quickly.
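A small sketch of yield prediction framed as a regression problem; the feature columns (irrigation, sunshine hours, nutrient index) and the figures are illustrative assumptions.

```python
# Predicting crop yield from environmental features with linear regression.
from sklearn.linear_model import LinearRegression

X = [[30, 200, 5], [45, 180, 7], [25, 220, 4], [50, 160, 8]]
y = [3.1, 3.8, 2.9, 4.0]  # tonnes per hectare (invented)

model = LinearRegression().fit(X, y)
print(model.predict([[40, 190, 6]]))  # expected yield for a new field
```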


Complex operations: Data mining techniques can also be used to improve operational processes, such as identifying costly or time-consuming bottlenecks, inefficient processes and supply chain issues, or improving decision making. Sometimes referred to as ‘process mining’, this approach can also be used to monitor processes, measure improvements, assist compliance and analyse many different functions, including contact centres.