Data Mining: What Can Data Conceal?
In recent times, there has been a lot of talk about the rise of artificial intelligence (AI) and its potential impacts, mainly due to the breakthrough of ChatGPT. Almost everyone has questions:
- How will AI affect our lives?
- Will it really take my job?
- How can I participate in the AI revolution?
- How can my company operate more efficiently with the help of AI?
Meanwhile, it is also widely discussed that AI did not start with ChatGPT. We have talked about this in our podcast, at the University of Pécs' Artificial Intelligence Professional Day, and at various corporate events at the request of interested clients.
The terminology used in this field can easily lead to misunderstandings. That's why we are launching this article series, in which we will try to compile and explain the most important professional terms in the field from a business perspective.
What is Data Mining?
In our first article, we will explore data mining. Many of today's artificial intelligence experts started their careers as data miners, and there are many overlaps between the various concepts.
The goal of data mining is to extract valuable information and patterns from large datasets. In the business world, this means finding something in the data that can be used to support business objectives, such as:
- We gain a better understanding of our customers and can communicate with them more effectively.
- We understand customer behavior and can respond to their needs in a timely manner.
- We identify suspicious credit card transactions and prevent potential fraud.
In contrast to traditional statistical methods, the size of the dataset is a crucial aspect here. Data mining techniques and algorithms are characterized by their efficiency with large volumes of data. For example, they can analyze millions of purchase transactions or customer interactions in full, without having to take smaller samples and risk losing relevant information in the process.
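To make the idea of working on the full dataset more concrete, here is a minimal sketch in Python. It aggregates spend per customer in a single pass over a stream of transactions, so no sampling is needed and the raw records never have to fit in memory at once. The record shape, the function names, and the generated demo data are all our own assumptions for illustration, not part of any particular tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape: (customer_id, purchase_amount).
Transaction = Tuple[str, float]


def spend_per_customer(transactions: Iterable[Transaction]) -> Dict[str, float]:
    """Aggregate total spend per customer in a single pass over the data."""
    totals: Dict[str, float] = defaultdict(float)
    for customer_id, amount in transactions:
        totals[customer_id] += amount
    return dict(totals)


def demo_transactions(n: int) -> Iterator[Transaction]:
    """Yield n made-up transactions lazily, one at a time."""
    for i in range(n):
        yield (f"customer-{i % 1000}", 10.0 + (i % 7))


# Every transaction is processed; only the per-customer totals are kept.
totals = spend_per_customer(demo_transactions(100_000))
print(len(totals), "customers aggregated from 100,000 transactions")
```

In a real project the generator would typically be replaced by a database cursor or a file reader, but the single-pass pattern stays the same.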
How Can I Use Data Mining in Business?
Data mining has proven its value in almost every sector of business:
- Optimizing Existing Processes
- Developing New Processes
- Establishing New Revenue Streams
Characteristics of Data Mining Projects
It is also important to highlight another characteristic of data mining: sometimes, we work with data that, due to its size, was not previously used by the organization, making it difficult to predict whether it contains valuable information. This is particularly significant during the preliminary return-on-investment analysis of the project, as we cannot always estimate the expected profit, while the costs are immediately apparent.
As a result, the ROI (Return on Investment) calculation cannot always be performed reliably. Fortunately, there is now a wealth of experience in data mining projects, so with the right expertise, we can provide an estimate that is sufficiently accurate for a project's approval process. However, it is important to emphasize that the organization must understand the characteristics of data mining projects, and its processes must be prepared to handle them appropriately.
But where do I store all this data?
The prerequisite for data mining activities is having access to data. There are several approaches to this:
- Data Lake: Large amounts of collected data (e.g., from source systems or the web) can be loaded into a data storage layer where they can be managed efficiently and cost-effectively.
- Data Warehouse: Generally, large enterprises create data warehouses to systematically and structurally store essential data available within the company.
- Data Mart: A data storage layer that supports a specific business function of the company, where data is specifically prepared for reporting, analysis, and data mining purposes for that particular business area.
The relationship between data mining and project management
There are several approaches to managing data mining projects. One of the most widespread is the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which many organizations follow with minor to significant modifications.
One of the main advantages of this methodology is that it provides a framework for managing data mining projects that can be applied regardless of the area of application.
The methodology consists of six phases, through which the project progresses iteratively. This means that a phase may conclude with results that necessitate returning to a previous phase and proceeding again through the appropriate phases:
- Identifying business objectives (Business Understanding)
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
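The iterative nature of the six phases can be illustrated with a small Python sketch. This is only our simplified model of the idea, not an official CRISP-DM implementation: the `run_project` function, the handler convention, and the example scenario are hypothetical.

```python
from typing import Callable, Dict, List, Optional

PHASES: List[str] = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# A handler returns None to move forward, or the name of a phase to return to.
PhaseHandler = Callable[[], Optional[str]]


def run_project(handlers: Dict[str, PhaseHandler], max_steps: int = 50) -> List[str]:
    """Walk through the phases in order, stepping back whenever a handler asks to."""
    visited: List[str] = []
    index = 0
    # max_steps guards against a handler that keeps sending the project back.
    while index < len(PHASES) and len(visited) < max_steps:
        phase = PHASES[index]
        visited.append(phase)
        handler = handlers.get(phase)
        go_back_to = handler() if handler else None
        if go_back_to is not None:
            index = PHASES.index(go_back_to)
        else:
            index += 1
    return visited


# Hypothetical scenario: the first evaluation shows the data needs more preparation.
evaluation_rounds = {"count": 0}


def evaluate() -> Optional[str]:
    evaluation_rounds["count"] += 1
    return "Data Preparation" if evaluation_rounds["count"] == 1 else None


history = run_project({"Evaluation": evaluate})
print(" -> ".join(history))
```

In this scenario the project passes through Data Preparation, Modeling, and Evaluation twice before it reaches Deployment, which is exactly the kind of loop that project plans and budgets need to allow for.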
How can one verify that complex logic is functioning correctly?
Data mining is characterized by processing large volumes of data that cannot be easily reviewed in tools like Excel. Therefore, it is often necessary to use Big Data solutions for effective data management. Additionally, data mining procedures often involve complex mathematical algorithms that may not always be easily interpreted from a business perspective.
Equally important, yet often neglected, is the task of verifying and testing data mining models and the data assets they process. Questions worth asking include:
- Have all available and relevant resources been utilized?
- Is the quality of the data used adequate?
- Is the model's result relevant from the perspective of the actual business objectives?
- Could the application of the model have unintended negative consequences for the business operations?
By applying appropriate testing and validation methodologies, unintended negative consequences of data mining results can be avoided.
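As one small example of what such validation can look like, the sketch below addresses only the data quality question: it reports the share of missing values per required field and counts fully duplicated rows. The `quality_report` function, the field names, and the sample records are assumptions made for illustration. Real projects usually need far more checks than this.

```python
from typing import Any, Dict, Sequence

Record = Dict[str, Any]


def quality_report(records: Sequence[Record], required: Sequence[str]) -> Dict[str, Any]:
    """Return the missing-value rate per required field and the number of duplicate rows."""
    total = len(records)
    missing_rate: Dict[str, float] = {}
    for field in required:
        # A value counts as missing if the field is absent, None, or an empty string.
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        missing_rate[field] = missing / total if total else 0.0

    seen = set()
    duplicates = 0
    for r in records:
        # Build a hashable fingerprint of the whole row to detect exact repeats.
        key = tuple(sorted((k, repr(v)) for k, v in r.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": total, "missing_rate": missing_rate, "duplicates": duplicates}


# Made-up sample data: one missing amount, one empty customer ID, one repeated row.
sample = [
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c2", "amount": None},
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "", "amount": 35.5},
]
report = quality_report(sample, required=["customer_id", "amount"])
print(report)
```

A report like this can be produced before modeling starts and again before deployment, so that a drop in data quality is noticed before it affects business decisions.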
What tools do data miners use?
During data mining, a variety of tools can be used; however, some are more widely adopted.
Among programming languages:
- SQL: Provides excellent capabilities for data extraction and potential transformation.
- Python and/or R: These can cover the entire data mining process, but many use them alongside SQL for data manipulation, model development, and validation. Of the two, Python has become the more dominant in recent years.
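The division of labor between SQL and Python can be sketched as follows. We use Python's built-in sqlite3 module with an in-memory database purely for illustration. The `purchases` table, its columns, and the "above average" rule are made-up assumptions, not a recommended segmentation method.

```python
import sqlite3
from statistics import mean

# In-memory demo database with a hypothetical purchases table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("c1", 20.0), ("c1", 35.0), ("c2", 15.0), ("c3", 400.0), ("c3", 380.0)],
)

# SQL handles extraction and aggregation.
rows = conn.execute(
    "SELECT customer_id, SUM(amount) AS total FROM purchases "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
conn.close()

# Python handles the analysis: flag customers whose total is above the average total.
average_total = mean(total for _, total in rows)
high_value = [customer for customer, total in rows if total > average_total]
print(high_value)
```

In practice the query would run against a data warehouse or data mart rather than SQLite, and the Python side would usually involve a proper model instead of a simple threshold.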
Low code / No code tools:
In addition, all relevant tools for performing data mining activities are also available in the service offerings of major cloud providers:
How does data mining relate to Generative Artificial Intelligence?
We often tell business representatives that they don't need to worry too much about whether a solution to a business problem falls under data mining, machine learning, or artificial intelligence. These areas overlap, and in practice it rarely matters which of them delivers the business value.
Generative artificial intelligence (e.g., ChatGPT) has, however, heavily benefited from both text mining and machine learning. From text mining, Natural Language Processing (NLP) has provided essential foundations, while from machine learning, neural network approaches have laid much of the groundwork. This combination enables engaging conversations today with systems like ChatGPT (OpenAI) or Gemini (Google).