Karri Lehtonen 07.09.2018 32 min read

Data Quality: “No Garbage In, No Garbage Out”

Our society runs on information. Without accurate and reliable information, our world would become dysfunctional; frankly, everything would pretty quickly come to a grinding halt. This is why data, and data quality, is more important than one might think. The information we rely on in our daily lives is compiled from various data and data sources.

Nevertheless, data in its raw form is not particularly useful. Data scientists first need to "translate" the data into insights so that business users can make better decisions with it.

In today's connected world, organizations gather unbelievable amounts of data that is then analyzed for various purposes. However, big data has its advantages and disadvantages: some of the data is of good quality, while some is of low quality. Generating insights becomes difficult when the data is incorrect, inaccurate, or missing specific data fields. Having data at our disposal is simply not enough; it has to be flawless, 100% clean, high-quality data, which is why data integrity is crucial.

In this blog, we discuss the topic of data quality and how data quality can be improved using iPaaS solutions that connect with various data sources.


What is Data Quality?

Data quality is not a strictly defined, objective term. By data quality, we instead mean a subjective assessment of the various factors that affect the quality of the data at hand. When organizations define what data quality means for them, they take into consideration how they intend to use the data and identify which aspects of the data they find vital for extracting valuable insights from the available information. In other words, data quality describes how well the data fits its intended use.

These assessment factors may include quantitative measures such as the timeliness, accuracy, and completeness of the data. They may also include qualitative factors such as the reliability or relevance of the data. The combination of the relevant factors defines data quality under the given circumstances.

The subjective nature of data quality becomes obvious when different organizations or people assess the same set of data. Data that seems like “good quality data” to one party may be considered “poor quality data” or even “total rubbish” by someone else. The data itself may be identical; only the context it relates to, or the scales of the assessment criteria used by each party, differ. This can sometimes lead to disagreements about data quality within an organization or between business partners.

Why is good data quality so important?

Most modern businesses rely on information and insights for decision-making. If the data at the foundation of the decision-making process is of poor quality, the resulting information becomes inaccurate or completely incorrect, and that causes problems. Relying on inexact information misleads decision-makers. It can also affect an organization's productivity, competitiveness, or even compliance, all of which results in lost revenue and can be fatal for a company. There are numerous examples of companies that ignored the obvious because they did not have accurate insights available. If you want to read more about this, we suggest the SAS whitepaper "When bad data happens to good companies", one of my personal favorites.

But it's not only human decision-making that is affected by data quality. With the rise of new technologies like artificial intelligence and machine learning, data quality is more critical than ever before. For these technologies to function as intended, the input data must be of impeccable quality.

Data quality management

Data quality management consists of multiple components that are combined to achieve the desired outcomes. It is an ongoing process that can be described as a loop. Data quality management projects typically start with the identification and definition of the desired data quality objectives, i.e., what kind of results the organization wants or needs to achieve. After that, the data quality enhancement methods are selected; these can include, for example, data cleansing or data harmonization steps. The related business rules must be defined and performance targets set. Once the data quality management process is in place, its results need to be measured, and those results are then often used as the basis for further data quality improvement efforts.
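To make the "measure" step of this loop a little more concrete, here is a minimal sketch in Python. The field names, validation rules, and the 98% target are purely illustrative assumptions, not a prescription for any particular organization:

```python
# A minimal sketch of the "measure" step of the data quality loop: score a batch
# of records against illustrative rules and compare the result to a target.
# All field names, rules, and thresholds below are assumptions for illustration.
import re

RULES = {
    "email": lambda v: bool(v) and re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v) is not None,
    "country": lambda v: v in {"FI", "SE", "DE", "US"},          # assumed reference list
    "order_total": lambda v: isinstance(v, (int, float)) and v >= 0,
}
TARGET = 0.98  # assumed performance target: 98% of checks must pass

def quality_score(records):
    """Return the share of rule checks that pass across all records."""
    checks = [rule(rec.get(field)) for rec in records for field, rule in RULES.items()]
    return sum(checks) / len(checks) if checks else 1.0

batch = [
    {"email": "anna@example.com", "country": "FI", "order_total": 120.0},
    {"email": "not-an-email", "country": "XX", "order_total": -5},
]
score = quality_score(batch)
print(f"quality score {score:.2%} - target {'met' if score >= TARGET else 'missed'}")
```

Measuring the score on every batch turns the loop into something you can track over time and use to decide where the next improvement effort should go.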

Many software applications have some basic data quality management functionality built in, such as rules for error detection or the removal of duplicate records. The shortfall of these features is that they are quite limited and can only handle the data within that particular application.

There are also specialized software tools that have been designed for data quality management. Such software can typically operate across multiple applications and databases, but these solutions have traditionally been relatively expensive and cumbersome to deploy, so generally only large organizations have chosen to purchase and use them. Even these tools have their limitations: they are, by nature, quite reactive. They fix data quality issues when they happen, but they do not prevent them from happening again.

While traditional data quality management tools may not be commercially feasible for smaller organizations, there are other means for data quality management. Many mid-size organizations use a centralized data repository to achieve a “single source of truth” across all their applications. This approach is referred to as Master Data Management (MDM). A centralized master database should guarantee at least the consistency of data across the organization, but it does not automatically ensure data quality: centralized data can be of just as poor quality as decentralized data.

An iPaaS solution can be used to tackle all the above-mentioned data quality management issues. Many organizations have started to use an iPaaS solution in combination with a master data management system to ensure data quality across all their applications. The iPaaS solution integrates all business applications to attain a “single source of truth”, and it ensures the quality of the data by connecting the master data management system with various internal and external data sources. Combining the different sources helps to ensure that all the data is complete, enhanced, and, most importantly, 100% correct. In this setup, the iPaaS solution serves as both the data input tool and the data distribution tool. Typically, various business rules are built and deployed within the iPaaS layer to resolve discrepancies and conflicts between datasets. An iPaaS solution can also be configured to read the master data management system and update it with new data based on the agreed business rules.
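As an illustration of what such a business rule might look like, here is a hedged sketch in Python. The field names, the "who owns which field" split, and the resolution logic are assumptions made for the example; they are not a description of any particular iPaaS product or customer setup:

```python
# Illustrative sketch of a conflict-resolution rule an integration layer might
# apply when the master data record and an incoming record disagree: trust the
# master for identity fields, accept fresher values from the source system for
# volatile fields, and flag real conflicts for review. Names are assumptions.
MASTER_OWNED = {"customer_id", "legal_name", "vat_number"}   # master data wins
SOURCE_OWNED = {"email", "phone", "delivery_address"}        # latest source wins

def resolve(master: dict, incoming: dict) -> tuple[dict, list[str]]:
    merged, conflicts = dict(master), []
    for field, value in incoming.items():
        if field in MASTER_OWNED:
            if master.get(field) not in (None, "", value):
                conflicts.append(field)          # report, do not overwrite master
        elif field in SOURCE_OWNED or field not in master:
            merged[field] = value                # enrich or refresh from the source
    return merged, conflicts

master = {"customer_id": "C-1001", "legal_name": "Acme Oy", "email": "old@acme.fi"}
incoming = {"customer_id": "C-1001", "legal_name": "ACME Ltd", "email": "new@acme.fi"}
record, issues = resolve(master, incoming)
print(record["email"], issues)   # new@acme.fi ['legal_name']
```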

Data quality issues

General data quality issues haven't really changed over the past few decades. Data quality has been plagued by inconsistent, inaccurate, incomplete (or missing), and invalid data, and the list goes on. While all these “old data quality issues” remain relevant today, some relatively “new data quality issues” have also emerged. For example, today's system architecture, with its various cloud-based software solutions, no longer fits the traditional data quality management approach.

Data sources have also become more and more complex over time. We now use voice recognition, chatbots, natural language processing tools, and other unstructured data sources that have a direct impact on the data quality.

One of the significant issues concerning data quality today is the sheer volume of data that needs to be analyzed. Data is often collected at an extremely rapid pace and in huge volumes, which can make data quality management with traditional methods extremely challenging. Manual processes are no longer an option for improving data quality; all data quality management procedures must be automated for operational efficiency.

The connected nature of our world also imposes new data quality issues. Organizations have no real control over the various data sources. At the same time, they need to utilize the data they receive (or, in the worst-case scenario, they do not even receive) from their business partners. Previously, there was very little a company could do to improve the quality of the data that came from external sources. Luckily, times are changing.

An iPaaS solution can be used to overcome these challenges in a practical way. A modern hybrid integration platform can easily integrate old legacy systems as well as new cloud-based systems and data sources. An iPaaS solution allows organizations to run data quality checks and quality improvement steps while moving data from one system to another, whether the data source is internal or external. If data from a source falls outside the selected quality criteria, it can be rejected before it enters any other system in the mix. Likewise, when a data source provides incomplete data, that data can be completed in the iPaaS layer before it is distributed further. In a way, the iPaaS acts as a firewall that blocks poor-quality data before it enters any business application.
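To make the firewall idea more tangible, the following sketch shows the kind of accept/complete/reject logic that could live in the integration layer. The required fields and default values are assumptions chosen purely for illustration:

```python
# A minimal sketch of the "firewall" idea: records failing hard checks are
# rejected before they reach downstream systems, records with fixable gaps are
# completed in the integration layer, and only clean records are passed on.
REQUIRED = {"order_id", "sku", "quantity"}
DEFAULTS = {"currency": "EUR"}          # assumed completion rule

def firewall(record: dict):
    missing = REQUIRED - record.keys()
    if missing or record.get("quantity", 0) <= 0:
        return None, f"rejected: missing={sorted(missing)}"     # blocked at the firewall
    completed = {**DEFAULTS, **record}                          # fill optional gaps
    return completed, "accepted"

for rec in [{"order_id": "A1", "sku": "P-9", "quantity": 3},
            {"order_id": "A2", "sku": "P-9"}]:
    clean, status = firewall(rec)
    print(status, clean)
```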

Data Cleansing

Data cleansing means detecting and identifying errors in data and then correcting them, or sometimes even removing the affected data from the data set altogether. The goal of the process is to ensure that each data set is consistent with other similar data sets. So, with data cleansing, the input data is not validated and/or rejected at entry; rather, data cleansing is applied to batches of data to ensure consistency. The reasons for data inconsistency vary, but often they boil down to user errors at the data entry point. Sometimes data may also get corrupted within a system or during transmission from one system to another.

Data cleansing may include correcting data by removing typing errors or replacing inconsistent entry values with values that are acceptable to the system. The data correction rules may be “strict”, which typically results in the rejection of the data if a record is missing or corrupted, or “fuzzy”, in which case an acceptable value replaces the missing or corrupted one. Data cleansing may also mean that the original data is enhanced by adding related data from another data source, that differing terminology within the data set is harmonized, and/or that the data is standardized by transforming all information into one selected standard (e.g., UNSPSC).
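The difference between “strict” and “fuzzy” correction rules can be illustrated with a short sketch. The allowed value list here is an assumption; a real cleansing tool would of course use far richer reference data:

```python
# Sketch of "strict" vs "fuzzy" correction rules using difflib from the
# standard library; the allowed country list is an assumption for illustration.
from difflib import get_close_matches

ALLOWED_COUNTRIES = ["Finland", "Sweden", "Germany", "United States"]

def cleanse_country(value: str, mode: str = "fuzzy"):
    value = value.strip().title()                       # fix casing / stray spaces
    if value in ALLOWED_COUNTRIES:
        return value
    if mode == "strict":
        raise ValueError(f"unacceptable value: {value!r}")   # strict: reject the record
    match = get_close_matches(value, ALLOWED_COUNTRIES, n=1, cutoff=0.6)
    return match[0] if match else None                  # fuzzy: replace or blank out

print(cleanse_country(" finland "))        # Finland
print(cleanse_country("Germny"))           # Germany (typo corrected)
```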

Data Quality Control and Data Quality Assurance

Data quality control focuses on controlling how the data is used by an application. It is applied both before and after the relevant data quality assurance methods. Before data quality assurance, it is used to restrict the input of incorrect data; after data quality assurance, it is used to identify inconsistent, inaccurate, incomplete, or missing output data. Data quality control catches the issues and exceptions that data quality assurance operations leave undiscovered. Data quality controls become redundant if and when business logic covers the same functionality and fulfills the same purpose.

Data quality assurance means the combination of efforts used to keep data quality at the desired level. In other words, it is the way to prevent incorrect data from entering a system and producing further inaccurate information. Data quality assurance differs from data quality control in that the latter focuses on checking the quality of the data, whereas data quality assurance focuses on the quality process itself. It consists of several steps, including measuring the data, comparing it against selected standards, monitoring the process, and providing feedback that can be used for error correction and further process improvements. Data quality assurance is often associated with a broader quality system such as ISO 9000.
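For illustration, the sketch below wraps an assumed processing step with an input control and an output control, showing where the two control points sit relative to the process itself. The field names and checks are hypothetical:

```python
# Hypothetical sketch of control points before and after a processing step:
# an input control that restricts bad data at entry, and an output control
# that flags issues the process itself did not catch.
def input_control(record: dict) -> bool:
    return bool(record.get("invoice_id")) and record.get("amount") is not None

def output_control(record: dict) -> list[str]:
    issues = []
    if record["amount"] < 0:
        issues.append("negative amount")
    if record.get("currency") is None:
        issues.append("missing currency")
    return issues

def process(record: dict) -> dict:
    # stand-in for the actual business logic / quality assurance steps
    return {**record, "amount": round(record["amount"], 2)}

rec = {"invoice_id": "INV-1", "amount": -10.005}
if input_control(rec):
    out = process(rec)
    print(output_control(out) or "ok")   # ['negative amount', 'missing currency']
```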

Data Quality and Data Cleansing Tools

As mentioned above, individual software applications may have built-in data control tools, but generic data quality tools have also been used to help strengthen the quality of customer, product, or financial data. Such tools may include various levels of the following data quality functionalities: data cleansing, standardization, parsing, matching data with related data sets, enrichment, monitoring, and data quality analysis. Traditionally, data quality tools have been provided by large software vendors such as IBM, SAS, and Informatica. (If you are looking for data quality tools, check out Gartner's Magic Quadrant for Data Quality Tools.) This is probably why data quality tools have been seen as relatively expensive and not easily accessible for smaller companies.

Data cleansing tools are a more specialized subset of data quality tools, focusing mainly on identifying incomplete, incorrect, and/or inaccurate data and then replacing, modifying, or deleting it in a database, thereby cleansing the data. Data cleansing solutions are often more modern and do not require deployment projects as extensive as traditional data quality software does, which makes them more accessible to smaller companies as well.

However, as data quality has become more and more critical, some data integration solutions now include data quality management functionalities. A modern iPaaS solution is a perfect tool for parsing, harmonizing, enriching, monitoring, and cleansing the data moving through it. Business rules can be built into the integration layer to harmonize the data even before it enters a database or a system. With an iPaaS solution, companies in fact get two solutions for the price of one: an integration solution as well as a data quality tool. This is one of the many additional advantages that a modern cloud-based integration platform provides.

Data Quality Firewall

There is probably no enterprise engaged in international trade today that hasn't established EDI connections to exchange data with its trading partners. These connections are often point-to-point integrations with managed file transfer. The technology used to build this connectivity is in most cases relatively old and therefore quite limited in functionality. This is the main reason why these technologies typically cannot be used for improving data quality.

Data quality is an enormous challenge for all companies that want to utilize the information at their disposal. Low-quality data is almost completely useless; it is hard to generate insights from information that is missing data fields or contains errors. Instead of putting the data to work, companies spend a lot of time making manual fixes. In the end, they still have bad data, and they struggle to overcome the inefficiencies caused by errors and missing fields.

Low-quality data also harms your back-end processes, as you will have to do a lot of manual work to fix problems with the data. Manual processes are incredibly time-consuming, and they make it easy to make mistakes or sometimes even to commit fraud.

Poor-quality data costs you more than just the working hours your staff spends improving it. What is even more important, and often overlooked, is that bad data may also result in a second-class customer experience. In a world where customer experience is crucial, you simply can't afford to risk your reputation among your customers. The same goes for your ecosystem: bad data may also harm your relationships with your trading partners, as you may not be able to cooperate adequately when the information you have is incorrect.

So what are your options?

You may have thought of ditching all your current technologies and investing millions in application development. If only it were that simple. It would be a long and painful project, and you could never be sure about the results.

Instead, investing in application and system integration can be an option. Look for a solution that doesn't only offer system and data integration but also comes with capabilities for improving data quality. In the end, implementing an integration layer that takes care of connectivity and data transmission will be faster and cheaper than revamping your entire technology backbone.

We have come across this same issue in many of our customer cases. Most of these customers have been operating on top of legacy technologies that were far too business-critical to their operations to simply start replacing entirely.

Besides the legacy systems, they have also been struggling with the poor-quality data they receive through their traditional EDI connections. Instead of completely changing their systems, they realized that a cloud-based integration solution could be the remedy for their pain. The solution can be implemented in the cloud, where it complements their existing technologies and EDI connections and provides them with a “data quality firewall” that eliminates their data quality issues.

We will come to the benefits of the data quality firewall in a minute. You may ask how a new integration layer is going to impact your existing connectivity and data integration solutions. The answer is simple: it can merely complement them, or it can replace them altogether. Even if you decide one day to change your systems or applications, the connectivity solution can easily be migrated along with them.

Let's get back to the topic of the data quality firewall and how it fits into the bigger picture of integrations. The Youredi data quality firewall is part of the connectivity solution whenever the customer needs it. We set up technical and business rules as part of the integrations. These rules are meant to ensure that your data is always of the highest quality. Think of it as a side step in the flow where we validate all the information against your business rules. Once we have validated the information as requested, we make sure that the data is enriched. The enrichment could happen simply by forwarding the messages back to the sender, or we could use the master data to complete all the required information.
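As a simplified, hypothetical illustration of that validate-then-enrich step (the field names and the in-memory master data lookup are assumptions made for the example; in reality this logic lives inside the integration flows):

```python
# Hypothetical sketch of a validate-then-enrich step in an integration flow:
# messages failing an assumed business rule go back to the sender, valid but
# incomplete messages are completed from master data before being forwarded.
MASTER_DATA = {"P-9": {"description": "Pallet, EUR size", "product_group": "Packaging"}}

def process_message(msg: dict) -> dict:
    # 1) validate the incoming message against an assumed business rule
    if msg.get("sku") not in MASTER_DATA:
        return {"status": "returned_to_sender", "reason": "unknown SKU", "message": msg}
    # 2) enrich the message from master data before distributing it onward
    enriched = {**MASTER_DATA[msg["sku"]], **msg}
    return {"status": "forwarded", "message": enriched}

print(process_message({"sku": "P-9", "quantity": 10}))
print(process_message({"sku": "P-404", "quantity": 1}))
```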

It's almost a no-brainer, but it's worth mentioning that, just like in all integration cases, in this one too the customer owns all the data. To ensure the security of the data, we may apply encryption at several points during the data transfer.

It can also be challenging to work with certain legacy technologies. With hybrid integration solutions, you don't usually have to make any modifications to your existing systems. You may leave them “as is”, and as a vendor specialized in developing and deploying all types of integrations (yes, also hybrid ones), we can take care of the connectivity and of setting up the data quality firewalls.

The end result will be a well-connected enterprise that has also ensured that the data flowing into its systems is always of impeccable quality.

You have a lot of data - but is it good quality?

The never-ending expansion of the volume, velocity, and variety of data, as well as the ever-increasing importance of data quality, puts more and more pressure on data quality management. Data must be “fit for purpose” and also “right the first time” (meaning that mistakes need to be eliminated from the equation so that they are not repeated over and over again). Data quality should always be considered and addressed when an organization makes decisions about its IT infrastructure. iPaaS technology provides probably the most cost-efficient way to ensure premium data quality for all enterprise applications. Additionally, iPaaS helps to improve the quality of data originating from third-party sources (such as suppliers, trading partners, and customers).

One can say that good system integration solutions and sound data quality go hand in hand. Therefore, when implementing and using a system integration solution, organizations should always develop a data quality strategy that supports their data quality objectives. The data quality strategy can be an essential part of your integration strategy, and with a modern system integration solution it can be put into practice effectively and with minimal effort. Integration platform solutions also ensure the required data quality continuously: they are not one-time data cleansing efforts but provide ongoing data quality management, which should be the objective for all organizations.

After all, no successful organization can afford to live in a “garbage in, garbage out” situation anymore.

Hopefully, you've found this blog useful, and we were able to put iPaaS into a new perspective as a great tool for fixing data quality issues. Do you want to learn more about integration platforms? Get our ebook now:

Download the eBook

Karri Lehtonen

Data integration expert
