In the first part of this two-part series, I covered the mostly non-technical aspects of building a data culture. While the lion’s share of the work will be getting people to work together and embrace ever deeper use of data, as a reader of Simple-Talk, a lot of this transition will be technical.
In this article, I will continue the discussion of building a culture of data by enumerating and discussing the technical aspects that need to be addressed. These aren’t all exactly culture issues, but a positive data culture will almost certainly include these items and enable them. The main things that need to be solved aren’t technology issues, but these are some obvious items that will make the path easier.
A Note on Data Democratization
Data democratization is all about making data more accessible to users, and it seems like it is exactly what was covered in part one, and again in this article. But it is not the same thing as a culture of data. A lot of users that are part of this culture of data may not know anything about the data itself, other than using reports given to them as opposed to having unfettered access to the data itself.
I think enterprises need to tread carefully if they have a goal of data democratization. Many very technical users don’t use data in a way that I would define as correct. They have trouble with basic concepts if their main focus isn’t data. This is quite a lot to expect from the rest of the enterprise. Managing expectations of both end-users and management is key for this to be successful.
Technical Standards
Common patterns and approaches for specific problems fit very well with data projects. Creating template packages and approaches for ETL ensures that solutions are consistent. This is very similar to the argument for standardizing coding style. It is much more difficult to maintain ETL that uses different patterns for the same problem. A small set of templates and patterns is easier in many ways. It also allows junior-level developers to become productive faster.
The tools used between departments and especially between different business units in enterprises will differ. The important thing is having a standard process and a standard set of tools within departments. These tools should be a subset of the tools used at an enterprise level. It also helps to have a list of preferred tools and tools that are currently approved but will be deprecated in the future. You don’t want teams using outdated tools for new projects.
Non-functional requirements (NFR) are another item that should be standardized. NFRs include the style items mentioned above but also include things like performance requirements, security requirements, general usability, and scalability requirements. NFRs won’t be the same for every project, but they can generally be grouped into categories. Enterprise-level teams can help set the standards and the possible architecture to meet some of those standards.
Data Discoverability
To be useful, data needs to be discoverable. This is especially true if you are striving for data democratization. Many enterprises use a system of asking around until they find someone who knows someone who might understand the data. If that sentence is confusing, you know how it feels to use this process. A business glossary and data dictionary tool are required to make data discoverable. They allow users to find data but also help data professionals find data and reduce data duplication. This is a location to record business definitions, technical definitions, and algorithms, record data owners, and note responsible teams. Some also record data lineage as attributes move from system to system and do some basic data profiling. There isn’t a good substitute for a good data dictionary system.
Data discoverability is a tremendous amount of work. It also includes enforcing naming standards and expectations from the business. This has immediate benefits for users, but it also improves later system integrations. Whether this is for company mergers or system-to-system integrations, the quality of data has a direct impact on how well this works. Sources for data need to be tracked down. Data elements need to be defined. Systems need to be mapped completely if they are to be merged. If that work has already been done it makes the integrations much easier.
Reducing Duplication
Another issue is data duplicated all over the place. Not duplicated in the sense of multiple rows in a table with the same data, but the same table in multiple databases or systems. If you have a plan in place for sharing data, it is easier to avoid this issue. Whether that is a virtualization platform, a monolithic warehouse, or a cohesive set of data lakes or lake houses, a plan for decreasing data duplication is imperative.
Duplicated data is costly from a resource perspective, both hardware and development time. It is also confusing for users when the same (or worse,) similar data is available in multiple locations. You also have to fix errors in all of these systems instead of one or two. You also need to be sure the data is secure in every location. If one system has lax security, it doesn’t matter how strict you are in the primary system. That data is available.
Error Management
A strategy needs to be in place for error management when you are serious about data quality. Data needs to be reliable and trusted. This needs to cover a few areas.
ETL Errors
The easiest type of error to manage is simply reporting or exposing data movement errors, ETL errors. These will be explicit errors that cause the ETL to fail. The errors can be due to network issues, bad passwords, bad or duplicate data where it isn’t expected, outdated metadata, deployment errors, or any number of other technical issues. Watching for these errors and correcting them quickly is a good sign of a mature data culture.
Out of Date Data
Data that doesn’t meet the movement SLA (data “freshness”) is a little more complicated to manage. It is more difficult to notice that something didn’t run versus an explicit error unless you are actively watching for missing or outdated data. Validation scripts can be used, but they add extra load on the servers. If you are already doing this for regression testing or this purpose, then you have the problem solved. A lighter-weight solution is to record when each ETL runs and compare that to the acceptable execution window. This can be exposed in a query or report.
Poor Data From the Source
Invalid data can be very difficult to deal with. These are things like a start date on a column after the finish date, negative numbers, or invalid values that should be in a list. This type of data error leads to cleanup tasks later. Ideally, additional quality checks also should be added at the start of the process. Adding these checks to the UI when the system is created prevents the errors from happening and ensures downstream systems don’t have to deal with them. A less desirable solution is to create exceptions for particular data errors. This tends to be a sloppy solution since every team using that data will need to implement the same exceptions. It requires additional programming or changes in multiple areas. It’s much easier to define your rules and insist on quality data from the start of the process.
If your data is from an untrusted source or one that can’t be modified, you may have to add sanity checks to the ETL ingesting the data. This is an approximate business rule that you implement during the ETL process. It can also be a set of reports you run after the ETL runs to give a sense of the data quality. Dealing with the results feeds back into the culture of data. You will have to negotiate with the source team on how these errors are resolved.
Security
Building a data-centered culture won’t solve all security problems, and access to more data may increase the chance that something is misconfigured or leaked. However an emphasis on data and educating all teams on data is a good step toward improving security. Security is technical by nature, but there are a few things that need to be addressed in a data-centric organization.
Adding a virtualization layer can help centralize security. The method you use to assign security may differ by the base system, but the virtualization layer can help. A virtualization layer also helps reduce data duplication.
Data extracts are a necessary evil in most environments. Reducing the number of extracts and files moved is a good step toward improving security. It’s always good to explore why data extracts are requested and make sure there isn’t a better solution. Once data is extracted, it’s much more difficult to secure.
Poorly implemented security is another indicator that the data culture needs to be improved. You don’t want gaps in your security, but you also want a security implementation that is easy to manage. This requires some basic documentation and closely followed standards. All users interacting with data need to understand the importance of keeping data secure. This doesn’t come naturally to most users and will need to be reiterated frequently. Good base security practices make it easier for users to understand security boundaries, but this is a good opportunity for training.
Best practices should be followed in data systems and make securing data easier. Look at my post on SQL Server security for some details. Basics include no shared accounts, the principle of least privilege (only assigning the minimum security necessary), no security applied directly to users – they should be in groups or SQL roles, the network should be locked down, especially for cloud service, and no default passwords. There are security best practices for specific products and those should be followed as well.
Audits
A final piece of building a culture of data is closely examining the current data maturity of your organization and ensuring your plan is getting followed in the months and years to come. The basic function of audits on your data culture is to ensure that standards are followed. You want to be sure there are no, or minimal, gaps in your implementation. These gaps will be either tech debt that needs to be addressed or just items that have been missed. Audits will also look at areas for improvement, including lack of consistency or opportunities to change processes. Audits can also show areas where training could help workers and the business.
When you see the term audits in a technical sense, you may think immediately of security. Security plays a big part, but it is more than just security. Audits are broadly categorized into technical audits, security audits, and business/process audits. Each of these areas is related to the others and the boundaries are somewhat arbitrary. It can help to separate the audits so that different things are emphasized. It also allows the work to be split into different, specialized teams. Each team will have some overlap, but that increases the chance that you will catch issues.
These shouldn’t be punitive audits. The purpose of the audit is to carefully examine how everything is getting run and see if the changes are effective. Or if the changes have even happened. The audits should be a feedback loop to the start of the process and allow everyone to evaluate success measures.
Infrastructure
The specific tools used shouldn’t impact your overall strategy or philosophy. But your infrastructure needs to considered. This helps with your NFRs, eases configuration issues, impacts cost, and has a direct impact on how teams work.
Infrastructure decisions impact all other areas. They can make things easier or much more difficult. This is a balancing act. Costs need to be compared against actual performance requirements. Security must be compared against ease-of-use.
Cost Management
Managing the cost of your data platforms should also be part of the overall plan. Implementing a major shift in the organization isn’t free. It has both a time and resource cost. The eventual goal is to save money or create opportunities to make more money in the long term.
Reducing redundancies, improving methodologies, standardizing tools, fixing data errors, and empowering users will all save money. But not in the short term. These things will take time and money, so be prepared for that. Someone, or even a whole team, needs to manage those costs and be sure resources are being used efficiently.
Relationship to Data Mesh
Data mesh has been a hot topic for a few years. Building a culture of data doesn’t mean that you need to build that culture using a data mesh philosophy. The actual technical implementation and philosophy can differ, but the important aspect is the focus on data. Pushing the responsibility for data sets and data products to the groups that know the data best makes sense in many ways, but it isn’t a requirement. You can still use virtualization or other technologies related to data mesh.
This also relates to self-service data. Unmanaged self-service data can become a problem akin to Excel files and Access databases. They serve a purpose but are an indicator that user needs aren’t getting met quickly enough. You can’t replace them completely, but they need to be managed. As an organization, you need to decide the path for self-service data. As mentioned, it can be a monolithic warehouse, a data mesh strategy utilizing a virtualization layer, or any other strategy. But you have to have a strategy. And the teams need to understand the strategy and work toward it.
Summary
As I end this two-part series on building a culture of data, I want to make it clear that doing this isn’t to follow some pre-prescribed set of tools moving to the new thing. Doing so becomes an objective without a use case. Technology isn’t the solution. It can help, (and really is a necessary part of the process) but overhauling business processes is the primary part of the solution.
Without a good set of processes in place (or at least designed), technology won’t help. It can even be detrimental. Don’t get me wrong, things like a good data dictionary/data glossary tool are useful components with or without a good data culture. The ability to find data in the enterprise and reduce overlap in your data stores is invaluable. But the discipline to use the data and tools is part of the mindset or you will fail. Not just of a single project, but of all projects exposing data and all projects using data.
Data quality issues are one of the biggest things I see when working with data. There is often a disconnect between the end-users and data specialists regarding data quality. It can be seen as overly strict or rigorous to insist on quality, but bad quality has many downstream impacts. Everyone must take personal responsibility for data quality. It can’t be an afterthought – it must be an integral part of your processes.
Load comments