Tidy Data, Tidy Types, and Tidy Operations
How to organize your data and choose allowed operations to support data science, machine learning, and data stewardship.
Hadley Wickham adapted a quote from Leo Tolstoy when he proclaimed, “Like families, tidy datasets are all alike but every messy dataset is messy in its own way” [Wickham2014], and he stated that around 80% of the work in data analysis is spent on data wrangling. More recent studies still indicate that 50% to 70% of work time is spent on data wrangling, which means that only 30% to 50% of the work time is left for generating insights and value. Untidy or messy data thus obstruct business and research ventures and represent a steadily growing cost factor in data stewardship, with negative side effects not only for your current business and value stream but also for opportunity exploration and research.
The second problem area is the use of correct data types and of the mathematical operations permitted for those types, both of which are needed to obtain semantically valid and stable statements from your analysis. Using the wrong data types or operations subtly endangers the validity of an analysis or the results of a machine learning model.
To mitigate or completely eliminate the above-mentioned risks, we describe how to obtain “Tidy Data, Tidy Types, and Tidy Operations”, which reduces data stewardship costs and increases flexibility for actual data science and machine learning.
The foundations of tidy data are closely related to the principles of Codd’s relational algebra, especially Codd’s third normal form, and define a standard “way of mapping the meaning of a dataset to its structure” [Wickham2014]. Tidy data, i.e. data prepared for analysis, is thus defined as follows:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Data persisted in any other way is considered messy data!
Tidy data makes it easy for data scientists and machine learning engineers to extract the variables they need, because it provides a standard way of structuring data.
Ordering variables is not strictly necessary, but a well-thought-out order makes it easier to get a first overview of the data to be analyzed. It is therefore recommended to put fixed variables first, followed by the measured variables:
- Fixed variables.
- Measured variables.
- Order by fixed, then measured variables.
According to Wickham, the five most common violations of tidy data are:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
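As a brief preview of the first violation, column headers that are values, the following minimal sketch uses pandas `melt` to turn the headers into a variable. The table, its column names (`country`, `cases`) and its numbers are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical messy table: the headers "2019" and "2020" are values
# of a variable (the year), not variable names.
messy = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2019": [10, 20],
        "2020": [12, 18],
    }
)

# melt() moves the header values into a "year" column, so that each row
# holds exactly one observation: (country, year, cases).
tidy = messy.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)

# Fixed variables (country, year) come first; rows are ordered by them.
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)

print(tidy)
```

The result has one column per variable and one row per observation, which is the form the three rules of tidy data ask for.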
If a relational database adheres to Codd’s 3rd normal form, it should be easy to extract arbitrary observational units for analysis and to persist analysis results. Furthermore, operating on tidy data is a fundamental step for adopting data stewardship practices.
Before presenting specific violation examples and how to tidy them using Python and R in the next blog post, we introduce the typical methods used for data manipulation.
Data manipulation is mostly accomplished with the help of four simple base operations.
- Filter: Create a subset by removing or keeping observations based on a condition.
- Transform: Add or modify variables, either one at a time or several at once.
- Aggregate: Calculate a single value from multiple values.
- Sort: Manipulate the order of observations.
Python with Pandas and R provide multiple functions implementing the four base operations. These functions are typically combined with a “by” clause, i.e. a grouping, so that calculations can be performed on distinct subsets.
Ideally, an operation on tidy input data results in equally tidy output data, because tidy datasets are easy to combine, e.g. with the help of a join operator.
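The four base operations, including a grouped ("by") aggregation, can be sketched in pandas as follows. The weather table and its column names (`station`, `day`, `temp_c`) are made up for this example, and the pandas functions shown are only one of several ways to express each operation.

```python
import pandas as pd

# Hypothetical tidy input: one row per (station, day) observation.
df = pd.DataFrame(
    {
        "station": ["north", "north", "south", "south"],
        "day": [1, 2, 1, 2],
        "temp_c": [10.0, 14.0, 20.0, 22.0],
    }
)

# Filter: keep the observations that satisfy a condition.
warm = df[df["temp_c"] > 12.0]

# Transform: add a variable derived from an existing one.
df = df.assign(temp_f=df["temp_c"] * 9 / 5 + 32)

# Aggregate "by" station: one value per distinct subset of observations.
mean_by_station = df.groupby("station", as_index=False)["temp_c"].mean()

# Sort: change the order of the observations.
hottest_first = df.sort_values("temp_c", ascending=False)

print(mean_by_station)
```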
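A small sketch of this "tidy in, tidy out" idea, assuming two hypothetical tables (`measurements` and `stations`) that share a `station` key: the aggregation result is kept as a regular table and then joined with pandas `merge`.

```python
import pandas as pd

# Hypothetical tidy tables, one per type of observational unit.
measurements = pd.DataFrame(
    {"station": ["north", "north", "south"], "temp_c": [10.0, 14.0, 20.0]}
)
stations = pd.DataFrame(
    {"station": ["north", "south"], "altitude_m": [520, 35]}
)

# as_index=False keeps "station" as an ordinary column, so the
# aggregated result is again a tidy table.
summary = measurements.groupby("station", as_index=False).agg(
    mean_temp_c=("temp_c", "mean")
)

# Both inputs are tidy, so a join on the shared key combines them.
combined = summary.merge(stations, on="station", how="left")

print(combined)
```

Keeping the grouping key as a column rather than as an index is a design choice made here so that the output can feed directly into the next filter, transform, aggregate, or sort step.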
Tidy Types and Tidy Operations
Although Wickham already described tidy types, essentially as a single variable being either a string or a number, we want to give an overview of these tidy types and their features, as they have implications for which operations are semantically valid in an analysis. Value domains for types should be chosen with great care, because a transformation of a variable from one domain to another is permissible if and only if it preserves the structure of the original domain!
Tidy types come in five scales of measurement presented in the following taxonomy table adapted from the book Pattern Recognition.
| Trait | Nominal Scale | Ordinal Scale | Interval Scale | Ratio Scale | Absolute Scale |
|---|---|---|---|---|---|
| Empirical relation | Equivalence $\sim$ | Equivalence $\sim$, Ordering $<$ | Equivalence $\sim$, Ordering $<$ | Equivalence $\sim$, Ordering $<$ | Equivalence $\sim$, Ordering $<$ |
| Empirical operation | | | | Addition $+$, Multiplication $\times$ | Addition $+$ |
| Allowed transformation | $m' = f(m)$, $f$ injective | $m' = f(m)$, $f$ strictly increasing | $m' = am + b$ with $a > 0$ | $m' = am$ with $a > 0$ | $m' = m$ |
| Typical domain | Integers, names | Integers | Real numbers | Real numbers | Natural numbers |
| Expressiveness | very low | low | medium | high | very high |
| Examples | Mobile numbers, postal codes, gender, scale name | School grades, degree of hardness, wind intensity, scale expressiveness | Temperature in °F, calendar time, geographic altitude | Temperature in K, electric current, account balance, edge length | Electron count, Euler characteristic, number of test failures |
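The "allowed transformation" row can be made concrete with a short sketch. It assumes two unit conversions as examples: Celsius to Fahrenheit as an interval-scale transformation $m' = am + b$, and metres to centimetres as a ratio-scale transformation $m' = am$. The helper names and the sample values are hypothetical.

```python
from statistics import mean


def to_fahrenheit(celsius):
    # Interval-scale transformation m' = a*m + b with a = 9/5 > 0, b = 32.
    return celsius * 9 / 5 + 32


def to_centimetres(metres):
    # Ratio-scale transformation m' = a*m with a = 100 > 0.
    return metres * 100


temps_c = [10.0, 20.0, 40.0]
temps_f = [to_fahrenheit(t) for t in temps_c]

# The mean commutes with the affine map, so it stays meaningful
# on an interval scale.
mean_commutes = abs(to_fahrenheit(mean(temps_c)) - mean(temps_f)) < 1e-9

# A quotient of interval-scale values depends on the unit:
# 20 / 10 in Celsius differs from 68 / 50 in Fahrenheit.
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

lengths_m = [1.5, 3.0]
lengths_cm = [to_centimetres(x) for x in lengths_m]

# On a ratio scale the quotient survives the change of unit.
ratio_m = lengths_m[1] / lengths_m[0]
ratio_cm = lengths_cm[1] / lengths_cm[0]

print(mean_commutes, ratio_c, ratio_f, ratio_m, ratio_cm)
```

A statement such as "twice as warm" is therefore not stable for temperatures in Celsius or Fahrenheit, while "twice as long" is stable for lengths.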
To obtain correct analyses of high significance, it is necessary to select the correct data types and domains for an analysis on the one hand, and to perform only semantically correct operations on these data types on the other. This is what tidy types and tidy operations mean.
Finally, it should be noted that tidy data should be built on correct types and domains, the so-called tidy types. Operations on tidy input data should be derived from the operations allowed for the data domain, so that they generate tidy output data for downstream analysis.
Keeping and maintaining data in a tidy form is one of the main concerns of data stewardship, and it facilitates and accelerates data analysis and machine learning operations.
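One way to let the data type itself restrict the permitted operations is an ordered categorical type in pandas. The sketch below is an illustration under assumptions: the rating levels are hypothetical, and it relies on current pandas versions raising `TypeError` when a mean is requested for categorical data.

```python
import pandas as pd

# Hypothetical ordinal variable: survey ratings with a declared order.
rating_type = pd.CategoricalDtype(["poor", "fair", "good"], ordered=True)
ratings = pd.Series(["good", "poor", "fair", "good"], dtype=rating_type)

# Order-based operations are defined on an ordinal scale.
best = ratings.max()
at_least_fair = int((ratings >= "fair").sum())

# Arithmetic is not defined: pandas refuses to average the categories.
try:
    ratings.mean()
    mean_allowed = True
except TypeError:
    mean_allowed = False

print(best, at_least_fair, mean_allowed)
```

Declaring the scale in the type moves the check from the analyst's memory into the data itself, so a semantically invalid operation fails loudly instead of producing a misleading number.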