The Ultimate Guide to Data Cleaning
When the data is spewing garbage
I spent the last couple of months analyzing data from sensors, surveys, and logs. No matter how many charts I created or how sophisticated the algorithms were, the results were always misleading.
Throwing a random forest at the data is the same as injecting it with a virus. A virus that has no intention other than hurting your insights, as if your data is spewing garbage.
Even worse, when you show your new findings to the CEO, and oops, guess what? He or she found a flaw, something that doesn't smell right; your discoveries don't match their understanding of the domain. After all, they are domain experts who know better than you, the analyst or developer.
Right away, the blood rushes to your face, your hands shake, a moment of silence, followed by, probably, an apology.
That's not great at all. What if your findings were taken as a guarantee, and your company ended up making a decision based on them?
You ingested a bunch of dirty data, didn't clean it up, and told your company to do something with results that turn out to be wrong. You're going to be in a lot of trouble!
Wrong or inconsistent data leads to false conclusions. So, how well you clean and understand the data has a high impact on the quality of the results.
Two real examples are given on Wikipedia.
For instance, a government may want to analyze population census figures to decide which regions require further spending and investment on infrastructure and services. In this case, it is important to have access to reliable data to avoid erroneous fiscal decisions.
In the business world, incorrect data can be costly. Many companies use customer information databases that record data like contact information, addresses, and preferences. For instance, if the addresses are inconsistent, the company will suffer the cost of resending mail or even losing customers.
Garbage in, garbage out.
In fact, a simple algorithm can outperform a complex one simply because it was given enough high-quality data.
Quality data beats fancy algorithms.
For these reasons, it is important to have a step-by-step guideline, a cheat sheet, that walks through the quality checks to be applied.
But first, what is the thing we are trying to achieve? What does quality data mean? What are the measures of quality data? Understanding what you are trying to achieve, your ultimate goal, is critical prior to taking any actions.
Index:
- Data Quality (validity, accuracy, completeness, consistency, uniformity)
- The workflow (inspection, cleaning, verifying, reporting)
- Inspection (data profiling, visualizations, software packages)
- Cleaning (irrelevant data, duplicates, type conversion, syntax errors, 6 more)
- Verifying
- Reporting
- Final words
Data quality
Frankly speaking, I couldn't find a better explanation of the quality criteria than the one on Wikipedia. So, I am going to summarize it here.
Validity
The degree to which the data conforms to defined business rules or constraints. (A small pandas sketch checking a few of these follows the list.)
- Data-Type Constraints: values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.
- Range Constraints: typically, numbers or dates should fall within a certain range.
- Mandatory Constraints: certain columns cannot be empty.
- Unique Constraints: a field, or a combination of fields, must be unique across a dataset.
- Set-Membership Constraints: values of a column come from a set of discrete values, e.g., enum values. For example, a person's gender may be male or female.
- Foreign-Key Constraints: as in relational databases, a foreign-key column can't have a value that does not exist in the referenced primary key.
- Regular Expression Patterns: text fields that have to be in a certain pattern. For example, phone numbers may be required to have the pattern (999) 999-9999.
- Cross-Field Validation: certain conditions that span across multiple fields must hold. For example, a patient's date of discharge from the hospital cannot be earlier than the date of admission.
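To make these constraints concrete, here is a minimal pandas sketch; the table and its column names are hypothetical, used only for illustration:

import pandas as pd

# Hypothetical patients table, used only for illustration.
df = pd.DataFrame({
    "age": [34, -2, 51],
    "gender": ["male", "female", "unknown"],
    "admitted": pd.to_datetime(["2020-01-05", "2020-02-01", "2020-03-10"]),
    "discharged": pd.to_datetime(["2020-01-09", "2020-01-20", "2020-03-12"]),
})

bad_age = df[df["age"] < 0]                                # range constraint
bad_gender = df[~df["gender"].isin({"male", "female"})]    # set-membership constraint
bad_dates = df[df["discharged"] < df["admitted"]]          # cross-field validation

Each resulting frame holds the rows that violate one rule, which is usually more useful than a single pass/fail answer.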
Accuracy
The degree to which the data is close to the true values.
While defining all possible valid values allows invalid values to be easily spotted, it does not mean that they are accurate.
A valid street address might not actually exist. A valid eye color, say blue, might be valid, but not true (it doesn't represent reality).
Another thing to note is the difference between accuracy and precision. Saying that you live on Earth is actually true. But not precise. Where on Earth? Saying that you live at a particular street address is more precise.
Completeness
The degree to which all required data is known.
Missing data is going to happen for various reasons. One can mitigate this problem by questioning the original source if possible, say by re-interviewing the subject.
Chances are, the subject is either going to give a different answer or will be hard to reach again.
Consistency
The degree to which the data is consistent, within the same dataset or across multiple datasets.
Inconsistency occurs when two values in the dataset contradict each other.
A valid age, say 10, might not match the marital status, say divorced. A customer may be recorded in two different tables with two different addresses.
Which one is true?
Uniformity
The degree to which the data is specified using the same unit of measure.
The weight may be recorded either in pounds or kilos. The date might follow the US format or the European format. The currency is sometimes in USD and sometimes in YEN.
So data must be converted to a single unit of measure.
The workflow
The workflow is a sequence of four steps aiming at producing high-quality data and taking into account all the criteria we've talked about.
- Inspection: Detect unexpected, incorrect, and inconsistent data.
- Cleaning: Fix or remove the anomalies discovered.
- Verifying: After cleaning, the results are inspected to verify correctness.
- Reporting: A report about the changes made and the quality of the currently stored data is recorded.
What you see as a sequential process is, in fact, an iterative, endless process. One can go from verifying back to inspection when new flaws are detected.
Inspection
Inspecting the data is time-consuming and requires using many methods for exploring the underlying data for error detection. Here are some of them:
Data profiling
Summary statistics about the data, called data profiling, are really helpful to give a general idea about the quality of the data.
For example, check whether a particular column conforms to particular standards or patterns. Is the column recorded as a string or a number?
How many values are missing? How many unique values are in a column, and what is their distribution? Is this dataset linked to, or does it have a relationship with, another?
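Most of these questions can be answered with a few lines of pandas; a minimal sketch, assuming the data sits in a hypothetical data.csv:

import pandas as pd

df = pd.read_csv("data.csv")           # hypothetical file name

print(df.dtypes)                       # recorded as string or number?
print(df.describe(include="all"))      # summary statistics per column
print(df.isna().sum())                 # how many values are missing
print(df.nunique())                    # how many unique values per column
print(df["country"].value_counts())    # distribution of one (assumed) column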
Visualizations
By analyzing and visualizing the data using statistical methods such as mean, standard deviation, range, or quantiles, one can find values that are unexpected and thus erroneous.
For example, by visualizing the average income across countries, one might see that there are some outliers (the original article links to an image here). Some countries have people who earn much more than anyone else. Those outliers are worth investigating and are not necessarily incorrect data.
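As a sketch of the idea (assuming the profiled frame from above has an income column), a histogram and a box plot make such outliers stand out immediately:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(df["income"].dropna(), bins=50)    # reveals skew and extreme values
ax1.set_title("Income distribution")
ax2.boxplot(df["income"].dropna())          # points beyond the whiskers are outlier candidates
ax2.set_title("Income box plot")
plt.show()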
Software packages
Several software packages or libraries available in your language will let you specify constraints and check the data for violations of these constraints.
Moreover, they can not only generate a report of which rules were violated and how many times, but also create a graph of which columns are associated with which rules.
The age, for example, can't be negative, and neither can the height. Other rules may involve multiple columns in the same row, or across datasets.
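Even without a dedicated package, a handful of named boolean masks gives you a basic violation report. A sketch, reusing the hypothetical columns from earlier plus an assumed height column:

rules = {
    "age must be non-negative": df["age"] < 0,
    "height must be positive": df["height"] <= 0,
    "discharge on/after admission": df["discharged"] < df["admitted"],
}

for name, mask in rules.items():
    print(f"{name}: {int(mask.sum())} violation(s)")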
Cleaning
Data cleaning involves different techniques based on the problem and the data type. Different methods can be applied, each with its own trade-offs.
Overall, incorrect data is either removed, corrected, or imputed.
Irrelevant data
Irrelevant data are those that are not actually needed and don't fit under the context of the problem we're trying to solve.
For example, if we were analyzing data about the general health of the population, the phone number wouldn't be necessary (column-wise).
Similarly, if you were interested in only one particular country, you wouldn't want to include all the other countries. Or, to study only those patients who went into surgery, we wouldn't include everyone (row-wise).
Only if you are sure that a piece of data is unimportant may you drop it. Otherwise, explore the correlation matrix between feature variables.
And even if you noticed no correlation, you should ask someone who is a domain expert. You never know: a feature that seems irrelevant could be very relevant from a domain perspective, such as a clinical perspective.
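A quick way to explore those relationships is the correlation matrix over the numeric columns; a minimal sketch:

import matplotlib.pyplot as plt

corr = df.select_dtypes(include="number").corr()    # pairwise correlations
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)  # heatmap view of the matrix
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()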
Duplicates
Duplicates are data points that are repeated in your dataset.
It often happens when, for example:
- Data are combined from different sources
- The user hits the submit button twice, thinking the form wasn't actually submitted.
- A request for an online booking was submitted twice, correcting wrong information that was entered accidentally the first time.
A common symptom is two users having the same identity number. Or the same article being scraped twice.
And therefore, they simply should be removed.
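In pandas this is usually a one-liner; the key column below is an assumed name, used for illustration:

df = df.drop_duplicates()    # drop rows that are exact copies of an earlier row
df = df.drop_duplicates(subset=["identity_number"], keep="first")    # or dedupe on a key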
Type conversion
Make sure numbers are stored as numerical data types. A date should be stored as a date object, or as a Unix timestamp (number of seconds), and so on.
Categorical values can be converted into and from numbers if needed.
This can be spotted quickly by taking a peek at the data types of each column in the summary (discussed above).
A word of caution: values that can't be converted to the specified type should be converted to an NA value (or similar), with a warning being displayed. This indicates the value is incorrect and must be fixed.
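This is exactly what the errors="coerce" mode in pandas does: anything that can't be parsed becomes NA, so it can be counted and fixed. A sketch with assumed column names:

import pandas as pd

df["age"] = pd.to_numeric(df["age"], errors="coerce")              # bad numbers become NaN
df["admitted"] = pd.to_datetime(df["admitted"], errors="coerce")   # bad dates become NaT

print(df["age"].isna().sum(), "age values failed to convert")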
Syntax errors
Remove white spaces: extra white spaces at the beginning or the end of a string should be removed.
"   hello world   " => "hello world"
Pad strings: strings can be padded with spaces or other characters to a certain width. For example, some numerical codes are often represented with leading zeros to ensure they always have the same number of digits.
313 => 000313 (6 digits)
Fix typos: strings can be entered in many different ways, and no wonder, they can have mistakes.
Gender
m
Male
fem.
FemalE
Femle
This categorical variable is considered to have five different classes, and not the two expected (male and female), since each value is different.
A bar plot is useful to visualize all the unique values. One can find that some values are different but mean the same thing, e.g., "information_technology" and "IT". Or, perhaps, the difference is just in the capitalization, e.g., "other" and "Other".
Therefore, our duty is to recognize from the above data whether each value is male or female. How can we do that?
The first solution is to manually map each value to either "male" or "female":
dataframe['gender'].map({'m': 'male', 'fem.': 'female', ...})
The second solution is to use pattern matching. For example, we can look for the occurrence of m or M at the beginning of the string:
re.sub(r"^m$", "male", "M", flags=re.IGNORECASE)    # => "male"
The third solution is to use fuzzy matching: an algorithm that identifies the distance between the expected string(s) and each of the given ones. Its basic implementation counts how many operations are needed to turn one string into another.
Gender   male  female
m        3     5
Male     1     3
fem.     5     3
FemalE   3     2
Femle    3     1
Furthermore, if you have a variable like a city name, you may suspect typos or similar strings that should be treated the same. For example, "lisbon" can be entered as "lisboa", "lisbona", "Lisbon", etc.
City     Distance from "lisbon"
lisbon   0
lisboa   1
Lisbon   1
lisbona  2
london   3
...
If so, then we should replace all values that mean the same thing with one unique value. In this example, replace the first four strings with "lisbon".
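A minimal fuzzy-matching sketch using Python's standard-library difflib (which scores similarity rather than counting edit operations, but serves the same purpose here):

import difflib

EXPECTED = ["lisbon", "london", "paris"]

def canonicalize(value, choices=EXPECTED, cutoff=0.6):
    # Map a raw value to its closest expected spelling, if it is close enough.
    match = difflib.get_close_matches(value.lower(), choices, n=1, cutoff=cutoff)
    return match[0] if match else value

for raw in ["lisboa", "Lisbon", "lisbona", "london"]:
    print(raw, "->", canonicalize(raw))    # the first three map to "lisbon"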
Watch out for values like "0", "Not Applicable", "NA", "None", "Null", or "INF"; they might all mean the same thing: the value is missing.
Standardize
Our duty is not only to recognize the typos but also to put each value in the same standardized format.
For strings, make sure all values are either in lower or upper case.
For numerical values, make sure all values have a certain measurement unit.
The height, for example, can be recorded in meters and centimetres; a difference of 1 meter would then be treated the same as a difference of 1 centimetre. So, the job here is to convert the heights to one single unit.
For dates, the US version is not the same as the European version. Recording the date as a timestamp (a number of milliseconds) is not the same as recording the date as a date object.
Scaling / Transformation
Scaling means transforming your data so that it fits within a specific scale, such as 0-100 or 0-1.
For example, a student's test scores can be re-scaled to percentages (0-100) instead of a GPA (0-5).
It can also help make certain types of data easier to plot. For example, we might want to reduce skewness to help with plotting (when there are many outliers). The most commonly used functions are log, square root, and inverse.
Scaling can also take place on data that has different measurement units.
Student scores on different exams, say the SAT and the ACT, can't be compared since the two exams are on different scales; a difference of 1 in an SAT score would be treated the same as a difference of 1 in an ACT score. In this case, we need to re-scale SAT and ACT scores to fall, say, between 0-1.
By scaling, we can plot and compare different scores.
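A minimal min-max scaling sketch; the score columns are assumed names:

for col in ["sat_score", "act_score"]:
    lo, hi = df[col].min(), df[col].max()
    df[col + "_0_1"] = (df[col] - lo) / (hi - lo)    # squeeze each scale into 0-1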
Normalization
While normalization also rescales the values into a range of 0-1, the intention here is to transform the data so that it is normally distributed. Why?
In most cases, we normalize the data if we're going to be using statistical methods that rely on normally distributed data. How?
One can use the log function, or perhaps use one of these methods.
Depending on the scaling method used, the shape of the data distribution might change. For example, the "Standard Z score" and "Student's t-statistic" (given in the link above) preserve the shape, while the log function might not.
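A sketch of both options on an assumed income column:

import numpy as np

# Standard Z score: recentres and rescales, but preserves the shape.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Log transform: changes the shape, pulling in a long right tail.
df["income_log"] = np.log1p(df["income"])    # log(1 + x), safe for zeros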
Missing values
Given that missing values are unavoidable, we are left with the question of what to do when we encounter them. Ignoring the missing data is the same as digging holes in a boat; it will sink.
There are three, or perhaps more, ways to deal with them.
— One. Drop.
If the missing values in a column rarely happen and occur at random, then the easiest and most straightforward solution is to drop observations (rows) that have missing values.
If most of the column's values are missing and occur at random, then a typical decision is to drop the whole column.
This is particularly useful when doing statistical analysis, since filling in the missing values may yield unexpected or biased results.
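Both variants are one-liners in pandas; the 70% cut-off below is an arbitrary choice for illustration:

df_rows = df.dropna()                          # drop rows with any missing value
df_cols = df.loc[:, df.isna().mean() <= 0.7]   # drop columns that are mostly missing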
— Two. Impute.
This means calculating the missing value based on other observations. There are quite a lot of methods to do that.
— First: using statistical values like the mean or median. However, none of these guarantees unbiased data, especially if there are many missing values.
The mean is most useful when the original data is not skewed, while the median is more robust, not sensitive to outliers, and thus used when the data is skewed.
In normally distributed data, one can get all the values that are within two standard deviations from the mean. Next, fill in the missing values by generating random numbers between (mean - 2 * std) and (mean + 2 * std):
import numpy as np

# average_age, std_age, and count_nan_age are computed from the data beforehand.
rand = np.random.randint(average_age - 2 * std_age, average_age + 2 * std_age, size=count_nan_age)
dataframe.loc[dataframe["age"].isna(), "age"] = rand
— Second: using a linear regression. Based on the existing data, one can calculate the best-fit line between two variables, say, house price vs. size in m².
It is worth mentioning that linear regression models are sensitive to outliers.
— Third: hot-deck, copying values from other similar records. This is only useful if you have enough available data. And it can be applied to numerical and categorical data.
One can take the random approach, where we fill in the missing value with a random value. Taking this approach one step further, one can first divide the dataset into two groups (strata), based on some characteristic, say gender, and then fill in the missing values for the different genders separately, at random.
In sequential hot-deck imputation, the column containing missing values is sorted according to auxiliary variable(s) so that records that have similar auxiliaries occur sequentially. Next, each missing value is filled in with the value of the first following available record.
What is more interesting is that 𝑘-nearest-neighbour imputation, which groups similar records together, can also be utilized. A missing value is then filled in by first finding the 𝑘 records closest to the record with missing values. Next, a value is chosen from (or computed out of) the 𝑘 nearest neighbours. Where a value is computed, statistical methods like the mean (as discussed before) can be used.
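A minimal sketch with scikit-learn's KNNImputer, assuming it is installed and applied to the numeric columns only:

import pandas as pd
from sklearn.impute import KNNImputer

numeric = df.select_dtypes(include="number")
imputer = KNNImputer(n_neighbors=5)    # each gap is filled from the 5 most similar rows
filled = pd.DataFrame(imputer.fit_transform(numeric),
                      columns=numeric.columns, index=numeric.index)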
— Three. Flag.
Some argue that filling in the missing values leads to a loss of information, no matter what imputation method we use.
That's because saying that the data is missing is informative in itself, and the algorithm should know about it. Otherwise, we're just reinforcing the pattern already given by the other features.
This is particularly important when the missing data doesn't happen at random. Take, for example, a conducted survey where most people from a specific race refuse to answer a certain question.
Missing numeric data can be filled in with, say, 0, but these zeros must be ignored when calculating any statistical value or plotting the distribution.
Categorical data can be filled in with, say, "Missing": a new category which tells us that this piece of data is missing.
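A sketch of flagging, with assumed column names:

df["age_missing"] = df["age"].isna()             # keep the flag so the zeros can be ignored later
df["age"] = df["age"].fillna(0)                  # numeric filler
df["gender"] = df["gender"].fillna("Missing")    # explicit "Missing" category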
— Take into consideration …
Missing values are not the same as default values. For example, zero can be interpreted as either missing or default, but not both.
Missing values are not "unknown". In a conducted research study where some people didn't remember whether they had been bullied at school or not, their answers should be treated and labelled as unknown, not missing.
Every time we drop or impute values we are losing information. So, flagging might come to the rescue.
Outliers
They are values that are significantly different from all other observations. Any data value that lies more than (1.5 * IQR) away from the Q1 and Q3 quartiles is considered an outlier.
Outliers are innocent until proven guilty. With that being said, they should not be removed unless there is a good reason to.
For example, one can notice some weird, suspicious values that are unlikely to happen, and so decide to remove them. Though, they are worth investigating before removal.
It is also worth mentioning that some models, like linear regression, are very sensitive to outliers. In other words, outliers might throw the model off from where most of the data lie.
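The 1.5 * IQR rule above translates directly into pandas; the income column is an assumed name:

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["income"] < lower) | (df["income"] > upper)]
print(len(outliers), "outlier candidates worth investigating")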
In-record & cross-dataset errors
These errors result from having two or more values in the same row, or across datasets, that contradict each other.
For example, if we have a dataset about the cost of living in cities, the total column must be equal to the sum of rent, transportation, and food.
city    rent  transportation  food  total
libson  500   20              40    560
paris   750   40              60    850
Similarly, a child can't be married. An employee's salary can't be less than the calculated taxes.
The same idea applies to related data across different datasets.
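A sketch of such a cross-field check on the table above:

parts = df[["rent", "transportation", "food"]].sum(axis=1)
bad_totals = df[parts != df["total"]]    # rows where the parts don't add up to the total
print(bad_totals)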
Verifying
When done, one should verify correctness by re-inspecting the data and making sure its rules and constraints do hold.
For example, after filling in the missing data, the values might violate some of the rules and constraints.
It might involve some manual correction if not possible otherwise.
Reporting
Reporting how healthy the data is, is as important as the cleaning itself.
As mentioned before, software packages or libraries can generate reports of the changes made, which rules were violated, and how many times.
In addition to logging the violations, the causes of these errors should be considered. Why did they happen in the first place?
Final words …
If you made it this far, I am happy you were able to hold on until the end. But none of what was mentioned is valuable without embracing a quality culture.
No matter how robust and strong the validation and cleaning process is, one will continue to suffer as new data comes in.
It is better to guard yourself against a disease than to spend the time and effort remedying it.
These questions help to evaluate and improve the data quality:
How was the data collected, and under what conditions? The environment where the data was collected does matter. The environment includes, but is not limited to, the location, timing, weather conditions, etc.
Questioning subjects about their opinion on any topic while they are on their way to work is not the same as doing so while they are at home. Patients in a study who have difficulties using tablets to answer a questionnaire might throw off the results.
What does the data represent? Does it include everyone? Only people in the city? Or, perhaps, only those who opted to answer because they had a strong opinion about the topic?
What are the methods used to clean the data, and why? Different methods can be better in different situations or with different data types.
Do you invest the time and money in improving the process? Investing in people and the process is as critical as investing in the technology.
And finally, … it doesn't go without saying,
Thank you for reading!
Feel free to reach out on LinkedIn or Medium.
Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
