How to Clean Data in SQL Server

Michael Flores
Geek Culture
Published in
8 min readApr 4, 2021

--

For my blog this week, I wanted to shift gears towards developing my SQL skills further. One of the topics that I wanted to focus on was how to work with messy data and how to properly clean it. Maintaining clean data is an essential part of the data science process. It allows for easy navigation and exploration of the data for further analysis.

In order to learn more about how data cleaning is done in SQL, I took DataCamp’s “Cleaning Data in SQL Server Databases” course. In the course, I learned how to deal with improper string values, missing or duplicate values, converting data types, and more. The dataset that was mainly used in this course is based on USA monthly airline flights in 2014 and 2015.

Cleaning Strings

Leading Zeroes

The first exercise I had to deal with was filling numbers with leading zeroes. In the dataset provided, we have a column named registration_code which contains values of different formats.

Sample of registration_code column values

Some values in the column contain only three digits but the correct format is with nine digits. In order to fix the values with missing digits, we need to add leading zeroes to the values that are missing zeroes. We utilize the REPLICATE() and the CONCAT() function.

SELECT 
CONCAT(REPLICATE('0', 9 -LEN(registration_code)), registration_code) AS registration_code
FROM flight_statistics

The REPLICATE() function will repeat a string a specified number of times. For the values that have only three digits, we’ll need to add six zeroes. By getting the length of the values using LEN(), we can then subtract that length from 9 and obtain the number of leading zeroes we need to add. We then concatenate that string of zeroes with the original value and we have fixed the formatting for that column. Another way of doing this is using the FORMAT() function.

SELECT FORMAT(CAST(registration_code AS INT), '000000000') AS registration_code
FROM flight_statistics

The FORMAT() function will format a value with the specified format. We cast our registration_code as an integer and then reformat it so that it consists of nine digits. Any value that does not have nine digits will be formatted to have zeroes filled in for missing digits.

Fixed format of registration_code column

Messy Strings

A common mistake we can find with strings are extra leading and trailing spaces. Some values of the airport_name column have extra spaces. This can be resolved using the TRIM() function which removes a specified character from the start and end of a string. In versions of SQL Server older than 2017, we utilize RTRIM() and LTRIM(), where RTRIM removes trailing spaces and LTRIM removes leading spaces.

SELECT airport_code, TRIM(airport_name) AS airport_name, airport_city, airport_state 
FROM airports

There may be instances in which we have different string values within a column to represent the same value. Suppose that we wanted to query our dataset to find only the airports located in the city Chicago. In the airport_city column, Chicago is represented as ‘Chicago’ or as an abbreviation such as ‘CH’ or ‘ch’. We would want to find a way to combine these cases into one singular value for ease of analysis.

Suppose in the previous example that we wish to have only the city name spelled out. In order to replace the incorrect values, we use the REPLACE() function. REPLACE() will replace all instances of a substring with another substring we specify. Our first instinct would be to replace all instances of ‘CH’ with ‘Chicago’ but this poses a problem since it would turn ‘Chicago’ into ‘Chicagoicago’. The best way to approach this is to use a CASE statement that would only replace values that are not ‘Chicago’.

SELECT airport_code, airport_name,
CASE
WHEN airport_city <> 'Chicago' THEN REPLACE(airport_city, 'ch', 'Chicago')
ELSE airport_city
END AS airport_city,
airport_state
FROM airports
WHERE airport_code IN ('ORD', 'MDW')

Missing and Duplicate Data

Missing Data

There may be instances in which we have missing values within our columns. This can happen for a variety of reasons, such as the intentional lack of a value or some error upon inputting the data. You have a number of options at your disposal for dealing with missing data. We can choose to leave the rows with missing values as is, remove them from our queries, or use some other value as a replacement value, such as the average value for that column.

If we want to simply select all the rows where a column value is not missing, we need to add WHERE column_name IS NOT NULL to our queries. If instead we want to find the rows that have missing values for that column, we instead use IS NULL .

SELECT *
FROM airports
WHERE airport_city IS NOT NULL

What about the instances in which the value is not actually null but has a value such as a blank space or ‘UNKNOWN’? In the WHERE clause of our queries, we can use <> or = to isolate the rows we want. If we know that blank spaces are being used for null values, we can select rows where airport_city is not a blank space as airport_city <> ‘’.

-- Exclude the rows where airport_city is missing
SELECT *
FROM airports
WHERE airport_city <> ''

We may not necessarily want to exclude rows with null values. However, in order to proceed further, we cannot have NULL values in these columns. Suppose we wanted to use ‘Unknown’ as a placeholder value for columns in which we do have null values. To replace these null values, we use ISNULL() and state that the string we want to replace them with is ‘Unknown’.

SELECT airport_code, airport_name,
ISNULL(airport_city, 'Unknown') AS airport_city,
ISNULL(airport_state, 'Unknown') AS airport_state
FROM airports

Now let’s suppose we wanted to create a new column called location. To generate the values in this column, we wanted to pull the first non-null value from multiple columns, first from airport_city and then from airport_state. If there are null values in both columns, we will return the string ‘Unknown’ as the value for location. We will use COALESCE() which will return the first non-null result.

SELECT airport_code, airport_name, airport_city, airport_state,
COALESCE(airport_city, airport_state, 'Unknown') AS location
FROM airports
Sample result from query above. Note how COALESCE works with the null columns.

Duplicated Data

Duplicated data can come in a number of forms. It may simply be repeated rows in which the values for all columns are identical. There may also be rows in which a majority of columns or only a few columns share identical values. In order to help distinguish duplicated rows that are not apparent, we will need to identify the repeating group. The repeating group consists of the columns that are meant to be identifying for a unique instance of an entity. For this dataset, airport_code, carrier_code, and registration_date are considered the repeating group since they are only registered once a month.

We will use the ROW_NUMBER() function to help identify duplicate rows. In the PARTITION BY clause, we specify our repeating group. The first instance of a will have a row number of 1, meaning it is the first instance of that particular combination of the repeating group columns. If there are multiple instances of that repeating group, their row number will sequentially increase based on the number of rows that have previously been identified.

SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
airport_code,
carrier_code,
registration_date
ORDER BY
airport_code,
carrier_code,
registration_date
) row_num
FROM flight_statistics

If we wanted to identify solely non-duplicate rows, we will call the previous query and have it exclude duplicate rows by only finding rows with a row_num of 1.

WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
airport_code,
carrier_code,
registration_date
ORDER BY
airport_code,
carrier_code,
registration_date
) row_num
FROM flight_statistics
)
SELECT * FROM cte
-- Exclude duplicates
WHERE row_num = 1;

Converting Data Types

The data type of a column is important in determine what operations we can perform on it. For example, we won’t be able to perform comparisons or use aggregate functions on columns of VARCHAR type. By using either the CAST() or CONVERT() functions, we can specify what data type we want to convert our columns into.

SELECT AVG(CAST(num_ratings AS INT))
FROM series
WHERE CAST(num_ratings AS INT) BETWEEN 0 AND 5000
---------------------------------------------------------
SELECT AVG(CONVERT(INT, num_ratings))
FROM series
WHERE CONVERT(INT, num_ratings) BETWEEN 0 AND 5000

Date Formatting

Dates can have a number of different formats depending on location and language. Month/day/year, day/month/year, and year/month/day are just some of the formats in which we can represent a date. If we wish to change the date format to another format, we will again need to use CAST() or CONVERT() .

SELECT CAST(registration_date AS DATE), 101) AS registration_date
FROM flight_statistics
WHERE CONVERT(VARCHAR(10), CAST(registration_date AS DATE), 101)
BETWEEN '01/01/2014' AND '06/30/2014'
----------------------------------------------------------------
SELECT FORMAT(CAST(entry_date AS DATE), 'dd/MM/yyyy') AS entry_date
from pilots

For CONVERT(), we denote the style in which we wish to convert through an integer that corresponds to a specific style. For instance , if I wanted to convert a date column to mm/dd/yyyy format, I would use 101 as the style. A list of formats that date can be converted to can be found at this link. For FORMAT(), we specify the format we want with specific parameters, such as short date/long date, the actual desired order for days, months, year, and an optional language parameter. The list of available formats can be found here.

Inaccurate Data

Something we may come across in our dataset are values that are outside the range of expected values. It would be incorrect to see someone’s height listed as 400 inches or a movie rating of 11 on a 10 point scale. If we know the expected range for those values, we can query for rows in which the value for that column is BETWEEN the upper and lower limits of the range. The same applies for finding rows that are outside of the range. We can use either the BETWEEN keyword or any comparison operators like >, <, <=, >=.

SELECT * 
FROM series
WHERE num_ratings >= 0 AND num_ratings <= 5000

These were some of the ways in which I learned how to clean data in Microsoft SQL Server. These techniques provide a proper starting point for most instances but every situation will be different from each other. There may be instances in which more intricate coding or other functions are required. As always, using your best judgement is crucial in determining the best approach for handling dirty data. For now, I feel as though I am more proficient at being able to properly clean my data in SQL and look forward to using what I learned in the future. Thank you for reading!

--

--