At CAN, we use the visualization software Tableau as needed to create reports and dashboards for our clients. Because Tableau handles large amounts of data very quickly, we’ve also started using it to explore data visually during the data discovery stage of each project. We use Tableau to check the quality of data, find outliers, and get a sense of the properties of a data set, such as dispersion, central tendency, and clustering, before we apply statistical analysis or build predictive models. One Tableau feature that is especially useful for exploring data is Reference Lines.
This blog post explains a few ways that CAN uses Tableau to explore a data set.
For this example, we’ll look at this table of Metropolitan Statistical Areas of the United States of America (MSAs). Within this table, we are interested in four variables for each MSA: rank, 2010 population, 2000 population, and the change in population between the two decennial censuses. Tableau will geocode states to create maps, but to access that feature we need to separate the MSA name from the state. Some MSAs are also located within more than one state, like the Omaha-Council Bluffs, NE-IA MSA. So, to access the maps feature, we can create a new variable called ‘primary state’ by manipulating the data.
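As a rough illustration of that manipulation, here is a minimal Python sketch. It assumes MSA labels follow the pattern "Name, ST" or "Name, ST-ST", and that the first state listed is treated as the primary state. Both are assumptions made for this example, and the function name `split_msa` is hypothetical.

```python
def split_msa(label):
    """Split an MSA label into (name, primary_state, all_states).

    Assumes the state codes follow the LAST comma and are joined by hyphens.
    Treating the first listed state as "primary" is an assumption.
    """
    if "," not in label:
        # No state portion found; leave the label as the name.
        return label.strip(), None, []
    # rpartition splits on the last comma, so hyphens or commas
    # inside the city names are left alone.
    name, _, states = label.rpartition(",")
    state_list = [s.strip() for s in states.split("-")]
    return name.strip(), state_list[0], state_list


name, primary_state, all_states = split_msa("Omaha-Council Bluffs, NE-IA")
print(name, "|", primary_state, "|", all_states)
```

The same split can be done in a spreadsheet or in a Tableau calculated field; the point is only that the state code has to live in its own column before Tableau can geocode it.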
Now that we have data, we can connect it to Tableau. If you are following along, you might find a problem at this point: the percent change, or growth rate, is listed as a dimension, not as a measure as we would expect. This is because the formatting is off for some of the percentages in the change column: some of the numbers are entered as text. This shows us very quickly that the data needs to be normalized before we can do the analysis. With this data set we might have found the error in an Excel spreadsheet or another program, but in bigger data sets, looking at all of the values one by one is a waste of time. Tableau can help you assess the quality of the data very quickly, often faster than filtering and sorting in other programs. By visualizing the data before running statistical tests or data mining tools, we can find these types of problems and get a clear picture of the data in one step.
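One way to normalize a column like this outside Tableau is sketched below in Python. The sample values are made up for illustration, the helper name `to_percent` is hypothetical, and the sketch assumes that entries already stored as numbers are in percent units.

```python
def to_percent(value):
    """Return a float percent from a number or a text entry such as ' 3.1% '.

    Returns None when the entry cannot be parsed, so bad rows can be reviewed.
    """
    if isinstance(value, (int, float)):
        return float(value)
    # Strip whitespace, percent signs, and thousands separators from text entries.
    text = str(value).strip().replace("%", "").replace(",", "")
    text = text.replace("\u2212", "-")  # normalize a Unicode minus sign
    try:
        return float(text)
    except ValueError:
        return None


# Made-up sample mixing numeric and text entries, as in the problem described above.
raw_changes = [12.4, "3.1%", " -2.20% ", "n/a"]
cleaned = [to_percent(v) for v in raw_changes]
bad_rows = [i for i, v in enumerate(cleaned) if v is None]
print(cleaned, bad_rows)
```

Once every value in the column is numeric, Tableau should be able to treat the field as a measure.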
Here’s the file, MSA public data, with a primary state variable added and all the data normalization issues resolved, for you to play with.
Tableau has many tools that are useful for exploring data, and one of the most helpful is Reference Lines. You can add a Reference Line to a graph by right-clicking on the axis of a measure variable. The Add Reference Line dialog box allows you to add Lines, Bands, and Distributions. Lines emphasize an important value or threshold. Bands emphasize a range of values. Distributions emphasize more than one range of values, for example, values within plus or minus one standard deviation, quartiles, and percentiles. These are very flexible tools that can be used for many applications. Sometimes we set Distributions to show standard deviations so we can quickly visualize the dispersion of the data, see whether it looks normally distributed, and spot any statistical outliers that might need to be removed from the data set.
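To make the standard deviation idea concrete, here is a small Python sketch of the arithmetic behind such a band. It is not Tableau's implementation: the growth values are made up, the function names are hypothetical, and using the sample standard deviation is a choice made for this example.

```python
from statistics import mean, stdev


def sd_bands(values, k=1):
    """Return (low, center, high) for a band of k standard deviations around the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption of this sketch)
    return center - k * spread, center, center + k * spread


def outside_band(values, k=2):
    """Return the values that fall outside mean +/- k standard deviations."""
    low, _, high = sd_bands(values, k)
    return [v for v in values if v < low or v > high]


# Made-up growth rates in percent, for illustration only.
growth = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
low, center, high = sd_bands(growth, k=1)
print(f"band: {low:.2f} to {high:.2f} around {center:.2f}")
print("outside one standard deviation:", outside_band(growth, k=1))
```

Values outside the band are candidates for a closer look rather than automatic removal; a Distribution in Tableau shows the same band visually without any code.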
Tableau’s Reference Lines are a good example of how CAN emphasizes pre-attentive processing in our visualizations. Lines, Bands, and Distributions create enclosures that people understand intuitively: no one needs to stop and think to see that data above a line is different from data below it. When the reference line or band represents an important business metric like profitability, customer loyalty, or a budgeted value, our clients understand their performance against that metric immediately.
If you would like more information, download our eBook, "Dashboards: Take a closer look at your data".