What is this about?
Data visualization is a great tool when dealing with data or information. It can be used to gain a better understanding of a subject or as a tool for communication. A wide range of techniques has been developed over a very long period of time [1], and there is much more to choose from than the famed pie chart. Feel free to explore the possibilities and avoid pie charts [2], unless you are visualizing pies of course.
Visualization is often one of the later steps in a larger process. In short, good data is needed to get good visuals; there is no point in visualizing bad data. How do you get good data? That depends. In the lab, to help you get quality data, you might follow a process like this:
- Identification and research of how to acquire the data
- Acquisition of the data
- Processing of the data
- Extraction of the results
- Validation of the data/results
- Visualization of the data/results
It is important to be flexible. The process above is only an example, and you may need to jump back a few steps, especially on newer adventures. For example, a visual may be finished before it reveals that the data does not make any sense. It is then time to evaluate what went wrong, back up a few steps and try again.
The process leading up to and including visualization can be long and laborious. It is also complex and varies greatly depending on the subject. As such, we will focus instead on a very brief introduction to data visualization: essentially, why it is important. This will be followed by a few questions to ask when starting a data visualization project and a short overview of the techniques available.
Why you should visualize data
Visualizing data can be an important part of dealing with data. It can play a crucial role in understanding the underlying data points [3,4] (I really recommend reading that article; it has good visuals). We are pretty bad at understanding numbers [5], be they very small, very large or simply very numerous. Even when the numbers are relatively small and there are not too many data points, it can still become tricky. A good example of this is Anscombe's quartet [6]. It is tempting to think that comparing the averages of 3 datasets is sufficient, since averages are easy to understand and fall within values that we are comfortable with. But think again: in reality, the datasets may differ wildly.
For example, take these averages of 3 different datasets with 256 sequential values: 127, 131 and 128. The minimum value is 0 and the maximum value is 255. These averages look similar, but if we visualize them the story becomes quite different:
All this to say: data visualization is an important part of data analysis. It should be done correctly and with consideration.
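The point about averages can also be checked numerically. The sketch below is only an illustration: the three datasets (`ramp`, `flat` and `extremes`) are hypothetical and are not the ones behind the averages quoted above. Each holds 256 values between 0 and 255 and all three have exactly the same mean, yet their spread is completely different.

```python
from statistics import mean, pstdev

N = 256  # same number of values as in the example above

# Three hypothetical datasets, all bounded by 0 and 255.
ramp = list(range(N))                         # 0, 1, 2, ..., 255
flat = [127] * (N // 2) + [128] * (N // 2)    # hovers around the middle
extremes = [0] * (N // 2) + [255] * (N // 2)  # only the two end values

for name, data in [("ramp", ramp), ("flat", flat), ("extremes", extremes)]:
    # The mean is identical for all three; the standard deviation is not.
    print(f"{name:>8}: mean={mean(data):.1f}  std dev={pstdev(data):.1f}  "
          f"min={min(data)}  max={max(data)}")
```

A single summary number hides the shape of the data. Plotting the three lists, even roughly, would show a diagonal line, a nearly flat line and two separate clusters.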
🤔🤔🤔🤔🤔🤔🤔
Before building new visuals to communicate a newly acquired dataset, the first step should be to think about the data in front of you. Starting by graphing, sketching and drawing randomly may work sometimes, but more often than not, bad "variables" will be chosen and the visuals won’t be optimal. It might work if your field has already defined everything needed because the data is consistent or you have been working with similar data for years.
To ease into the process, it can be helpful to ask questions about the data. These questions can then lead to more questions, which lead to many more. Here are a few that could be a good starting point to better understand the data and help decide how to display your information.
A very good place to start is asking: who is your audience? Are you the only person who is going to see this or maybe a few coworkers? Is it for a report that will be looked at by the whole company? Maybe you are creating a visual for the marketing department showcasing a key element.
The previous question is important because it helps you answer the following questions:
- What familiarity and knowledge does your audience have of the subject?
- What type of understanding will your audience have of the visual type? Will a scatter plot work or will a simple image make a longer lasting impression and be easier to understand?
You may have other complications to think about:
- Does the visualization need to happen in an automated way? For example, is this visualization part of a report that needs to be generated weekly, maybe even hourly? This may severely limit your options. Relation graphs, for instance, can be hard to automate while keeping them consistent and understandable.
- As an extreme case of the question above, does your data need live visualization? This brings yet another layer of complexity and questioning.
- The size of your dataset may also limit what can be done with it. Do you have a few entries or a datacenter's worth of information? Not every piece of software will be able to visualize all of that.
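One common way to cope with a dataset that is too large for a given tool is to summarize it before plotting. The sketch below is a minimal illustration under stated assumptions, not a recommendation of a specific method: the function name `downsample` and the sample `readings` are hypothetical. It splits a long series into buckets and keeps the minimum, mean and maximum of each one, so that extreme values are not lost.

```python
def downsample(values, buckets):
    """Summarize `values` as at most `buckets` (min, mean, max) triples."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    # Ceiling division, so that no more than `buckets` chunks are produced.
    size = max(1, -(-len(values) // buckets))
    summary = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        summary.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return summary

# Hypothetical data: 100,000 readings that cycle between 0 and 99.
readings = [i % 100 for i in range(100_000)]
points = downsample(readings, 50)
print(len(points), "points to draw instead of", len(readings))
```

Keeping the minimum and maximum alongside the mean is a deliberate choice here: a plot built from means alone would smooth away the spikes that are often the interesting part.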
This may not apply in all cases, or it may be the only thing to think about, but it can be useful to consider the relations, and types of relations, in your dataset. This includes relations within the dataset itself, relations with other datasets and the general context that the visual will be displayed in. The first 2 are very useful in the context of relation graphs: you can't build a relation graph without relations. Thinking about them will help you understand how complex the visual might become, what level of complexity is acceptable and what might reduce it to make the visual more readable. The last one helps determine what information needs to be included with the visual. One example is the error associated with the data. It can be a very useful piece of information to display, and it may even be required in some cases.
It may be worth asking whether visualizing is the correct way to communicate the information at all. Depending on the context, words might be a better choice.
I almost forgot this one, because it's the most obvious one, but also possibly the most complex: think about the data you have.
After you have asked all the questions and have answers to some of them, you can start ruling out specific visualization techniques, and you will have a better feel for what might work or not. You might also have new questions that need answering.
To boil all of this down, the main questions are essentially:
- What is my information?
- Who will see this and what is their familiarity with the subject?
The answers will inform a lot of your decisions.
This did not really fit elsewhere, but I felt like it should be included, so let’s take a small detour to talk about color. The color scheme can sometimes be changed easily after the fact, but not always, which is why you should think about it a bit beforehand. It may have a large impact depending on the context, who is looking at the visual and where it is being looked at. For example, on a more general and less formal visual, you may choose less readable colors that match the rest of the material more closely, while in a scientific publication you will choose colors that make the visual easier to read. If you are projecting your visual, the colors may get distorted and muted, making them indistinct. It's also a good idea to consider color blindness when choosing a color scheme: a beautiful color gradient can become a mush of color that can no longer be interpreted if you are color blind.
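One rough way to sanity-check a color scheme is to look at lightness alone. The sketch below is an assumption-laden illustration and not a color blindness simulation: it computes the relative luminance of sRGB colors and checks whether a palette gets steadily lighter or darker. A gradient that passes tends to survive grayscale printing and muted projectors better, which often helps color blind readers too. The two palettes and the function names are hypothetical.

```python
def relative_luminance(rgb):
    """Relative luminance (0 = black, 1 = white) of an 8-bit sRGB color."""
    def linearize(channel):
        c = channel / 255
        # Standard sRGB transfer function: undo the gamma encoding.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(ch) for ch in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def lightness_is_monotonic(palette):
    """True if the palette only gets lighter, or only gets darker."""
    lum = [relative_luminance(color) for color in palette]
    pairs = list(zip(lum, lum[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)

# Hypothetical palettes, given as (red, green, blue) triples.
rainbow = [(255, 0, 0), (255, 255, 0), (0, 255, 0), (0, 0, 255)]
single_hue = [(8, 48, 107), (66, 146, 198), (158, 202, 225), (247, 251, 255)]

print("rainbow:", lightness_is_monotonic(rainbow))
print("single hue:", lightness_is_monotonic(single_hue))
```

The rainbow palette jumps up and down in lightness, while the single-hue palette gets steadily lighter. This check is only a first filter; viewing the actual visual through a color blindness simulator remains the more reliable test.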
The fun part
All the questions have been asked. It is now time to choose which type of visual is needed. We will go over some of the available options. At the end of the article are a few pieces of software that can help you get started with visualization.
"The ones you probably expect"
This first group is the standard bread and butter of data visualization. While these charts don't do everything well, a lot of data can be mushed into them. They work well with numeric data.
They include your standard and common chart types: line, bar/histogram and scatter plots (see Figure 4).
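To show what a histogram actually does with numeric data, here is a minimal sketch that uses only the Python standard library and draws with text instead of a charting tool. The function name `text_histogram` and the `scores` values are hypothetical, and the code assumes integer values.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Return one line of text per bin, with one '#' per value in the bin."""
    # Map every value to the lower edge of its bin, then count per bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    if not counts:
        return []
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        end = start + bin_width - 1
        lines.append(f"{start:>3}-{end:<3} {'#' * counts[start]}")
    return lines

# Hypothetical data: 15 test scores.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 79, 82, 85, 91]
print("\n".join(text_histogram(scores, 10)))
```

The grouping step is the whole idea: a histogram trades the individual values for a count per range, which is what makes the shape of the distribution visible.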
Relation graphs are useful for visualizing relations. They can be hard to maintain and get unwieldy as they grow. They often need to be built manually to be logical, but with too much data, we must leave the placement and sorting to algorithms. Auto-generated graphs rely on proper tagging and sorting of the data. They have a different look and feel, which forces us to use them differently. It can also become hard to extract useful information out of very large graphs.
This group includes flow-chart-type visuals and relation graphs (see Figure 5).
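As an illustration of leaving the placement to an algorithm, the sketch below turns a list of relations into text in the DOT language, which layout tools such as Graphviz can read. Graphviz is not mentioned in this article and is only an assumption here; the function name `to_dot` is hypothetical, and the edges reuse the process steps listed at the start of the article. The code assumes that node names contain no double quotes.

```python
def to_dot(edges, name="relations"):
    """Render (source, target) pairs as a directed graph in DOT text."""
    lines = [f"digraph {name} {{"]
    # Sorting and de-duplicating keeps the output identical between runs,
    # which helps when the graph is regenerated automatically.
    for source, target in sorted(set(edges)):
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical relations: the process steps, plus one step back.
edges = [
    ("acquire", "process"),
    ("process", "extract"),
    ("extract", "validate"),
    ("validate", "visualize"),
    ("validate", "acquire"),
]
dot = to_dot(edges)
print(dot)
```

The output only describes which nodes are connected. Where each node ends up on the page is decided by the layout tool, which is why auto-generated graphs look and feel different from hand-built ones.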
There are a few other types of visualization that don't quite fit into the other categories. This section could include all the domain-specific ways to visualize data.
A few examples could include heat maps or topographic maps. Not every industry has a use for those visualization methods. It could also include a simple table with columns and rows filled with data.
Infographics can be great for condensing a lot of information into a small visual space. They also have the advantage of often allowing a quick understanding of what is at play, but the disadvantage of skipping over details.
Uniquely built visuals
This is possibly the most complex option and would often require collaboration with a few people with diverse skills. It is possible to develop visuals specifically to communicate the underlying data.
In this article, we talked about why you should visualize data, went over some questions to get you thinking about data visualization and briefly covered what visuals are available. A few pieces of software and references that I find useful are listed below and could be a useful starting point.
The main thing I hope you take away from this is to think about who your visuals are for and how they represent your data. Make your data tell a story.
Did I miss any important information in my story? Do you have any questions that you ask when building visuals?
Marc Cormier, Jr Web developer
Missing Link Technologies
Other Article by Marc:
Basic Programming Skills Might be a Good Idea for Someone in the Sciences
Useful references and links
- Microsoft Excel
- Google Sheets
- Matplotlib/Jupyter Notebook
- The Visual Display of Quantitative Information [2]
- Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing [3,4]
- A Better Default Colormap for Matplotlib [7]
- Perceptual Color Maps in matplotlib for Oceanography [8]
Other Related Articles Created by MLT:
- Visualizing Deep Learning
- Visualizing deep learning: Part 2
- Visual Guide To Software Development - A Series
1. Friendly, M. A brief history of data visualization. in Handbook of data visualization 15–56 (Springer, 2008).
2. Tufte, E. R. The Visual Display of Quantitative Information. (Graphics Press, 1983).
3. Matejka, J. & Fitzmaurice, G. Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems - CHI ’17 1290–1294 (ACM Press, 2017). doi:10.1145/3025453.3025912
4. Matejka, J. & Fitzmaurice, G. The Datasaurus Dozen - Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing | Autodesk Research. Available at: https://www.autodeskresearch.com/publications/samestats. (Accessed: 5th April 2019)
5. Resnick, I., Newcombe, N. S. & Shipley, T. F. Dealing with Big Numbers: Representation and Understanding of Magnitudes Outside of Human Experience. Cogn. Sci. 41, 1020–1041 (2017).
6. Anscombe, F. J. Graphs in statistical analysis. Am. Stat. 27, 17–21 (1973).
7. Smith, N. & Walt, S. van der. A Better Default Colormap for Matplotlib.
8. Thyng, K. Perceptual Color Maps in matplotlib for Oceanography.