JustinDanielMeyer.com - Teaching - Visual Presentation of Data

Graphing and the Visual Presentation of Data
by Justin Daniel Meyer
version: 09Sep05

Part 1: Introduction and Basics
Part 2: Graph Elements
Part 3: Graph Types
Part 4: Putting the Graph Together
Part 5: Conclusion and Other Resources

Trendline Types

Merely graphing a series of data is often not the whole story. Most plots are expected to follow a trend or conform to some sort of mathematical model. This is where the trendline comes into play. Using the formula of the expected trend, a line is plotted as close as possible to the data in order to mimic the trend. When the fit is optimized (commonly by the least-squares method, which minimizes the sum of the squared differences between the data and the line), often through iterative calculation, the trend function is said to be at its "best fit". It is a good habit to calculate the trend of a data series, even if not requested or expected, to determine how well it conforms to the expected behavior.
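
As a rough illustration of what "best fit" means for the simplest case, the sketch below fits a straight line by least squares and reports R^2 as a measure of how well the data conform. For a straight line the least-squares answer has a closed form, so no iteration is needed here; non-linear models usually do require it. The function name and the sample numbers are made up for this example.

```python
def linear_least_squares(xs, ys):
    """Return (slope, intercept, r_squared) for the best-fit line y = slope*x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares the leftover (residual) scatter with the total scatter
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


# Hypothetical measurements that roughly follow y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
slope, intercept, r2 = linear_least_squares(xs, ys)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r2:.4f}")
```

An R^2 close to 1 suggests the data follow the expected straight-line behavior; a noticeably lower value is a hint that another model, or another look at the data, may be in order.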

The trendlines on a graph should be treated similarly to the plotted series of data. In fact, most math/spreadsheet packages automatically include trendlines in the legend. Trendlines should be formatted (do not blindly accept the default formats) so they do not overwhelm the data series, plot line, or data point markers--and are not themselves drowned out. Multiple trendlines seldom require significant formatting for the viewer to tell which series each one belongs to. Going to such lengths is only required when the plots are tightly packed.

When analyzing data for a lab report or similar type of work, whether trendlines are expected is hit-or-miss. Although the requirements will vary, the small amount of work required to model the data will not go unappreciated. In fact, although you may not believe the data needs to be modeled, your viewer may not agree with you, and any reasonable effort you can expend to facilitate their understanding is more than worth it.

With the ability to model data comes the desire to predict "future" data and estimate alternative points within the data range. Care must be taken when extrapolating and interpolating, respectively. The distance from the nearest valid data point, and how well that point itself fits the model, will affect the accuracy of the estimation to some degree. It will further depend on the type of trendline, the overall number of points in the series, and other factors.
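
One simple way to keep this caution in view is to label every estimate as an interpolation or an extrapolation and to note how far it sits from the nearest measured point. The helper below is only a sketch; its name and the sample x values are invented for illustration.

```python
def classify_estimate(x_new, xs):
    """Label an estimate at x_new and report its distance to the nearest measured x."""
    if not xs:
        raise ValueError("xs must not be empty")
    lo, hi = min(xs), max(xs)
    # Inside the measured range is interpolation; outside it is extrapolation
    kind = "interpolation" if lo <= x_new <= hi else "extrapolation"
    nearest_gap = min(abs(x_new - x) for x in xs)
    return kind, nearest_gap


# Hypothetical x positions at which data were actually measured
measured_x = [0, 1, 2, 3, 4]
for x_new in (2.5, 4.5, 10):
    kind, gap = classify_estimate(x_new, measured_x)
    print(f"x = {x_new}: {kind}, {gap} from the nearest data point")
```

The larger the reported gap, and especially when the label reads "extrapolation", the less weight the estimate deserves.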

The most common types of trendlines are linear, polynomial (including parabolic), exponential, power (which, for whole-number exponents, is technically a simple type of polynomial), logarithmic, asymptotic, and running average. There are a few other, less commonly used types, but they are suited only to particular kinds of data.
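
Of these, the running average is the only one that is not a fitted formula: it simply smooths the series by averaging neighboring points. A minimal sketch follows, with an assumed window of three points and made-up readings.

```python
def running_average(values, window):
    """Mean of each run of `window` consecutive points (output is shorter than input)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical noisy readings
readings = [3, 5, 4, 8, 6, 9, 7]
smoothed = running_average(readings, 3)
print([round(v, 2) for v in smoothed])
```

A wider window gives a smoother line but hides more of the short-term variation, so the window size is itself a presentation choice worth stating on the graph.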

Data & Line Format (Multiple plots)

Many graphs have more than one series of data plotted. Some effort must be made to permit the viewer to easily determine the difference between these different data series. This may be accomplished through the use of different types of data markers, colored plot lines, patterned plot lines, shaded/colored areas, and drawing elements. Of course, a combination of these will work well, all of which require a legend on the graph.

Due to the popularity of Microsoft Excel and users' willingness to let the graphing "wizard" control the output, the most common multi-plot distinction mechanism is a combination of data marker type and plot line color. On screen, this works out fine (color choices aside), but for increasing numbers of series it quickly breaks down to data marker type alone. [This is still hard for the viewer, because the wizard goes through a series of marker types with gradual changes, e.g. solid square to solid triangle, not dramatic changes such as solid square to open circle to cross.] The difficulties are rather marked when printing in black and white.

In order to anticipate B&W output and other unforeseen issues, care should be taken to create series plots which are distinctive in several ways, if possible. In the Effect of Employee Dress on Sales graph, different plot line colors are used, as are different data point markers. It would have been even better if the plot tracing the number of employees wearing ties had hollow markers and a dashed plot line. However, the drawing elements really help out in this respect, so it is probably not too big an issue for most viewers printing in black and white.
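
The idea of making each series distinctive in more than one way can be sketched as a small style table that changes marker shape, marker fill, and line pattern together. The style names and series names below are placeholders, not the settings of any particular graphing program.

```python
from itertools import cycle

# Neighboring entries are deliberately dissimilar (solid square, open circle, cross, ...)
MARKERS = [("square", "filled"), ("circle", "open"), ("cross", None),
           ("triangle", "filled"), ("diamond", "open")]
LINE_PATTERNS = ["solid", "dashed", "dotted", "dash-dot"]


def assign_styles(series_names):
    """Give every series its own marker and line pattern so it survives B&W printing."""
    markers = cycle(MARKERS)
    patterns = cycle(LINE_PATTERNS)
    styles = {}
    for name in series_names:
        shape, fill = next(markers)
        styles[name] = {"marker": shape, "fill": fill, "line": next(patterns)}
    return styles


# Hypothetical series names
for name, style in assign_styles(["Sales", "Ties worn", "Foot traffic"]).items():
    print(name, style)
```

Because the two lists have different lengths, the marker and line combinations do not repeat until 20 series have been styled, which is far more than a readable graph should carry.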

With few or many series, it is helpful to include a trend in the variation used to aid the viewers' ability to distinguish between series and identify trends. In the Effect of Employee Dress on Sales graph, the data plot is black and the sales plot is green (like American dollars). A similar trend is used in Net Revenues for Pipe Fittings, Inc., where red signifies a loss for the year and black indicates a profit for the respective year. In the Rapid crystallization of MOCVD Al2O3 and particularly in the Vertical Annealing System Temp. Profiles graphs, color is used to denote the relative temperature at which successive readings were taken. In these two, lower temperatures are denoted by cooler colors (black, blue) and higher temperatures by warmer colors (yellow, red). [As an intellectual note, this order does not reflect the true order of visible radiation given off by heated materials. Emission of visible radiation typically begins around 700 degrees C with red and progresses to red-orange, orange, yellow, white, white-blue, blue.]

The choice of how to distinguish between series plots yet show a smooth progression will often require a delicate touch. The balance between contrast and closeness, between what is pleasing to the eye and what is still easily distinguishable, can be judged but remains subjective. The best method is to use known associations and trends. One typical trend is temperature: red/hot to yellow/warm to blue/cold. For a plain ordered series without an obvious association, one could easily use early/light gray to now/medium gray to future/black, using time/grayscale as an example. Some graphs are strictly associative. For a biomass graph the associations might be along the lines of Table 2, below. In the World Energy Supplies by Source graphs--absolute and percentage--the legend shows another set of associations, listed in Table 3, below.
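
For the plain ordered case (early/light gray to future/black), evenly spaced gray levels are one reasonable starting point. The sketch below assumes a 0.0 = black, 1.0 = white convention and an arbitrary lightest value of 0.75 so the first series is still visible on white paper; both are assumptions to adjust by eye.

```python
def gray_ramp(n, lightest=0.75, darkest=0.0):
    """Return n gray levels (0.0 = black, 1.0 = white), evenly spaced, lightest first."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [darkest]
    step = (darkest - lightest) / (n - 1)
    return [round(lightest + i * step, 3) for i in range(n)]


# Three series in time order: early, now, future
print(gray_ramp(3))
```

Even spacing keeps the progression smooth; if two neighboring grays prove hard to tell apart in print, reducing the number of series is usually a better fix than stretching the ramp.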

Table 2
Color	Biomass
Blue	Water
White/Clear	Ice
Brown	Earth
Green	Foliage
Yellow	Crops
Light green	Suburban areas
Black	Metropolitan areas

Table 3
Color	Energy Source
Red	Nuclear
Yellow	Solar
Blue	Hydroelectric
White/Clear	Wind
Light gray	Natural gas
Black	Oil
Dark gray	Coal
Brown	Wood

Other examples abound, such as the use of fill types to denote geological sample profiles and the use of color coding in complicated data sets to show a 3rd "dimension" on a 2-D graph. In the end, the use of color and other types of plot element variation is good for everything from pointing out key points/features in simple graphs to pre-digesting and cataloging large sets of information for the viewer in multi-plot graphs. All element and style variations must be used to effectively show contrast between data/series and progression across the same.

Overall Layout and Graph Element Content

The graph elements need to be sized appropriately for the viewer to use easily. The defaults of many programs set the font sizes too small or the locations too far from the edge of the screen/page. While this correctly places emphasis on the plot area and series, it is not an optimal use of the space for presenting your information to the viewer. Elements such as graph title, axis labels, and scale numbers are the most common victims of this.

There are several graph elements which must be included in a graph, no matter the simplicity or audience. The graph title, axis labels, and scale are critical to complete and accurate conveyance of your point to the viewer. In addition, however, these and other elements must be accurate and useful. For example, the graph title should explain your concept, not reiterate the axis labels. "Location vs. Temp." is not useful; Vertical Annealing System Temp. Profiles is useful. Using the same graph, an axis labeled "Temperature" or "Position" is not as descriptive or accurate as "Set Point (deg C)" or "Position Relative to Center of Furnace (inches)". Notice that in the more descriptive axis titles, units were included. Units must always be included in axis titles; there are no exceptions! Even if the y-axis has arbitrary units--typical for intensity measurements--this can be noted as "a.u." or spelled out. Data point labels should also carry units (if necessary, only the y-data can be displayed for space and clarity purposes); the units may be omitted only in the case of a y-scale of arbitrary units.

Attention to detail when configuring graph elements pays dividends. Ensure the units in your scale labels are present, complete, and correct--do not forget to include the degree symbol. Duplicate labeling on scales and titles should be avoided, and while the plot area can often be extended closer to the graph title and axis labels, do not get too close (the height of one scale number is a good spacing gauge to use). Text effects such as shadowing, outlining, and underlining on labels are not acceptable. The use of text size and bolding with occasional use of italics is more than sufficient to convey the information to the viewer in a visually efficient and pleasing manner.

In another section, we briefly reviewed plot density and readability. In addition to too many plots on a graph, it is also possible to have an excess of other elements. Auto-formatting of axes can easily introduce an unnecessary number of scale numbers. A good range, depending on data and needs, is 3-20 scale numbers. Corresponding to these scale references are grid lines. The use of a few lines at a key interval to permit ready comparison is very desirable, but seldom are more than 2 grid lines per 10% of scale height/length needed. Grid lines should be in the background, especially when numerous--use 20-50% gray and make them dashed, in order to keep them from distracting the viewer and being anything more than a reference.
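
When a program's automatic scale is too crowded, the scale numbers can be chosen by hand. The sketch below uses the common 1-2-5 progression to pick round values without exceeding a chosen maximum count. The function name and the default of 10 are assumptions made for this example, not a rule from any graphing package.

```python
import math


def nice_ticks(lo, hi, max_ticks=10):
    """Round scale numbers inside [lo, hi], at most max_ticks of them, on a 1-2-5 step."""
    if hi <= lo or max_ticks < 2:
        raise ValueError("need hi > lo and max_ticks >= 2")
    raw_step = (hi - lo) / (max_ticks - 1)
    magnitude = 10 ** math.floor(math.log10(raw_step))
    # Smallest step of the form 1, 2, 5, or 10 times a power of ten that is wide enough
    for multiplier in (1, 2, 5, 10):
        step = multiplier * magnitude
        if step >= raw_step:
            break
    first = math.ceil(lo / step) * step
    ticks = []
    k = 0
    while first + k * step <= hi + 1e-9 * step:
        ticks.append(round(first + k * step, 10))
        k += 1
    return ticks


print(nice_ticks(0, 100, max_ticks=6))
print(nice_ticks(3, 47, max_ticks=6))
```

Grid lines can then be drawn at some or all of these same positions, which keeps the grid tied to numbers the viewer can actually read off the scale.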

Last among the elements, the title should be as complete as possible. If necessary, size should be sacrificed to permit another line of text. Often, it is best to construct the title last, when the picture and data analysis are pretty much complete.

As mentioned earlier, several graphing programs have wizards and features which are turned on by default, adding to viewer distraction. Microsoft Excel is one of the more habitual offenders, due to its overwhelming market share. The most common assaults are the plot area coloring and the improper (and often unnecessary) use of color. It should be obvious to all that the use of a non-white plot area or graph background is not acceptable if it serves no significant purpose (other than to waste ink). Providing 'contrast' to the series lines is not a purpose, but providing additional information for the viewer is desirable. Many graphs plot only a few series, negating the necessity of color to differentiate between them. Automatic color selection aside, most graphs may easily be constructed with 2-3 levels of gray, 1-2 line styles, and variation in the plot line weight. This permits the ready identification of a series among 5-6 others, before color is even involved. Color could, however, make the differentiation even easier, so if a color printer is known to be at hand, it may be helpful to use it.

Labeling Data Points and Lines (If Necessary)

The viewer may or may not be able to take their time interpreting and understanding your graph. It is therefore prudent to call attention to specific points or features if they are not immediately obvious to the average viewer. A simple label on a data point or feature of importance will immediately bring the reader's attention to your message or supporting data. On occasion, the series are closely spaced and a label may require a short line or arrow (as is done with dialog bubbles in comic strips) to make clear what it points to. Labels are particularly useful when identifying events, changes in process, or other cause-effect relationships, as in Effect of Employee Dress on Sales. Lastly, but much less frequently, labels are sometimes used to provide more information than would fit conveniently in the legend or title.

Clarity of Concept Communication

Embedding:
In some cases a graph may convey your point to the viewer, but it would be aided by supporting material. In such situations, the inclusion of other material is of great benefit. The most common aids are pictures, diagrams, and in-sets. In-set graphs are used to show a detail which is not easily perceived when viewing the entire data set(s). Diagrams and figures are used when the association between the data and trend(s) needs to be (or would benefit from being) clarified, reinforced, or explained in detail.

[example of graph with inset and detail]

The Stevens Pure Aluminide Coating (180 min.) graph is a prime example of this concept, albeit taken to its limit. The graph consists of 14 series tracing the elemental profiles of 10 elements in a sample--4 elements are repeated by a different technique. The sample is known to have 3 regions, listed at the top of the graph. And, while two vertical lines representing the boundaries would have sufficed, a picture showing the character of the measured area was taken and sized to match the data scale before being set as the background for the plot area. In addition to nicely delineating the zones in the sample, the detail shown in the micrograph provides a level of correlation possible in no other way.

Composites / overlays:
Every action on your part and every element of the graph should be focused on clearly communicating your message to the viewer. Sometimes two or three concepts may be conveyed on a single graph, but that should be avoided unless they are so tightly intertwined that separating them is nearly impossible or would detract from the full message. As the graph nears completion, each element should be re-evaluated for its contribution to the overall message. In the Stevens Pure Aluminide Coating (180 min.) graph, two concepts are being conveyed. The first is the profile of each measured element, evaluated on both its position in the sample and its relationship to the other elements. Contributing to the evaluation, a micrograph was taken of the measured area and scaled to fit as the background of the plot area. The second is the comparison of two measurement techniques--the four most significant elements were remeasured using a second technique. These four repeated measurements need to be on this particular graph because they demonstrate that the first technique has difficulty measuring elemental concentrations which change rapidly and are non-homogeneous--as they are at the zone boundaries.

Statistics and Comparisons

If comparisons are to be drawn from your graph(s), be careful to show the data in a complete light. An adaptation of data from the Broad Street cholera epidemic of 1854 demonstrates how the exact same data can be presented to support different assertions.

In the first graph, above, the deaths are tallied by day, and it appears the removal of the pump handle had no effect.

In the second graph, above, the deaths are tallied by week of the month (1st, 2nd, 3rd week of September) and it appears there was a surge of deaths, with a reduction in following weeks--which may or may not be attributable to the pump handle being removed.

In the third and last graph, above, the deaths were tallied by "work" week (Sunday through Saturday), and it appears deaths would have continued to climb were it not for the removal of the Broad Street pump handle.

Obviously, the choice of data "basket" size was important to choosing and conveying the "right" message to the viewer. Incidentally, the 3-5 day incubation period of the disease needs to be factored into any and all of the graphs, too. This last fact would help mitigate confusion--intended or otherwise--in the messages of the first and second graphs.
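
The effect of "basket" size is easy to reproduce with a few lines of code. The daily counts below are invented purely for illustration and are NOT the historical epidemic figures; the function name and the `offset` parameter are likewise assumptions of this sketch.

```python
def tally(daily_counts, bin_size, offset=0):
    """Sum daily counts into bins of bin_size days.

    offset is the number of leading days placed in a short first bin,
    which shifts where every later bin begins.
    """
    bins = []
    if offset:
        bins.append(sum(daily_counts[:offset]))
    for start in range(offset, len(daily_counts), bin_size):
        bins.append(sum(daily_counts[start:start + bin_size]))
    return bins


# Made-up daily counts for illustration only (two weeks of data)
daily = [2, 3, 5, 9, 14, 20, 18, 15, 16, 12, 10, 9, 7, 5]

print("by day:          ", tally(daily, 1))
print("by 7-day block:  ", tally(daily, 7))
print("7-day, shifted 3:", tally(daily, 7, offset=3))
```

With these made-up numbers, the unshifted weekly tally shows two nearly equal weeks, while the shifted tally shows a sharp surge followed by a steep drop. The totals are identical in every case; only the bin boundaries moved.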

There are other, easier techniques for visual subterfuge. The two most common are scale truncation and dimensional manipulation.

Scale Truncation:
In scale truncation, part of one or both scales has a section between zero and the series removed. This has the effect of amplifying the character and features of the plots. An egregious example of this may be observed in the # People Wearing Ties and Revenues vs. Week of 2004 graph. Here the left y-axis starts at zero and the right y-axis has had over 1/2 of its range removed, over-emphasizing the relationship shown. This graph would be a case of either intentional misrepresentation or severe ignorance.

The waters here can be somewhat murky, as this practice is related to both our previous discussion on Accuracy versus Resolution and Clarity of Concept Communication above. It is sometimes the case that no deception was intended, just good clarity of plotted series. Generally, if the space between zero and where a plotted series begins exceeds 60-70% of the total graph size, it is time to start thinking about scale truncation. However, in a scale with no absolute starting point, e.g. time, date, or category list, this is a moot point, as the scale is begun where it is needed.
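
The 60-70% rule of thumb can be checked numerically before deciding. The sketch below assumes non-negative data plotted on an axis that runs from zero to the largest value, and uses 0.65 as an arbitrary midpoint of the 60-70% range; both function names are hypothetical.

```python
def empty_fraction(values):
    """Share of a zero-based axis taken up by the empty band below the smallest value."""
    lo, hi = min(values), max(values)
    if lo < 0 or hi <= 0:
        raise ValueError("sketch assumes non-negative data with a positive maximum")
    return lo / hi


def consider_truncation(values, threshold=0.65):
    """True when the empty band exceeds the threshold (assumed midpoint of 60-70%)."""
    return empty_fraction(values) > threshold


# Hypothetical series: one hugging the top of the scale, one spread across it
print(consider_truncation([80, 90, 100]))
print(consider_truncation([10, 50, 100]))
```

A result of True is only a prompt to think about truncation; if the scale is cut, the break should be clearly marked so the viewer is not misled about the size of the changes.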

[For technical accuracy, it should be noted that there are a few scales which are routinely truncated. The most common--by far--is the temperature scale. The true "zero" of temperature is far below 0 degF or 0 degC. This is considered acceptable, due to the social acceptance of the 0 values near the freezing point of water, a very important reference point in our ecosystem and environments.]

Dimensional Manipulation:
The second type of visual subterfuge causes dimensional confusion. In an effort to make simple graphs visually interesting to the public, media members often employ graphics and representative icons. They often size the sub-graphics on one scale to show the change in values, according to the data. However, as in the [graph] example below, they scale up the entire image, not just the one axis, so a value that is x times larger is drawn with up to x^2 times the area.

[example of graph with scaled graphics]

If we simplify the above image with squares/cubes for the icons used, we get the graph below.

[example of graph with scaled squares for viewer reference]

Hence, the careless (or intentional) use of single-dimensional scaling on a 2-D graphic has visual repercussions because of how our minds automatically process visual information. It takes effort to remove this error and interpret the data correctly--although it should not! This effect is even more marked when the graphics used are drawn in perspective (3-D).
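
The arithmetic behind the distortion is short enough to write out. In the sketch below, both function names are made up for illustration. The first gives the area ratio the eye sees when an icon's width and height are both multiplied by the data ratio. The second gives the linear factor to use instead if the icon's area is meant to track the data.

```python
import math


def perceived_area_ratio(value_ratio):
    """Area ratio seen when width AND height are both scaled by value_ratio."""
    return value_ratio ** 2


def honest_linear_scale(value_ratio):
    """Linear scale factor that makes the icon's AREA grow by value_ratio."""
    return math.sqrt(value_ratio)


# A value that tripled, drawn by tripling both the width and the height of the icon
print(perceived_area_ratio(3))
# Side-length factor that would make the area triple instead
print(round(honest_linear_scale(3), 3))
```

For an icon drawn as a solid in perspective, the same reasoning applies with the cube and the cube root, which is why 3-D graphics exaggerate even more.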
