Statistics

We interpret large amounts of information on a daily basis, more or less consciously. The media has become relatively sophisticated at using graphs and ratios to reduce large amounts of data down to understandable information. The part of statistics that describes data with estimates and graphs is called descriptive statistics.

Using descriptive statistics is often not an option, but a necessity. When you join a company, you will probably have to handle large amounts of data in the form of a spreadsheet, such as an Excel spreadsheet. Whether you are dealing with financial ratios or raw data, the task is basically the same: to reduce the data to essential information and present it in an intelligible way.

  • Descriptive statistics involves reducing data to essential information and presenting it in an intelligible way.

Please note that the calculations we perform in this section can all be made with the Statlearn program.

How to find point estimates

As we said in the section titled The ABCs of Statistics, statistics draws a distinction between point estimates and population parameters. Just as words can describe the features of a face, point estimates and population parameters can describe the characteristics of a sample or population. This is not particularly relevant if you’re working with small amounts of data, but if you’re sitting down with 30,000 rows of data in a spreadsheet and tracking defects, point estimates give you quick and valuable insights.

Suppose you work as an equity analyst at a Danish bank. You have been asked to conduct a risk analysis for three shares. The aim is to investigate how stock prices for Microsoft, Nike and Danisco, respectively, have developed in the period spanning January to July of 2008. The results of the analysis must be used to advise a client who wants a share with a low risk profile.

Let us take a quick look at the equity prices in Table 2. We can readily see that all shares have shown some degree of variation during this period. Should we dig deeper and identify the shares that experienced the largest price fluctuations, and were thus the most risky, it would become a little more difficult to assess risk by simply reading the numbers in the table.

To develop a basis for the comparison of these three shares, we can start by calculating the average price. The average is also called the expected value. The expected value is a measure of the central value in the dataset, hence the term “mean.”

Selecting the Average or Median

An average should be used with the caveat that data need to be part of a relatively normal distribution, as shown in distribution A. In chapter 5, we will elaborate on the importance of normal distributions. So far, we can just note that data may be distributed differently, as shown in Figure 2.

If the distribution of data leans either to the right or left, as in B and C, this suggests that individual items deviate significantly from one another, indicating a skew. In those cases, the median is a more representative measure than the average.

The median is the value of the middle item in a data set sorted from lowest to highest value. In contrast to the average, the median is not influenced by extreme items, as it represents the value of the item in the middle of the data set. The median is not affected by the extraordinarily high or low values that characterize skewed distributions.

Let us consider a simple example. Imagine a city where 99% of households earn £500,000 and the remaining 1% of households earn £100 million. We can reason that the average will be pulled up tremendously by the 1% of households with higher incomes. In this situation, we would have a very right-skewed distribution of income. The average would be too high, and thus, a poor measure of the data set’s key value. It is, therefore, important to determine the extent to which your data are normally distributed when the average is used.
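The income example can be checked with a few lines of Python. This is a minimal sketch that assumes a city of exactly 100 households, which is our own simplification to make the percentages concrete.

```python
from statistics import mean, median

# Assumed city of 100 households: 99 earn £500,000 and 1 earns £100 million.
incomes = [500_000] * 99 + [100_000_000]

average_income = mean(incomes)    # pulled up by the single extreme household
median_income = median(incomes)   # value of the middle item, unaffected by it

print(f"Average: £{average_income:,.0f}")
print(f"Median:  £{median_income:,.0f}")
```

Under this assumption the average is roughly three times the income that 99% of households actually earn, while the median stays at £500,000.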

To assess whether or not the data are approximately normally distributed, a useful first step is to measure the level of skew. The degree of skew indicates how skewed or symmetrical your data are. The formula for determining skew is shown in the appendix to this section.

The distribution of data will be skewed to the left if the result is a negative number. If it is a positive number, the data will be skewed to the right. If the data follow a completely normal distribution, the result will be close to 0.
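To illustrate how the sign of the result should be read, here is a minimal sketch of a moment-based skew calculation. It is one common definition and is not necessarily the formula in the appendix, which may include a correction for small samples. The three data sets are hypothetical.

```python
from statistics import mean

def skewness(data):
    """Moment-based skew: third central moment divided by the
    second central moment raised to the power 1.5."""
    n = len(data)
    avg = mean(data)
    m2 = sum((x - avg) ** 2 for x in data) / n
    m3 = sum((x - avg) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical data sets chosen to show the three cases.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]        # one extreme high value
left_skewed = [-20, -3, -2, -2, -1, -1]   # one extreme low value

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(skewness(data), 2))
```

The symmetric data give a result of 0, the data with an extreme high value give a positive result, and the data with an extreme low value give a negative result.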

For a more exact assessment of whether data are normally distributed—a hypothesis test—see chapter 10.

Returning to the example of price trends for the three shares, if we assume that price movements are normally distributed, then the average prices are as given below:

Now that we have identified the average, the next step is to look at the variation in prices. This information is essential for comparing the risks of investing in the three shares. One of the most common measures of variation is the standard deviation.

The standard deviation can be interpreted as the most “normal” (typical) deviation from the average. More specifically, it roughly represents the average distance between an item and the average.
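The calculation can be sketched step by step in Python. The prices below are hypothetical and are not the figures from Table 2. We assume the sample version of the formula, which divides by n - 1.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical monthly closing prices (NOT the data from Table 2).
prices = [32.0, 28.0, 30.0, 29.0, 31.0, 27.0, 33.0]

avg = mean(prices)
squared_deviations = [(p - avg) ** 2 for p in prices]

# Sample standard deviation: divide by n - 1 because the prices are a sample.
s_manual = sqrt(sum(squared_deviations) / (len(prices) - 1))

# The standard library gives the same result in one call.
s_library = stdev(prices)

print(round(s_manual, 2), round(s_library, 2))
```

With these assumed prices the average is 30 and the standard deviation is about 2.16, so a typical price lies roughly 2 units away from the average.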

EXAMPLE: Standard deviation of Microsoft’s stock price:

Note: Data are from Table 2

The standard deviations for all three stock prices are listed below:

By calculating these standard deviations, we have quantified the average price fluctuations. At first, it seems that Nike is the share that has the greatest price fluctuations. This conclusion is only partially correct: we need to see the standard deviation in relation to the size of the average.

In the following example, we compare two distributions, X and Y. Distribution X has a standard deviation of 4 and an average of 10. Although distribution Y has a much larger standard deviation of 100, X has the higher relative variability, because its standard deviation is large in relation to its average.

To make the fluctuations of the three shares comparable, we can calculate the coefficient of variation (CV), which is the standard deviation divided by the average:

Looking at the coefficients of variation, we can clearly see that the price of Microsoft has the largest relative fluctuations. Statistically speaking, it ranks as the most risky share. Since a myriad of factors affect the price of a stock, descriptive statistics cannot stand alone. However, as a tool to quantify general trends and make various shares comparable, descriptive statistics can be extremely valuable.
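The coefficient of variation is simply the standard deviation divided by the average. The sketch below uses the figures for distribution X (standard deviation 4, average 10). For distribution Y only the standard deviation of 100 is known, so the average of 1,000 is an assumption made purely for illustration.

```python
def coefficient_of_variation(std_dev, average):
    """Return the standard deviation as a fraction of the average."""
    return std_dev / average

cv_x = coefficient_of_variation(4, 10)        # distribution X
cv_y = coefficient_of_variation(100, 1_000)   # Y: the average of 1,000 is assumed

print(f"CV of X: {cv_x:.0%}")
print(f"CV of Y: {cv_y:.0%}")
```

Even though Y has the larger standard deviation, X fluctuates more relative to its own average under this assumption.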

Alternative Measures of Dispersion

Just as it is important to use the median instead of the average when dealing with skewed distributions, it is important to think critically when deciding whether to use the standard deviation. Standard deviation should only be used with data that approximately follow a normal distribution. If data follow a skewed distribution, you should instead use the interquartile range, also known as the IQR, an alternative measure of dispersion.

IQR

The interquartile range is based on the same logic as the median. Thus, unlike the mean and standard deviation, it is not sensitive to extreme items in the data.

The interquartile range is the difference between the third and first quartiles.

By taking the distance between the first and third quartiles, the interquartile range becomes a stable measure. This is because the interquartile range is not affected by extreme items in either the lowest range, from the minimum to the first quartile (x-value: 26-33), or the highest range, from the third quartile to the maximum (x-value: 37-43). See the following box and whisker chart.
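The stability of the interquartile range can be demonstrated with a small sketch. The values are hypothetical and only loosely inspired by the chart. We assume Python's `statistics.quantiles` with its default "exclusive" method as the way of locating the quartiles.

```python
from statistics import quantiles, stdev

def iqr(data):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(data, n=4)  # default method is "exclusive"
    return q3 - q1

# Hypothetical values; the second list replaces the maximum with an extreme item.
values = [26, 30, 33, 34, 35, 36, 37, 40, 43]
with_outlier = values[:-1] + [430]

print("IQR:               ", iqr(values), iqr(with_outlier))
print("Standard deviation:", round(stdev(values), 1), round(stdev(with_outlier), 1))
```

The interquartile range is identical for both lists, whereas the standard deviation grows many times over because of the single extreme item.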

Quartiles

When working with large data sets, quartiles can help provide an overview of the values of the observations. When data are sorted from smallest to largest, quartiles are used to divide data into four groups.

Calculation of quartiles: (n + 1) × K/4 where “K” represents the quartile number (1, 2, or 3). The result is the position of the quartile in the sorted data.

EXAMPLE

The following example is based on a list of items which shows the height of 15 randomly selected people. We will determine the value of the first quartile.

1. First quartile calculation: (15 + 1) × (1/4) = 4

The first quartile is thus the value of the fourth item, which is 167 cm. In other words, 167 cm is the highest value among the first 25% of the observations. Similarly, the second quartile corresponds to the value of the middle observation, which is 172 cm (median).

If your quartile calculations do not end up being whole numbers, such as 30.5 for the first quartile, you must choose the value that lies in the middle of observations 30 and 31. So, if observation 30 = 180 cm and observation 31 = 190 cm, the value of the first quartile is 185 cm (the average).
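The position rule can be sketched as a small Python function. The list of heights is hypothetical. It is constructed only so that the fourth item is 167 cm and the eighth is 172 cm, as in the example. For positions that are not whole numbers we assume linear interpolation, which reduces to the plain average of the two neighbours when the position ends in .5.

```python
def quartile(sorted_data, k):
    """Value at position (n + 1) * k / 4, counting positions from 1.

    Assumes the data are sorted and contain at least 3 items.
    """
    n = len(sorted_data)
    position = (n + 1) * k / 4
    lower = int(position)          # position of the item just below
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)

# Hypothetical heights in cm for 15 people, sorted from lowest to highest.
heights = [160, 163, 165, 167, 168, 169, 170, 172,
           174, 175, 177, 179, 181, 184, 188]

print(quartile(heights, 1))               # position 4
print(quartile(heights, 2))               # position 8 (the median)
print(quartile([170, 180, 190, 200], 2))  # position 2.5, between 180 and 190
```

The last call shows the midpoint case: the position falls between the items 180 and 190, so the function returns their average, 185.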

Percentiles

Suppose you had taken a statistics exam and wanted to see your grade relative to the grades of the other students. If you were in the 70th percentile, that would mean that 70% of students received a grade that was lower than yours—or, conversely, that you were among the 30% who received top marks. This allows us to quickly consider a single item (one grade) relative to all of the items (all grades). Percentiles are calculated in the same manner as quartiles. Instead of quartiles (K), the percentage is used (P)—see below.

Calculation of percentile: (n + 1) × P/100 where “P” represents the percentile
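The percentile position rule can be sketched in the same way. The exam scores are hypothetical, and the interpolation used for fractional positions, as well as the clamping of positions that fall outside the data, are assumptions of this sketch.

```python
def percentile(sorted_data, p):
    """Value at position (n + 1) * p / 100, counting positions from 1."""
    n = len(sorted_data)
    position = (n + 1) * p / 100
    position = min(max(position, 1), n)   # keep the position inside the data
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)

# Hypothetical exam scores for 9 students, sorted from lowest to highest.
scores = [41, 48, 55, 60, 62, 67, 73, 78, 90]

p70 = percentile(scores, 70)   # position (9 + 1) * 70/100 = 7
p25 = percentile(scores, 25)   # position 2.5, halfway between items 2 and 3
print(p70, p25)
```

With these scores the 70th percentile is the seventh item, 73, and the 25th percentile lies halfway between 48 and 55.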

Kurtosis

Just as skew is used to measure distributional symmetry, kurtosis is used to measure how steep (peaked) a distribution is. Just like skew, kurtosis can give us valuable insights into the properties of a distribution. This can be especially beneficial in situations where many variables are involved, such as equity research that compares the price trends for many companies.

Calculation of Kurtosis:

Kurtosis indicates whether a distribution is relatively sharp or flat compared to a normal distribution. A positive kurtosis means that a distribution is relatively peaked (leptokurtic distribution) while a negative kurtosis (platykurtic distribution) indicates a flat distribution. A perfectly normal distribution (mesokurtic distribution) will have a kurtosis of 0.
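The sign convention can be illustrated with a minimal sketch of a moment-based kurtosis calculation. We assume the "excess" form, in which 3 is subtracted so that a normal distribution gives 0. The formula used in this book may differ, for example by including a small-sample correction. Both data sets are hypothetical.

```python
from statistics import mean

def excess_kurtosis(data):
    """Fourth central moment divided by the squared second central
    moment, minus 3 so that a normal distribution gives 0."""
    n = len(data)
    avg = mean(data)
    m2 = sum((x - avg) ** 2 for x in data) / n
    m4 = sum((x - avg) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

# Hypothetical data sets.
peaked = [10] * 8 + [0, 20]    # most items equal the average, two lie far away
flat = list(range(1, 11))      # items spread evenly from 1 to 10

print(round(excess_kurtosis(peaked), 2))   # positive: leptokurtic
print(round(excess_kurtosis(flat), 2))     # negative: platykurtic
```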

In relation to stock analysis, a peaked distribution is a sign that a relatively high number of items have the same value as the average, and the remaining items are relatively dispersed from the average. Conversely, a flat distribution indicates that many items are spread around the average, and thus, ceteris paribus, the stock is less prone to large price fluctuations.

Point Estimates for Grouped Data

When working with large volumes of data, such as in market analysis, observations are often divided into intervals to provide an overview. When we work with observations grouped into intervals, we do not know the exact value of any observation, but we know that it can assume any value within a given interval.

From this table, we have a good overview of how the observations are distributed within each range. It appears, for example, that most observations are between 10,000 and 20,000.

This overview is achieved at the expense of detailed information about the value of each observation. We do not know, in other words, the exact value of each of the 93 items in the range of 0-10,000. The only thing we know about the observations is that they lie inside the interval.

For this reason, we cannot calculate the average as previously described, because that method requires that we know the exact value of each observation.

Alternatively, we can use the interval midpoint as a substitute for the real value. This approach has an obvious weakness when data are not normally distributed. In this case, the observations are predominantly located at one end of the range, which means that the interval midpoint (Mi) is not representative.

The standard deviation for a grouped data set (sample):
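The midpoint approach can be sketched in Python. Each interval midpoint Mi is weighted by the number of observations in the interval. Only the 93 observations in the first interval come from the text. The other frequencies are hypothetical. We assume the sample formula, which divides by n - 1.

```python
from math import sqrt

# (lower bound, upper bound, frequency). Only the count of 93 is from the
# text; the other two frequencies are hypothetical.
intervals = [
    (0, 10_000, 93),
    (10_000, 20_000, 150),
    (20_000, 30_000, 57),
]

n = sum(freq for _, _, freq in intervals)
midpoints = [(low + high) / 2 for low, high, _ in intervals]
frequencies = [freq for _, _, freq in intervals]

# Grouped mean: every observation is treated as if it sat at the midpoint.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped sample standard deviation: frequency-weighted squared deviations
# of the midpoints from the grouped mean, divided by n - 1.
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)
grouped_sd = sqrt(grouped_variance)

print(round(grouped_mean), round(grouped_sd))
```

Because every observation is replaced by its interval midpoint, both results are approximations of the values we would obtain from the raw data.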

Summary of Point Estimates

In Summary:

Just as words can describe a face, point estimates can describe data. This is not particularly relevant if you work with a small amount of data, but if you are one day sitting with 30,000 rows of figures in a spreadsheet and don’t have an overview, point estimates can give you quick and valuable insights.

The average and standard deviation are useful for indicating the data set midpoint and the range in which we can expect most of our observations to lie. If our data are normally distributed, we can use the mean and standard deviation to identify the interval in which approx. 68% of our items will lie (the mean plus or minus one standard deviation).

A prerequisite for using the mean and standard deviation is that the data need to be relatively normally distributed. It is always a good starting point to calculate the skew in order to investigate the extent to which the data are symmetrically distributed. If the data are skewed, the median and interquartile range should be used as alternatives to the mean and standard deviation.

Graphs: Illustrations of Data

As we discussed in the previous section, point estimates are suitable for describing large amounts of data, using key figures such as mean and standard deviation. Graphs serve the same purpose. The approach here is simply to present data visually, with an emphasis on simple communication. The strength of graphs is that most people can interpret a visual representation of data, while fewer are aware of the importance of concepts like standard deviation and interquartile range.

In the following section, we will review the most frequently used graphs. The chapter will conclude with a discussion of the pitfalls of graphs, and areas where you need to pay special attention to visual manipulation.

Pie Charts

We see pie charts almost every day in newspapers and on television. The pie basically represents the entire data set, which is then broken down into various categories within the pie. Pie charts are very intuitive when few categories are compared, but if you increase the number of categories, you can lose track of them quickly. This is partly because color nuances can be difficult to separate. Another reason is that a pie chart indicates values using the angle of the segments forming the pie.

Interestingly enough, we are physiologically better equipped to distinguish the differences between vertical lines than the differences between angles. A bar chart can better illustrate small differences than a pie chart, as evidenced by the two figures below.

Pictograms

If you want to represent data as visually as possible, pictograms are the obvious choice. Pictograms are particularly well suited to communicating clear trends. Data values will often be highly simplified. An example might be a situation in which car sales have risen sharply over the years, as illustrated in the figure below. The disadvantage of these types of diagrams is that they are a bit vague and are not quite drawn to scale. See the pictogram for 2007 car sales in the figure below: it is difficult to assess actual car sales.

Bar Graphs

Bar graphs are fairly self-explanatory, as you can see from the figures below.

There are a few issues you should be aware of. The width of the columns and the distance between them do not matter. Only the height of each column represents its value.

On a bar graph, the y-axis should generally start with the value 0. However, it might be a good idea to let the bar graph start at a higher value in order to clarify the difference between the columns. The difference between the individual columns will thus be visually reinforced, so if you choose this approach, it is extremely important that you comment that the bar graph does not start at 0.

Excel allows you to adjust the width and distance between columns and specify a starting value for the y-axis. 

Line Charts

Line charts, also known as line graphs, are well suited for showing development over a longer period. Line charts are widely used when reporting stock prices.

A unique property of the line graph is that it can be reduced to the size of a stamp without sacrificing a significant amount of information.

The eye can quickly decode a sequence represented by a line. With just a few supporting points indicating the minimum and maximum price added, it is relatively simple to put the development for the entire period into perspective.

Histograms

Histograms are often confused with bar graphs. However, if we look closely at the graphs, we will notice some important differences. Unlike with a bar graph, the width of columns is significant when reading a histogram. The X-axis is based on a numerical scale, which assigns each interval a specific value. The Y-axis indicates the number of items in each interval. There is often no title assigned to the Y-axis. A histogram always illustrates the number of items as measured by frequency or probability.

Index Numbers and Charts

We see index numbers in many contexts. For example, they are often used to illustrate the development of housing markets, where they are used to describe the relative price performance for a specific year. The advantage of index numbers is that they convert each particular development into a value that is comparable with other index numbers.

Let us consider a simple example: Companies that perform well usually have increasing revenues, but this is greatly affected by inflation. When we analyze sales, it might be interesting to investigate whether the growth rate has only kept up with inflation or whether there has been real growth, which simply means that revenue has increased more than inflation. Since inflation is a macroeconomic concept, you cannot directly compare it with a company’s growth rate using ratios. But, if we use an index showing the evolution of both the inflation rate and growth rate, we can actually compare them, as seen in the figure below.

The graph shows that growth has kept pace with inflation and that the growth rate has been only marginally higher than inflation. This suggests that the company’s growth has been helped by a general increase in prosperity within society. From here, it is not far-fetched to assume that this same relationship will hold if inflation falls. Now it is possible to discuss whether revenue growth is real or just driven by inflation.

Simple vs. Composite Index

When we talk about an index, we need to distinguish between a simple and composite index. A composite index is used, for example, when the price of an entire group of items must be compared internationally. As consumers have many options for spending their money, it is necessary to refine the index to represent a wide range of products.

Let us first consider an example involving a simple index in a society where the only product you can buy is bread.

If the price of bread rises from £12 to £15, it is not quite enough to say that the price of bread has risen by £3. This does not tell us how large the relative increase has been. If the cost of bread rises from £100 to £103, the nominal price increase of £3 would be the same, while the relative price increase would be about eight times smaller (3/100 = 3% versus 3/12 = 25%).

Thus, there is a need to measure relative price performance. This is the essence of an index. An index shows us the relative change of a variable over a certain period, e.g., the price of bread over the past 5 years or population growth over the past 10 years.

Be aware that an index is usually expressed as a percentage. The starting point, known as the base year, is always 100%. This means index values over 100 represent an increase over the base year. Index values below 100, conversely, represent a decrease compared to the base year. Our index in Figure 13 reveals that the price increase from 2001 to 2002 was 11%.

But, when we look at the trend from 2002 to 2003, we cannot simply subtract one index number from the other and say that the price increased by 6%. Index numbers are always expressed relative to the base year. So, to calculate the change from 2002 to 2003, we need to consider the values from those two years relative to one another. To do this, we need to divide the index number for 2003 by the index number for 2002:

1.17/1.11 = 1.054. This shows an actual price increase of 5.4%.
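The same calculation can be sketched in Python. The index values 111 and 117 follow from the figures quoted above, with 2001 as the base year. The function name is our own.

```python
# Index values relative to the base year (2001 = 100).
index = {2001: 100, 2002: 111, 2003: 117}

def percent_change(index, start_year, end_year):
    """Relative change between two years, found by dividing the index numbers."""
    return (index[end_year] / index[start_year] - 1) * 100

print(round(percent_change(index, 2001, 2002), 1))   # change from the base year
print(round(percent_change(index, 2002, 2003), 1))   # change between two later years
```

Dividing the two index numbers gives 5.4%, not the 6 index points obtained by subtracting one index number from the other.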

Let us broaden the example and assume that consumers in the village can buy products other than bread. To calculate the price index, we must now take into account that the index represents a wide range of groceries and that households do not split their income equally across all product groups. In this context, we can consider a whole group of products called the market basket. The market basket represents what the average household typically purchases. The value of this basket is our starting point for the base year. In subsequent years, we can see how market basket prices have evolved. The problem with this approach, as you may have guessed, is that we assume that people buy the same quantity of goods as in the base year, regardless of price. To calculate a more representative index, we can use two different methods, known as the Laspeyres and Paasche indexes.

Laspeyres Index

The Laspeyres index is based on the assumption that people are continuing to buy the same quantity of goods as they did during the base year. Thus, the only change is in prices. In this sense, the Laspeyres index assumes that the development of the index is driven solely by prices.

Paasche Index

When we use the Paasche index, we use the opposite starting point. That is to say, we assume that people bought the same quantity of goods during the base year as they are buying now. So, if they bought 30 loaves of bread this year, we assume they also bought 30 loaves of bread in the starting year.
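The two indexes can be compared in a short sketch. The basket is hypothetical: apart from the bread prices of £12 and £15 and the 30 loaves mentioned above, all prices and quantities are assumptions. The Laspeyres index values the base-year quantities at both sets of prices, while the Paasche index does the same with the current quantities.

```python
# Hypothetical market basket: item -> (price, quantity).
base_year = {"bread": (12.0, 40), "milk": (5.0, 60)}
current_year = {"bread": (15.0, 30), "milk": (6.0, 70)}

def laspeyres(base, current):
    """Price index using the quantities bought in the base year."""
    cost_now = sum(current[item][0] * base[item][1] for item in base)
    cost_base = sum(base[item][0] * base[item][1] for item in base)
    return 100 * cost_now / cost_base

def paasche(base, current):
    """Price index using the quantities bought in the current year."""
    cost_now = sum(current[item][0] * current[item][1] for item in current)
    cost_base = sum(base[item][0] * current[item][1] for item in current)
    return 100 * cost_now / cost_base

print(round(laspeyres(base_year, current_year), 1))
print(round(paasche(base_year, current_year), 1))
```

In this hypothetical basket, consumers buy less bread after its price rises, so the Paasche index comes out slightly lower than the Laspeyres index.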

Which of these two indexes is better? Well, since both indexes make simplistic assumptions about consumption, you have to ask yourself which of the two simplifications affects you the least.

The Laspeyres index makes the assumption that people buy the same amount of a specific good as they did in the base year. If you calculate the Laspeyres index for a period of 10 years, you assume there has been no change in the consumption of goods over the past 10 years. This assumption may be quite true for certain products, such as toothpaste, but not for goods where sales are influenced by fashion and trends. Thus, the assumption of static consumption can make the index inaccurate.

The Paasche index takes into account that consumption changes. However, it is more time consuming to use the Paasche index than the Laspeyres index. Suppose we were to calculate the Paasche index for a market basket consisting of several hundred products. For each of these products, we would have to gather information on both the price and the quantity being purchased. Additionally, while the base index, which is the index for the base year, is calculated once and for all with the Laspeyres method, it changes every year with the Paasche method. Because the index number of the base year changes yearly, all index numbers will, therefore, change each time that the index is updated with new data.

Descriptive Statistics in a nutshell

Population parameters
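As a reference sketch, the standard textbook definitions of the population mean, variance, and standard deviation are given here. The notation is an assumption: N is the number of items in the population and x_i is the value of item i.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
\sigma = \sqrt{\sigma^2}
```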

Population parameters for grouped data
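For grouped data, the standard approach replaces each observation with its interval midpoint. In this sketch the notation is assumed: k is the number of intervals, f_i is the number of items in interval i, M_i is the interval midpoint, and N is the sum of all f_i.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{k} f_i M_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{k} f_i (M_i - \mu)^2
\qquad
\sigma = \sqrt{\sigma^2}
```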

Point estimates (based on a sample)
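The corresponding standard definitions for a sample are sketched here, together with the coefficient of variation. The notation is assumed: n is the sample size and x_i is the value of item i. Note that the sample variance divides by n - 1 rather than n.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
s = \sqrt{s^2}
\qquad
CV = \frac{s}{\bar{x}}
```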

Point estimates for grouped data (sample)
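For a grouped sample, the standard definitions again use the interval midpoints. The notation in this sketch is assumed: k is the number of intervals, f_i is the number of items in interval i, M_i is the interval midpoint, and n is the sum of all f_i.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i M_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (M_i - \bar{x})^2
\qquad
s = \sqrt{s^2}
```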