A histogram is a graphical representation of the distribution of numerical data.
It gives an estimate of the probability distribution of a continuous variable and was introduced by Karl Pearson.
Components of a Histogram
1. Bins:
These are the consecutive, non-overlapping ranges of values into which the data is grouped. The bin widths can be uniform or can vary, depending on the data distribution.
2. Frequency:
The height of each bar represents the frequency, which is the number of data points that fall within that bin's range.
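The two components can be illustrated with a minimal Python sketch. The data values, the bin edges, and the helper name `bin_frequencies` are all hypothetical, chosen only for illustration. The sketch also assumes a common convention in which each bin includes its lower edge but not its upper edge, except for the last bin, which includes both.

```python
# Hypothetical sample data and bin edges, chosen only for illustration.
data = [1, 2, 2, 3, 5, 6, 6, 6, 8, 9]
bin_edges = [0, 4, 8, 12]  # three bins of equal width 4


def bin_frequencies(values, edges):
    """Count how many values fall into each bin defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            # Assumed convention: bins are half-open [lo, hi);
            # the last bin also includes its upper edge.
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts


frequencies = bin_frequencies(data, bin_edges)
for i, count in enumerate(frequencies):
    print(f"bin from {bin_edges[i]} to {bin_edges[i + 1]}: frequency {count}")
```

Values outside the outermost edges are simply not counted in this sketch; other tools may handle them differently.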
Steps to Construct a Histogram
To construct a histogram, follow these steps:
1. Determine the data range:
Identify the minimum and maximum values in the dataset.
2. Choose the number of bins:
Decide how many bins (intervals) you want to divide your data into. Rules of thumb such as Sturges' formula, the square root rule, and the Freedman-Diaconis rule can help determine this.
3. Calculate bin width:
Divide the total range (max value - min value) by the number of bins to get the width of each bin.
4. Assign data to bins:
Count how many data points fall into each bin's range.
5. Draw the histogram:
Plot the bins along the horizontal axis and the frequency of data points along the vertical axis. Each bin is represented by a bar whose height corresponds to the frequency.
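The five steps above can be sketched in plain Python. This is only an illustrative sketch under a few assumptions: the function name `build_histogram` and the sample data are hypothetical, the bins have equal width, the maximum value is placed in the last bin, and the "drawing" is a simple text rendering rather than a real plot.

```python
def build_histogram(data, num_bins):
    """Return (edges, counts) for an equal-width histogram of data."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")

    # Step 1: determine the data range.
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("all values are equal, so the range is zero")

    # Step 2 is the caller's choice of num_bins.
    # Step 3: calculate the bin width.
    width = (hi - lo) / num_bins
    edges = [lo + i * width for i in range(num_bins + 1)]

    # Step 4: assign each data point to a bin.
    counts = [0] * num_bins
    for x in data:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    return edges, counts


# Hypothetical data, chosen only for illustration.
data = [2, 3, 5, 7, 8, 8, 9, 10, 12, 12]
edges, counts = build_histogram(data, num_bins=5)

# Step 5: draw the histogram (bins as rows, frequency as bar length).
for i, count in enumerate(counts):
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f} | {'#' * count}")
```

Because the bin index is computed with floating-point division, a value lying exactly on a bin edge could in principle be assigned to a neighboring bin for some inputs; plotting libraries typically take care of such details.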
For example, a histogram of normally distributed data with a mean of 0 shows the data spread symmetrically around the mean, with most values clustered near the center and fewer values as you move away from the center.
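The bin-count rules named in step 2 can be compared on a normally distributed sample with mean 0. The sketch below is a hedged illustration: the function names, the sample size, and the random seed are assumptions, and it uses one common form of each rule (Sturges: ceil(log2(n)) + 1 bins; square root: ceil(sqrt(n)) bins; Freedman-Diaconis: bin width 2 * IQR / n^(1/3)). Quartile conventions differ between tools, so the Freedman-Diaconis result may vary slightly from other implementations.

```python
import math
import random
import statistics


def sturges_bins(n):
    """Sturges' formula: number of bins from the sample size n."""
    return math.ceil(math.log2(n)) + 1


def sqrt_bins(n):
    """Square root rule: number of bins from the sample size n."""
    return math.ceil(math.sqrt(n))


def freedman_diaconis_bins(data):
    """Freedman-Diaconis rule: bin width from the interquartile range."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    width = 2 * (q3 - q1) / n ** (1 / 3)
    if width == 0:
        return 1  # fallback assumed here for data with zero IQR
    return math.ceil((max(data) - min(data)) / width)


# Hypothetical sample: 1000 draws from a normal distribution with mean 0.
random.seed(0)
sample = [random.gauss(0, 1) for _ in range(1000)]

print("Sturges:          ", sturges_bins(len(sample)))
print("Square root:      ", sqrt_bins(len(sample)))
print("Freedman-Diaconis:", freedman_diaconis_bins(sample))
```

The rules generally give different answers for the same data, which is why the number of bins is ultimately a judgment call.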
Types of Histograms
There are several types of histograms based on the nature of the data and the way data is grouped:
1. Uniform Distribution Histogram:
Here, the data is uniformly spread across the range, showing a roughly equal number of data points in each bin.
2. Variable Width Histogram:
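This histogram has bins of different widths. Because a wider bin tends to capture more data points, the bar height is usually scaled to the frequency density (frequency divided by bin width), so that the area of each bar, rather than its height, represents the frequency.
This representation can be useful for highlighting how data clusters differently across varying ranges.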
3. Symmetric Distribution Histogram:
The data is symmetrically distributed around a central value, typical of a normal or Gaussian distribution.
4. Right-Skewed Distribution Histogram:
The data has a long tail on the right side: most data points are clustered at lower values, while a few extend far to the right.
5. Bimodal Distribution Histogram:
This shows two peaks or modes, indicating that the data has two prevalent values around which data points cluster.
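For the variable width histogram in particular, a short sketch can show how bar heights are commonly adjusted to unequal bin widths. It assumes the usual frequency-density convention (height = frequency / bin width), and the bin edges and counts are hypothetical values chosen only for illustration.

```python
# Hypothetical bin edges with unequal widths, and the count in each bin.
edges = [0, 10, 20, 50, 100]
counts = [5, 15, 30, 10]

widths = [edges[i + 1] - edges[i] for i in range(len(counts))]

# Frequency density: the bar height used when bin widths differ.
densities = [count / width for count, width in zip(counts, widths)]

for i, density in enumerate(densities):
    # The area of each bar (height * width) recovers the original frequency.
    area = density * widths[i]
    print(f"{edges[i]:3d} to {edges[i + 1]:3d}: "
          f"count={counts[i]:2d}, height={density:.2f}, area={area:.1f}")
```

In this example the bin from 20 to 50 has the largest count, but the bin from 10 to 20 has the tallest bar, because its points are packed into a narrower range.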