Intuitive Curve Fitting
Earlier, we used various means, such as histograms, frequency polygons and ogives, to visualise our data. These are very useful tools to depict univariate data, i.e. data with only one variable such as the height of learners in a class.
Last year we also learnt about a visual tool called the scatter plot. Scatter plots are a common way to visualise bivariate data, i.e. data with two variables. They allow us to identify the direction and strength of a relationship between two variables.
We identify the nature of a relationship between two variables by examining if the points on the scatter plot conform to a linear, exponential, quadratic or some other function. The process of fitting functions to data is known as curve fitting.
The strength of a relationship can be described as strong if the data points conform closely to a function or weak if they are further away.
In the case of linear functions, the direction of a relationship is positive if high values of one variable occur with high values of the other or negative if high values of one variable occur with low values of the other.
The table below summarises the different relationships:
| Strong, positive linear relationship | Strong, negative linear relationship |
| Weak, positive linear relationship | Exponential relationship |
| Quadratic relationship | No relationship |
Example
Question
Examine the scatter plot below of data collected from a new shop:
- What are the two variables being compared?
- What type of function best fits the data?
- Is the relationship between the two variables strong or weak?
- Is the relationship between the two variables positive or negative?
- Using your answers above, describe the relationship between the two variables in one sentence.
Answer
- The variables being compared are average daily number of customers and time in months.
- The data fit an exponential function.
- The data points appear to fit the curve close to perfectly, so the relationship can be described as very strong.
- As time increases, the number of customers increases, so the relationship can be described as positive.
- There is a very strong, positive, exponential relationship between average daily customers and time in the new shop.
In the worked example above, by plotting the average daily customers and time data of a new shop on a scatter plot, we were able to identify the relationship between the two variables. Once we know the relationship between two variables, we are able to do another very useful thing - we are able to predict values where no data exist.
Definition: Interpolation and extrapolation
When we predict values that fall within the range of our data, this is known as interpolation. When we predict the values of a variable beyond the range of our data, this is known as extrapolation.
Extrapolation must be done with caution unless it is known that the observed relationship continues beyond the range of our data. For example, an exponential function may look linear if we only have the first few data points available but if we extrapolate far enough beyond the initial data points, our predictions will be inaccurate.
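The warning above can be illustrated with a small numerical sketch. The data here are hypothetical (they are not from this lesson): we assume the true relationship is the exponential function \(y = 2 \times \text{1.1}^{x}\), but that we only observe it for \(x = 0\) to \(x = 3\). A straight line through the first and last observed points then gives a good estimate inside that range and a poor one far outside it.

```python
# Hypothetical illustration: an exponential relationship that looks
# linear over a short range. The function and numbers are assumptions
# chosen for this sketch, not data from the lesson.

def exponential(x):
    return 2 * 1.1 ** x

# Only the first few data points are available.
xs = [0, 1, 2, 3]
ys = [exponential(x) for x in xs]

# Straight line through the first and last observed points.
m = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
c = ys[0] - m * xs[0]

def line(x):
    return m * x + c

# x = 1.5 lies inside the observed range (interpolation);
# x = 20 lies far outside it (extrapolation).
error_inside = abs(line(1.5) - exponential(1.5))
error_outside = abs(line(20) - exponential(20))

print(f"Error at x = 1.5: {error_inside:.3f}")
print(f"Error at x = 20:  {error_outside:.3f}")
```

The error of the interpolated value is small, while the extrapolated value is off by a large margin, because the exponential curve bends away from the straight line once we leave the observed range.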
In order to interpolate or extrapolate values, we need to find the equation of the function which best fits the data. For linear data, we draw a straight line through the data which best approximates the available data points. This line is known as the line of best fit or trend line. Let us try our hand at this in the following example.
Example
Question
- Use the data below to draw a scatter plot and line of best fit.
- Write down the equation of the line that best seems to fit the data.
- Use your equation to calculate the estimated value for \(y\) if \(x = 4\).
- Use your equation to calculate the estimated value for \(x\) if \(y = 6\).
| \(x\) | \(\text{1.0}\) | \(\text{2.4}\) | \(\text{3.1}\) | \(\text{4.9}\) | \(\text{5.6}\) | \(\text{6.2}\) |
| \(y\) | \(\text{2.5}\) | \(\text{2.8}\) | \(\text{3.0}\) | \(\text{4.8}\) | \(\text{5.1}\) | \(\text{5.3}\) |
Draw the graph
- Choose a suitable scale for the axes.
- Draw the axes.
- Plot the points.
Drawing the line of best fit
The next step is to draw a straight line which goes as close to as many points as possible. It is generally best to have as many points above the line as below the line.
Calculating the equation of the line
The equation of the line is
\(y=mx+c\)
From the graph we have drawn, we estimate the \(y\)-intercept to be \(\text{1.5}\). We estimate that \(y=\text{3.5}\) when \(x=3\). So we have that points \((3;\text{3.5})\) and \((0;\text{1.5})\) lie on the line. The gradient of the line, \(m\), is given by
\begin{align*} m & = \cfrac{\Delta y}{\Delta x} = \cfrac{{y}_{2}-{y}_{1}}{{x}_{2}-{x}_{1}} \\ & = \cfrac{\text{3.5}-\text{1.5}}{3-0} \\ & = \cfrac{2}{3} \end{align*}
So we finally have that the equation of the line of best fit is
\(y=\cfrac{2}{3}x+\text{1.5}\)
Calculate the unknown values
The equation of the line is \(y=\cfrac{2}{3}x+\text{1.5}\) so in order to find the unknown values, we insert the known values into our equation.
For \(x = 4\):
\begin{align*} y &=\cfrac{2}{3} \cdot 4 +\text{1.5}\\ &\approx \text{4.17} \end{align*}
Since this \(x\)-value is within the data range, this is interpolation.
For \(y = 6\):
\begin{align*} 6 & =\cfrac{2}{3} \cdot x +\text{1.5} \\ \therefore x &= (6 - \text{1.5}) \times \cfrac{3}{2} \\ &= \text{6.75} \end{align*}
Since this \(y\)-value is outside the data range, this is extrapolation.
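The calculations in this worked example can be reproduced with a short Python sketch. It uses the two points read off the hand-drawn line, \((0;\text{1.5})\) and \((3;\text{3.5})\), together with the data table. The function names `predict_y`, `predict_x` and `kind` are illustrative choices for this sketch and do not come from the lesson.

```python
# Two points estimated from the hand-drawn line of best fit.
x1, y1 = 0, 1.5
x2, y2 = 3, 3.5

# Gradient m = (y2 - y1) / (x2 - x1); c is the y-intercept because x1 = 0.
m = (y2 - y1) / (x2 - x1)
c = y1

# The data from the table in the question.
x_data = [1.0, 2.4, 3.1, 4.9, 5.6, 6.2]
y_data = [2.5, 2.8, 3.0, 4.8, 5.1, 5.3]

def predict_y(x):
    """Estimate y from x using y = mx + c."""
    return m * x + c

def predict_x(y):
    """Estimate x from y by rearranging y = mx + c."""
    return (y - c) / m

def kind(value, data):
    """Interpolation if the value lies within the data range, else extrapolation."""
    if min(data) <= value <= max(data):
        return "interpolation"
    return "extrapolation"

y_at_4 = predict_y(4)
x_at_6 = predict_x(6)

print(f"x = 4 gives y = {y_at_4:.2f} ({kind(4, x_data)})")
print(f"y = 6 gives x = {x_at_6:.2f} ({kind(6, y_data)})")
```

Because the line was estimated by eye, a different choice of the two points would give a slightly different gradient and therefore slightly different predictions.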
In the previous worked example, we drew the line of best fit by hand. This can give us a reasonable approximation of which function best fits the data when the data points are close together. However, you may have found that you obtained slightly different answers from one another. In the next section, we will learn about a more precise way of fitting a linear function to data.
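One informal way to check a hand-drawn line against the guideline of having roughly as many points above the line as below it is to count them. The sketch below does this for the line \(y=\cfrac{2}{3}x+\text{1.5}\) and the data from the worked example; it is only a rough check under that guideline, not the more precise method of the next section.

```python
# Data from the worked example.
x_data = [1.0, 2.4, 3.1, 4.9, 5.6, 6.2]
y_data = [2.5, 2.8, 3.0, 4.8, 5.1, 5.3]

def line(x):
    """Hand-drawn line of best fit from the worked example."""
    return (2 / 3) * x + 1.5

# Count the data points lying above and below the line.
above = sum(1 for x, y in zip(x_data, y_data) if y > line(x))
below = sum(1 for x, y in zip(x_data, y_data) if y < line(x))

print(f"Points above the line: {above}")
print(f"Points below the line: {below}")
```

For this particular line the counts are not equal, which shows how a line drawn by eye can sit slightly too high or too low relative to the data.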