Fitting a straight line

This example shows how to find a linear least squares fit for a set of points.

Suppose you have a set of data points that you believe were generated by a process that should ideally be linear. In that case, you might like to find the best parameters m and b to make the line y = m * x + b fit those points as closely as possible.

Gun ownership vs gun deaths

A common approach to this problem is to minimize the sum of the squares of the vertical distances between the line and the points. For example, suppose the point P0 = (x0, y0) is one of your data points. The vertical error squared for that point is the difference between y0 and the line’s Y coordinate for that X position. In this case, that’s y0 – (m * x0 + b). To calculate the total error squared, square this error and add up the errors squared for all of the points.

Keep in mind that you know all of the points so for given values of m and b you can easily loop through all of the points and calculate the error.

Here’s a function that does just that:

  ; Return the error squared:
  Func ErrorSquared($aPoints, $M, $B)
    Local $Total = 0
    For $I = 1 To $aPoints[0]
      Local $aPoint = $aPoints[$I]
      Local $Dy = $aPoint[2] - ($M * $aPoint[1] + $B)
      $Total += $Dy ^ 2
    Next
    Return $Total
  EndFunc

This code loops through the points subtracting each point’s Y coordinate from the coordinate of that line at the point’s X position. It squares the error and adds it to the total. When it finishes its loop, the method returns the total of the squared errors.

As a mathematical equation, the error function E is:

Equation1

where the sum is performed over all of the points (xi, yi).

To find the least squares fit, you need to minimize this function E(m, b). That sounds intimidating until you remember that the xi and yi values are all known – they’re the values you’re trying to fit with the line.

The only variables in this equation are m and b so it’s relatively easy to minimize this equation by using a little calculus. Simply take the partial derivatives of E with respect to m and b, set the two resulting equations equal to 0, and solve for m and b.

Taking the partial derivative with respect to m and rearranging a bit to gather common terms and pull constants out of the sums you get:

Equation2

Taking the partial derivative with respect to b and rearranging a bit you get:

Equation3

To find the minimum for the error function, you set these two equations equal to 0 and solve for m and b. To make working with the equations easier, let:

Equation4

If you make these substitutions and set the equations equal to 0 you get:

Equation5

Solving for m and b gives:

Equation6

Again these look like intimidating equations but all of the S’s are values that you can calculate given the data points that you are trying to fit. The following code calculates the S’s and uses them to find the linear least squares fit for the points in a List(Of PointF).

  ; Find the least squares linear fit:
  Func LineFit($aPoints, ByRef $M, ByRef $B)
  ; Find the values S1, Sx, Sy, Sxx, and Sxy.
  ; Perform the calculation.
    Local $S1 = $aPoints[0]
    Local $Sx = 0, $Sy = 0, $Sxx = 0, $Sxy = 0
    For $I = 1 To $aPoints[0]
      Local $aPoint = $aPoints[$I]
      $Sx += $aPoint[1]
      $Sy += $aPoint[2]
      $Sxx += $aPoint[1] ^ 2
      $Sxy += $aPoint[1] * $aPoint[2]
    Next

    ; Solve for $M and $B :
    $M = ($Sxy * $S1 - $Sx * $Sy) / ($Sxx * $S1 - $Sx ^ 2)
    $B = ($Sxy * $Sx - $Sy * $Sxx) / ($Sx ^ 2 - $S1 * $Sxx)

    Return Sqrt(ErrorSquared($aPoints, $M, $B))
  EndFunc

I took this further with a routine to remove an outlier:

  ; Remove one outlier (if any)
  ; Return index of the point removed
  Func RemoveOutlier(ByRef $aPoints, $M, $B)
    Local $Total = 0
    Local $iOutlier
    Local $aDy2 = aNewArray($aPoints[0])
    For $I = 1 To $aPoints[0]
      Local $aPoint = $aPoints[$I]
      $aDy2[$I] = ($aPoint[2] - ($M * $aPoint[1] + $B)) ^ 2
      $Total += $aDy2[$I]
    Next
    Local $Max = aMax($aDy2)
    Local $Mean = $Total / $aPoints[0]
    If $Max > 0 And $aPoints[0] > 3 And $Max > 4 * $Mean Then
      $iOutlier = aSearch($aDy2, $Max)
      $Total -= $aDy2[$iOutlier]
      aDelete($aPoints, $iOutlier)
    EndIf
    Return $iOutlier
  EndFunc
  Func DataSet1(ByRef $aPoints)
    aAdd($aPoints, aConcat(1, 2 + .1))
    aAdd($aPoints, aConcat(2, 3 - .1))
    aAdd($aPoints, aConcat(3, 4 + .1))
    aAdd($aPoints, aConcat(4, 5 - .5))
    aAdd($aPoints, aConcat(5, 6 + .1))
    aAdd($aPoints, aConcat(6, 7 - .1))
    aAdd($aPoints, aConcat(7, 8 + .1))
    aAdd($aPoints, aConcat(8, 9 - .1))
    aAdd($aPoints, aConcat(9, 10 + .1))
  EndFunc
  Func TestLineFit()
    Local $M, $B
    Local $aPoints = aNewArray()
    DataSet1($aPoints)
    Local $E = LineFit($aPoints, $M, $B)
    Msg('The Outcome', 'm= ' & $M & @LF & ' b= ' & $B)
  EndFunc
  Func TestBestLineFit()
    Local $M, $B
    Local $aPoints = aNewArray()
    DataSet1($aPoints)
    Do
      Local $N = $aPoints[0]
      Local $E = LineFit($aPoints, $M, $B)
      Msg('Outcome', 'm= ' & $M & @LF & ' b= ' & $B)
      Local $iRemoved = RemoveOutlier($aPoints, $M, $B)
      If $iRemoved Then Msg('Removal', 'item= ' & $iRemoved)
    Until $iRemoved = 0
    Msg('BestFit', 'm= ' & $M & @LF & ' b= ' & $B)
  EndFunc
  TestBestLineFit()

Keywords:
algorithms, least squares, linear least squares, curve fitting, graphics.

Gun related deaths - UK vs US
Gun ownership - UK vs US


Bert Kerkhof
usually writes about health care and the rule of law

Waardeer dit blog

Dit artikel las je gratis. De moeite waard? Laat je waardering zien en draag €1 bij of een veelvoud daarvan. Via Paypal of creditcard.

€ 1,00

Geef een reactie

Vul je gegevens in of klik op een icoon om in te loggen.

WordPress.com logo

Je reageert onder je WordPress.com account. Log uit /  Bijwerken )

Facebook foto

Je reageert onder je Facebook account. Log uit /  Bijwerken )

Verbinden met %s

Site gemaakt door WordPress.com.

Omhoog ↑

%d bloggers liken dit: