I have a set of points scattered around the origin. How do I find a vector such that the average squared (perpendicular) distance from the points to the vector is minimised?
For example, let the dimension be 2. A vector can then be written (from the origin) as $(a_1,a_2)$. Now take a point $(x_1, x_2)$. Drop a perpendicular line from the point to the vector; I want to find the length of that line segment.
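For a single point that length is easy to compute; here is a minimal NumPy sketch of what I mean by the distance (the name `perp_distance` is only for illustration). My question is how to choose the vector so that the average of these squared distances over all the points is as small as possible.

```python
import numpy as np

def perp_distance(point, direction):
    """Perpendicular distance from `point` to the line through the
    origin spanned by `direction` (any nonzero vector)."""
    p = np.asarray(point, dtype=float)
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)        # unit vector along the line
    foot = np.dot(p, u) * u          # foot of the perpendicular
    return np.linalg.norm(p - foot)

# e.g. distance from the point (3, 4) to the line spanned by (1, 1)
print(perp_distance([3, 4], [1, 1]))   # ~0.7071
```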
asked Jun 19, 2014 at 10:08 by user157976
If I understand the question correctly, this is the problem of fitting with perpendicular offsets.
For the 3D case, see pages 2-12 in the section "3D Linear Regression" of the paper: http://fr.scribd.com/doc/31477970/Regressions-et-trajectoires-3D
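One common way to carry out such a fit numerically, in any dimension, is via the singular value decomposition. The following is a minimal sketch (assuming NumPy; it is not taken from the linked paper, and `best_direction` is only an illustrative name). It treats the line as passing through the origin, as in the question; for a general line fitted with perpendicular offsets, subtract the centroid from the points first.

```python
import numpy as np

def best_direction(points):
    """Unit direction of the line through the origin that minimises the
    mean squared perpendicular distance to `points` (any dimension).
    It is the leading right singular vector of the (uncentred) data matrix."""
    X = np.asarray(points, dtype=float)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]

# Example with nearly collinear 3D points
pts = np.array([[1.0, 2.1, 2.9], [2.0, 3.9, 6.2], [-1.0, -2.0, -3.1]])
print(best_direction(pts))   # up to sign, roughly (1, 2, 3)/sqrt(14)
```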
answered Aug 6, 2014 at 4:03 by JJacquelin

I have done this in the past. The key is to write the line in polar form (one reference is here: http://www.robertobigoni.eu/Matematica/Lines/lines08/lines08.html ).
In this form, the parameters are the distance of the line from the origin and the angle the line makes with the X axis.
The advantage of this form of the line is that the distance from any point to the line is easily computed.
It turns out that the line which minimizes the sum of squares of the distances from the points to the line passes through the centroid of the data points. Of all the lines which do this, there are two for which the derivative of the sum of squares of the distances is zero - one which minimizes the sum of squares (which is the one you want) and one, which is orthogonal to the first, which maximizes the sum of squares.
It turns out that this can be derived without any calculus.
I'll see if I can resurrect my results, but this might be a good start.
Here is what I wrote, after converting from $\LaTeX$ to MathJax:
There are a number of forms that the equation of a straight line can take (e.g., point-slope, intersection with axes, $y~=~mx+b$). The form most useful for our purpose is the $polar$ form. In this form, a line $L$ is determined by its distance $r$ from the origin and the angle $\theta$ that the normal from the origin to the line makes with the $x$-axis, the angle being measured counterclockwise.
The equation of $L$ in terms of $r$ and $\theta$ is
$$L:~~ x\cos \theta + y \sin \theta ~=~r.$$
This can be verified by noting that $x~=~r/\cos\theta$ at $y~=~0$ and $y~=~r/\sin\theta$ at $x~=~0$.
Another derivation of this equation for $L$ is as follows:
Let $(x,y)$ be a point on $L$, $d$ the distance of $(x,y)$ from the origin, and $\phi$ the angle that the line from the origin to $(x, y)$ makes with the $x$-axis. The angle between this line and the normal to $L$ is easily seen to be $\theta-\phi$. Thus, $r/d ~=~\cos(\theta-\phi)$.
Since $x~=~d\cos \phi$ and $y~=~d\sin \phi$, $ x \cos \theta + y \sin \theta = d \cos (\theta-\phi) = r. $
This derivation, though more complicated than the preceding one, has the advantage of also giving explicit formulae for $x$ and $y$ in terms of $r$, $\theta$, and $\phi$: $$x~=~\frac{r\cos\phi}{\cos(\theta-\phi)}, \qquad y~=~\frac{r\sin\phi}{\cos(\theta-\phi)}.$$
The polar form for the equation of $L$ is so useful because the distance from
any point $(u, v)$ to $L$ is $u\cos \theta + v \sin \theta - r$. To show this, consider the line $L'$ through $(u, v)$ that is parallel to $L$. If $d$ is the distance from $L'$ to $L$, which is also the distance from $(u, v)$ to $L$, the equation of $L'$ is $x\cos\theta+y\sin\theta~=~d+r$, since $L$ and $L'$ are parallel.
Since $(u, v)$ is on $L'$, $u\cos\theta+v\sin\theta~=~d+r$, so that, as claimed, $d~=~u\cos\theta+v\sin\theta-r$.
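As a quick numerical illustration of this formula (a minimal sketch assuming NumPy; the example line and point are made up): the vertical line $x = 2$ has $\theta = 0$, $r = 2$, and the point $(5, 7)$ lies 3 units from it.

```python
import numpy as np

def polar_line_distance(u, v, theta, r):
    """Signed distance from the point (u, v) to the line
    x*cos(theta) + y*sin(theta) = r."""
    return u * np.cos(theta) + v * np.sin(theta) - r

print(polar_line_distance(5.0, 7.0, 0.0, 2.0))   # 3.0
```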
Evaluating the mean squared error
We now define some notation and abbreviations.
Let $c~=~\cos \theta$ and $s~=~\sin \theta$, so the equation of $L$ is $cx+sy~=~r$. The data points used to fit $L$ are $(x_i, y_i)$ for $i=1$ to $n$ (i.e., there are $n$ points).
For any expression $f$, we define $mean(f)$ to be the average of $f$ over all the data points, so that $$mean(f) ~=~\frac{1}{n}\sum_{i=1}^n f_i.$$
For example, $mean(x) ~=~\frac{1}{n}\sum_{i=1}^n x_i$, and $mean(xy) ~=~\frac{1}{n}\sum_{i=1}^n x_i y_i$.
Note: We can also define $mean(f)$ to be a weighted mean $$mean(f) ~=~\frac{\sum_{i=1}^n f_i w_i}{\sum_{i=1}^n w_i},$$ where each $w_i > 0$, and the results which follow are not affected at all.
If $p$ and $q$ are any of $x$ and $y$, we define the covariance between the variables $p$ and $q$ to be $$cov(p,q)~=~mean(pq) -mean(p) mean(q).$$
For example, $cov(x,x)~=~mean(x^2)-mean(x)^2$ and $cov(x,y)=mean(xy)-mean(x)mean(y).$
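These definitions translate directly into code; note that $cov$ as defined here divides by $n$ (the population convention), unlike `numpy.cov`'s default of $n-1$. A minimal sketch, assuming NumPy and some made-up sample points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])     # sample x_i
y = np.array([1.1, 1.9, 3.2, 3.8])     # sample y_i

def mean(f):
    """Unweighted average over the data points."""
    return np.sum(f) / len(f)

def cov(p, q):
    """cov(p, q) = mean(pq) - mean(p) mean(q)."""
    return mean(p * q) - mean(p) * mean(q)

print(cov(x, x), cov(y, y), cov(x, y))
```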
If $D$ is the mean squared error of $L$, then
$$\begin{aligned} D &= mean\big((\text{distance from point } i \text{ to } L)^2\big)\\ &= mean(d^2)\\ &= mean((cx+sy-r)^2) \\ &= mean(c^2x^2+s^2y^2+r^2 + 2csxy-2crx-2sry) \\ &= c^2\, mean(x^2) +s^2\, mean(y^2) +r^2+ 2sc\, mean(xy) -2cr\, mean(x) -2sr\, mean(y). \end{aligned}$$
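This expansion can be sanity-checked numerically by comparing the direct definition of $D$ with the expanded expression (a minimal sketch assuming NumPy; the sample data and function names are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

def D_direct(theta, r):
    """Mean squared distance to the line x*cos(theta) + y*sin(theta) = r."""
    d = x * np.cos(theta) + y * np.sin(theta) - r
    return np.mean(d ** 2)

def D_expanded(theta, r):
    """The same quantity via the expanded formula."""
    c, s = np.cos(theta), np.sin(theta)
    return (c**2 * np.mean(x**2) + s**2 * np.mean(y**2) + r**2
            + 2*s*c * np.mean(x*y) - 2*c*r * np.mean(x) - 2*s*r * np.mean(y))

print(D_direct(0.3, 1.0), D_expanded(0.3, 1.0))   # agree up to rounding
```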
Minimizing the mean squared error
If $L$ is to be the best fitting line in the least mean squared sense, we must have $\frac{\partial D}{\partial r}~=~0$ and $\frac{\partial D}{\partial \theta}~=~0.$
However, the values of $r$ and $\theta$ that minimize $D$ can be found without using any calculus. This will now be done by writing $D$ as the sum of terms which, when independently minimized, give the desired values for $r$ and $\theta$.
Completing the square in $r$ rewrites the expression above as $$D~=~c^2\,cov(x,x)+s^2\,cov(y,y)+2sc\,cov(x,y)+(r-c\,mean(x)-s\,mean(y))^2.$$ Letting $S=\sin 2\theta =2sc$ and $C=\cos 2\theta=c^2-s^2$, so that $c^2=(1+C)/2$ and $s^2=(1-C)/2$, $$\begin{aligned} D~&=~\frac{cov(x,x)+cov(y,y)}{2} + C\,\frac{cov(x,x)-cov(y,y)}{2} + S\,cov(x,y) +(r-c\,mean(x) -s\,mean(y))^2\\ &=~D_1+C\,D_2+S\,D_3 +(r-c\,mean(x) -s\,mean(y))^2, \end{aligned}$$ where $D_1=\dfrac{cov(x,x)+cov(y,y)}{2}$, $D_2=\dfrac{cov(x,x)-cov(y,y)}{2}$, and $D_3~=~cov(x,y)$.
Now write $C\,D_2+S\,D_3~=~R\cos(2\theta-\phi)$, where $R~=~\sqrt{D_2^2+D_3^2}$ and $\phi$ is the angle (a new angle, unrelated to the earlier $\phi$) with $\cos\phi=D_2/R$ and $\sin\phi=D_3/R$. Since $\cos(2\theta-\phi) \ge -1$ and $(r-c\,mean(x)-s\,mean(y))^2 \ge 0$, we have $D \ge D_1-R$. By choosing $$\theta~=~\frac{\phi+\pi}{2}, \qquad r~=~mean(x)\cos\theta + mean(y)\sin\theta,$$ $D$ achieves its minimum value $$D~=~D_1 - R.$$
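Putting the closed-form solution together (a minimal sketch assuming NumPy; `fit_line_polar` and the sample data are only illustrative), one can check that the line obtained this way does attain $D = D_1 - R$, and that it passes through the centroid as noted earlier:

```python
import numpy as np

def fit_line_polar(x, y):
    """Perpendicular least-squares line in polar form (theta, r),
    following the closed-form solution above; also returns D1 - R."""
    mx, my = np.mean(x), np.mean(y)
    cxx = np.mean(x * x) - mx * mx
    cyy = np.mean(y * y) - my * my
    cxy = np.mean(x * y) - mx * my
    D1, D2, D3 = (cxx + cyy) / 2, (cxx - cyy) / 2, cxy
    R = np.hypot(D2, D3)
    phi = np.arctan2(D3, D2)              # cos(phi) = D2/R, sin(phi) = D3/R
    theta = (phi + np.pi) / 2
    r = mx * np.cos(theta) + my * np.sin(theta)
    return theta, r, D1 - R

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
theta, r, Dmin = fit_line_polar(x, y)
d = x * np.cos(theta) + y * np.sin(theta) - r
print(Dmin, np.mean(d ** 2))              # the two values coincide
```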