this is for holding javascript data
Edward Brown deleted figures/sample-datasets/FittingData.ipynb
about 9 years ago
Commit id: 1899aa335a649ab664d345184b62124d7c173b07
deletions | additions
diff --git a/figures/sample-datasets/FittingData.ipynb b/figures/sample-datasets/FittingData.ipynb
deleted file mode 100644
index f6c5319..0000000
--- a/figures/sample-datasets/FittingData.ipynb
+++ /dev/null
...
{
"metadata": {
"name": "",
"signature": "sha256:3a516480996866b6e191eb0415c8800297bd3471679e625f31f9974815c8534c"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Fitting a line to a dataset\n",
"\n",
"##Introduction\n",
"Suppose we are given a set of data pairs, $\\{(x_i,y_i)\\}_{i=1,\\ldots,N}$. Let's assume that the $x_i$ are known, but that the $y_i$ have uncertainties. That is, the measured value $y_i$ is the \"true\" value plus some additional fluctuation. We further assume (perhaps we have some model or hypothesis) that the $y$ and $x$ are linearly related, although we don't know the slope or intercept of that line.\n",
"\n",
"Because of the fluctuations, no line will perfectly fit the data, but we can find a best fit line $y(x) = ax + b$ by adjusting the parameters $a$ and $b$ to maximize the probability of our dataset. In practice this means finding $a$, $b$ that minimize\n",
"\\begin{equation*}\n",
" \\chi^2 = \\sum_{i=1}^N \\frac{\\left[y_i - y(x_i)\\right]^2}{\\sigma^2}.\n",
"\\end{equation*}\n",
"Even though this line will be the one that fits the data the best, it doesn't necessarily imply, however, that the fit is any good. To determine the goodness-of-fit, we look at $\\chi^2$ and compare its value to what we would expect if our assumptions are correct: namely, that the $y$ and $x$ pairs are linearly related, but the $y$ have some flucuations that obey a gaussian relation with standard deviation $\\sigma$.\n",
"\n",
"To show how this works, we'll use `Python` to construct datasets that violate these assumputions, and then we'll see what happens to the $\\chi^2$ values. \n",
"\n",
"##Set up the workspace\n",
"As before, we'll load in `numpy` and `matplotlib.pyplot`. We'll also bring into our *namespace* the `random` function from the `numpy.random` module. This is done for convenience, to avoid typing `np.random.random`. We also import a class, `sampleDataSets`, from the file `sample_datasets`, which you should have in the directory containing this notebook."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%matplotlib inline\n",
"import numpy as np\n",
"from numpy.random import random\n",
"import matplotlib.pyplot as plt"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Create a class `sampleDataSets`\n",
"This sets up a class to make the fake datasets."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import linspace,zeros\n",
"from numpy.random import standard_normal, random, random_integers\n",
"\n",
"class sampleDataSets:\n",
" \"\"\"\n",
" Sets up a linear relation, y = m*x + b. Datasets can be generated \n",
" from this relation by adding gaussian fluctuations to each y. The std. \n",
" deviation of the fluctuation are chosen from a uniform random distribution \n",
" between 0.3 and 0.7. There are 3 choices for datasets.\n",
" 1. Fits the data much better than would be indicated by the size of its\n",
" quoted uncertainties. The real fluctations have a std. dev. that is \n",
" 1/5 of the quoted one.\n",
" 2. Uncertainties are drawn from a normal distribution with a standard\n",
" deviation matching the size of the errorbars. This should produce\n",
" an ideal chi^2 distribution if many trials are conducted.\n",
" 3. Identical to 2, but 20% of the datapoints are given 5 sigma \n",
" fluctuations.\n",
" \"\"\"\n",
" \n",
" _slope = 3.0\n",
" _intercept = 1.0\n",
" _sig_low = 0.3\n",
" _sig_high = 0.7\n",
"\n",
" def __init__(self):\n",
" \"\"\"\n",
" Sets the relation, the nominal size of the errorbars, and the actual \n",
" size of the errorbars.\n",
" \"\"\"\n",
" a = self._sig_low\n",
" b = self._sig_high\n",
" self._amp = (b-a)*random() + a\n",
" self._fake = 0.2*self._amp\n",
" \n",
" def make_dataset(self,x,use_fake=False,with_outliers=False):\n",
" \"\"\"\n",
" Constructs a dataset from the given relation with errorbars drawn from \n",
" a normal distribution.\n",
" \n",
" Arguments\n",
" ---------\n",
" x := [array-like] the values at which the relation should be \n",
" evaluated\n",
" use_fake := if True, then reduce the standard deviation of the \n",
" fluctuations by a factor of 10. This dataset will \n",
" have an anomalously low chi^2.\n",
" with_outliers:= if True, then 20% of the points will have 10 sigma \n",
" fluctuations.\n",
" \n",
" Returns\n",
" -------\n",
" y := an ndarray of length(x.size) containing the dataset\n",
" \"\"\"\n",
" \n",
" m = self._slope\n",
" b = self._intercept\n",
" if use_fake:\n",
" sig = self._fake\n",
" else:\n",
" sig = self._amp\n",
" y = m*x + b + sig*standard_normal(len(x))\n",
" if with_outliers:\n",
" n = int(0.2*x.size)\n",
" sgn = zeros(n)\n",
" for i in range(n):\n",
" if random() < 0.5:\n",
" sgn[i] = -1.0\n",
" else:\n",
" sgn[i] = 1.0\n",
" indcs = random_integers(0,x.size-1,size=(2))\n",
" y[indcs] = m*x[indcs] + b + sgn*5.0*sig\n",
" return y\n",
" \n",
" def quoted_error(self):\n",
" \"\"\"\n",
" returns the quoted standard deviation\n",
" \"\"\"\n",
" return self._amp\n",
" \n",
" def unrealistic_dataset(self,x):\n",
" \"\"\"\n",
" Returns a dataset with actual uncertainties much less than the quoted \n",
" errorbars.\n",
" \n",
" Arguments\n",
" ---------\n",
" x := [array-like] the values at which the relation should be \n",
" evaluated\n",
" \n",
" Returns\n",
" -------\n",
" y := an ndarray of length(x.size) containing the dataset \n",
" \"\"\"\n",
" \n",
" return self.make_dataset(x,use_fake=True)\n",
"\n",
" def realistic_dataset(self,x):\n",
" \"\"\"\n",
" Returns a dataset with uncertainties that agree with the quoted \n",
" errorbars.\n",
" \n",
" Arguments\n",
" ---------\n",
" x := [array-like] the values at which the relation should be \n",
" evaluated\n",
" \n",
" Returns\n",
" -------\n",
" y := an ndarray of length(x.size) containing the dataset \n",
" \"\"\"\n",
"\n",
" return self.make_dataset(x,use_fake=False)\n",
"\n",
" def dataset_with_outliers(self,x):\n",
" \"\"\"\n",
" Returns a dataset with 80% of the points drawn from the normal \n",
" distribution, and 20% of the points having 10-sigma fluctuations.\n",
"\n",
" Arguments\n",
" ---------\n",
" x := [array-like] the values at which the relation should be \n",
" evaluated\n",
" \n",
" Returns\n",
" -------\n",
" y := an ndarray of length(x.size) containing the dataset \n",
" \"\"\"\n",
"\n",
" return self.make_dataset(x,use_fake=False,with_outliers=True)\n",
" \n",
" def fit(self,x):\n",
" \"\"\"\n",
" Returns the true relation, y = m*x + b\n",
"\n",
" Arguments\n",
" ---------\n",
" x := [array-like] the values at which the relation should be \n",
" evaluated\n",
" \n",
" Returns\n",
" -------\n",
" y := an ndarray of length(x.size) containing the underlying \n",
" relation.\n",
" \"\"\"\n",
" \n",
" m = self._slope\n",
" b = self._intercept\n",
" return m*x + b\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Some sample datasets\n",
"Now we are ready to make some trial datasets. Let's extract 10 points,\n",
"\n",
" Ndata = 10\n",
" \n",
"The line\n",
"\n",
" d = sampleDataSets()\n",
" \n",
"initializes our sample: it sets the slope, intercept, and standard deviation of our points. We then pick some $x$-values, calculate the true linear realation\n",
"\n",
" xfit = np.linspace(0.0,3.0,Ndata)\n",
" tfit = d.fit(xfit)\n",
"\n",
"and show the results."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Ndata = 10\n",
"d = sampleDataSets()\n",
"xfit = np.linspace(0.0,3.0,Ndata)\n",
"tfit = d.fit(xfit)\n",
"plt.plot(xfit,tfit)\n",
"plt.xlabel('x')\n",
"plt.ylabel('y')\n",
"plt.title('The actual relation')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see what a realistic dataset might look like. We'll pick some $x$-values in the interval $[0,3)$ at random,\n",
"\n",
" x = 3.0*random(Ndata)\n",
" \n",
"and generate our dataset. You can rerun this cell several times to get a different realization of the data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x = 3.0*random(Ndata)\n",
"r = d.realistic_dataset(x)\n",
"f = d.unrealistic_dataset(x)\n",
"o = d.dataset_with_outliers(x)\n",
"\n",
"plt.figure(figsize=(10,3))\n",
"plt.subplot(1,3,1)\n",
"plt.errorbar(x,r,yerr=d.quoted_error(),fmt='r.')\n",
"plt.plot(xfit,tfit)\n",
"plt.xlabel('x')\n",
"plt.ylabel('y')\n",
"plt.title('A realistic dataset')\n",
"plt.subplot(1,3,2)\n",
"plt.errorbar(x,f,yerr=d.quoted_error(),fmt='r.')\n",
"plt.plot(xfit,tfit)\n",
"plt.title('An unrealistic dataset')\n",
"plt.xlabel('x')\n",
"plt.ylabel('y')\n",
"plt.subplot(1,3,3)\n",
"plt.errorbar(x,o,yerr=d.quoted_error(),fmt='r.')\n",
"plt.plot(xfit,tfit)\n",
"plt.xlabel('x')\n",
"plt.ylabel('y')\n",
"plt.title('A dataset with outliers')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** Explain why the middle dataset is unrealistic."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Double-click in this box and type your answer here.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Testing the data\n",
"Now we'll look at the $\\chi^2$ test for the data, where $\\chi^2$ is defined in the introduction, above. First, we'll define a function `chi2`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def chi2(y,d,s):\n",
" \"\"\"\n",
" Computes chi^2 for a set of fitted points and data.\n",
" \n",
" Arguments\n",
" ---------\n",
" y := [array] points y(x) from the fit\n",
" d := [array] data points d(x)\n",
" s := std.deviation\n",
" \n",
" Returns\n",
" -------\n",
" chi2\n",
" \"\"\"\n",
" \n",
" q2 = (y-d)**2 / s**2\n",
" return np.sum(q2)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's put it to the test. We'll do `Ntrials = 1000` fits to our dataset. We'll compute $\\chi^2$ in two ways. The first, using our \"true\" values for $a$ and $b$; the second, using an values of $a$ and $b$ that are best fit from the data. \n",
"\n",
"**Stop:** Before proceeding, *predict* how the values of $\\chi^2$ computed with these two methods will differ.\n",
"\n",
"###The $\\chi^2$ distribution\n",
"First, we'll allocate two arrays, each of length `Ntrials`, to hold our results.\n",
"\n",
" chi2fit = np.zeros(Ntrials)\n",
" chi2best = np.zeros(Ntrials)\n",
"\n",
"Next, we'll make an array `t` of the \"true\" values of `y` for each `x`.\n",
"\n",
" t = d.fit(x)\n",
" \n",
"Now we will run our trials. For each trial, we'll make a realstic dataset, `s`, and then fit a line to that data and compute our estimated values of `y` for each `x`.\n",
"\n",
" p = np.polyfit(x,s,1)\n",
" y = x*p[0]+p[1]\n",
"\n",
"Finally, we'll compute the $\\chi^2$ for our \"true\" relation and for the fitted $a,b$.\n",
"\n",
" chi2fit[n] = chi2(t,s,d.quoted_error())\n",
" chi2best[n] = chi2(y,s,d.quoted_error())\n",
"\n",
"We'll then plot the histograms of the two different $\\chi^2$ distributions."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Ntrials = 1000\n",
"chi2fit = np.zeros(Ntrials)\n",
"chi2best = np.zeros(Ntrials)\n",
"t = d.fit(x)\n",
"for n in range(Ntrials):\n",
" s = d.realistic_dataset(x)\n",
" p = np.polyfit(x,s,1)\n",
" y = x*p[0]+p[1]\n",
" chi2fit[n] = chi2(t,s,d.quoted_error())\n",
" chi2best[n] = chi2(y,s,d.quoted_error())\n",
"alabel = '{0:5.2f}'.format(np.average(chi2fit))\n",
"flabel = '{0:5.2f}'.format(np.average(chi2best))\n",
"plt.xlabel(r'$\\chi^2$')\n",
"plt.ylabel(r'$N$')\n",
"plt.hist(chi2fit,bins=20,align='mid',histtype='step',label=r'actual: $\\chi^2 = '+alabel+r'$')\n",
"plt.hist(chi2best,bins=20,align='mid',histtype='step',label='best-fit: $\\chi^2 = '+flabel+r'$')\n",
"plt.legend(frameon=False)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** Explain the difference between these two distributions of $\\chi^2$. Are the averages what you expect?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Double-click in this box and type your answer.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Testing our datasets\n",
"Now that we have seen the distribution of $\\chi^2$ for our \"realistic\" dataset, let's look at how we can use this test to assess whether our fit is plausible.\n",
"\n",
"Recall that we stored our datasets in variables `r` (realistic), `f` (fake), and `o` (with outliers). We'll now fit each of those in turn, and compute a $\\chi^2$ for each, which we'll store in the variables `chi2r`, `chi2f`, `chi2o`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"p = np.polyfit(x,r,1)\n",
"y = x*p[0]+p[1]\n",
"chi2r = chi2(y,r,d.quoted_error())\n",
"print 'chi-squared for realistic data = {0:5.2f}'.format(chi2r)\n",
"\n",
"p = np.polyfit(x,f,1)\n",
"y = x*p[0]+p[1]\n",
"chi2f = chi2(y,f,d.quoted_error())\n",
"print 'chi-squared for data with over-large error bars = {0:5.2f}'.format(chi2f)\n",
"\n",
"p = np.polyfit(x,o,1)\n",
"y = x*p[0]+p[1]\n",
"chi2o = chi2(y,o,d.quoted_error())\n",
"print 'chi-squared for data with outliers = {0:5.2f}'.format(chi2o)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at where these values of $\\chi^2$ lie on our distribution `chi2best`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.title(r'$\\chi^2$ for a dataset that matches hypothesis')\n",
"plt.xlabel(r'$\\chi^2$')\n",
"plt.ylabel(r'$N$')\n",
"hst,bn,ptch = plt.hist(chi2best,bins=20,align='mid',histtype='step', label='dist. best-fit')\n",
"plt.vlines(chi2r, 0.0, max(hst), colors='k', linestyles='solid',label='matches hypothesis')\n",
"plt.legend()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.title(r'$\\chi^2$ for a dataset with unrealistic errorbars')\n",
"plt.xlabel(r'$\\chi^2$')\n",
"plt.ylabel(r'$N$')\n",
"hst,bn,ptch = plt.hist(chi2best,bins=20,align='mid',histtype='step', label='best-fit dist.')\n",
"plt.vlines(chi2f,0.0,max(hst),colors = 'g', linestyles='dashed', label='too large errorbars')\n",
"plt.legend()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.title(r'$\\chi^2$ for a dataset with outliers')\n",
"plt.xlabel(r'$\\chi^2$')\n",
"plt.ylabel(r'$N$')\n",
"hst,bn,ptch = plt.hist(chi2best,bins=20,align='mid',histtype='step', label='best-fit dist.')\n",
"plt.vlines(chi2o,0.0,max(hst),colors = 'r', linestyles='dotted', label='with outliers')\n",
"plt.legend()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##The $\\chi^2$ test\n",
"There is a theoretical distribution of $\\chi^2$ values. To see whether our fit is reasonable, we compute its value of $\\chi^2$ and compare that value against the distribution. Our question is, \"If the dataset is as we expect, what is the probability of getting a value of $\\chi^2$ larger than ours?\" If this is very unlikely, we conclude that our fit does not describe the data.\n",
"\n",
"Alternatively, we may get a probability that is very close to 1; in other words, almost *any* dataset would have a worse fit. This may suggest that the error bars are overestimated of that the data has been faked.\n",
"\n",
"The following block of code loads the `scipy.stats` module; it then defines a $\\chi^2$ distribution with 8 degrees of freedom (10 data points - 2 adjustable parameters), and calculates the cumulative probability of getting a value of $\\chi^2$ *larger* than `chi2r`, `chi2f`, and `chi2o`. Then to make things interesting, we scramble the order of our probabilities and print them.\n",
"\n",
"Your task: decide which probability matches the different datasets."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import scipy.stats as st\n",
"df = 8# there are 10 data points and 2 adjustable parameters, so we have 8 degrees of freedom.\n",
"c2 = st.chi2(df)\n",
"ps = (c2.sf(chi2r),c2.sf(chi2f),c2.sf(chi2o))\n",
"# scramble the probabilities\n",
"ps = np.random.permutation(ps)\n",
"for lab, p in zip('abc',ps):\n",
" print '{0}. {1:9.2e}'.format(lab,p)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**: Edit this cell. Beside each letter write the value of $p$ and the dataset \u2014 realistic, overlarge errorbars, or outliers \u2014 to which it corresponds.\n",
"\n",
"a. \n",
"\n",
"b. \n",
"\n",
"c. "
]
}
],
"metadata": {}
}
]
}