Data analysis for the life sciences pdf download

We go from relatively basic concepts related to computing p-values to advanced topics related to analyzing high-throughput data. While statistics textbooks focus on mathematics, this book focuses on using a computer to perform data analysis. Instead of explaining the mathematics and theory, and then showing examples, we start by stating a practical data-related challenge.

This book also includes the computer code that provides a solution to the problem and helps illustrate the concepts behind the solution. By running the code yourself, and seeing data generation and analysis happen live, you will get a better intuition for the concepts, the mathematics, and the theory.

The book was created using the R markdown language and we make all this code available to the reader. This means that readers can replicate all the figures and analyses used to create the book.

The authors are affiliated with the Harvard T.H. Chan School of Public Health. For the past 17 years, Dr. Love has used statistical models to discover biologically relevant patterns in genomic datasets, and he develops open-source statistical software for the Bioconductor Project. See full terms. If you buy a Leanpub book, you get free updates for as long as the author updates the book! Many authors use Leanpub to publish their books in progress, while they are writing them.

All readers get free updates, regardless of when they bought the book or how much they paid (including free). The formats that a book includes are shown at the top right corner of this page. Finally, Leanpub books don't have any DRM copy-protection nonsense, so you can easily read them on any supported device. Learn more about Leanpub's ebook formats and where to read them.

You can use Leanpub to easily write, publish and sell in-progress and completed ebooks and online courses! Leanpub is a powerful platform for serious authors, combining a simple, elegant writing and publishing workflow with a store focused on selling in-progress ebooks. Leanpub is a magical typewriter for authors: just write in plain text, and to publish your ebook, just click a button.

In this case, the success probability also affects the appropriateness of the CLT.
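To make the effect of the success probability concrete, here is a minimal sketch of the kind of simulation the exercise below asks for. The function name assessCLT, the values of p and n, and the number of replications B are our choices, not the book's: we standardize simulated sample proportions and compare them to the normal with a qq-plot.

# Sketch: check the normal approximation to a sample proportion for given p, n
set.seed(1)
assessCLT <- function(p, n, B = 1000) {
  phat <- replicate(B, mean(rbinom(n, size = 1, prob = p)))  # sample proportions
  z <- (phat - p) / sqrt(p * (1 - p) / n)                    # standardized
  qqnorm(z, main = paste0("p = ", p, ", n = ", n))
  abline(0, 1)
}
assessCLT(p = 0.5, n = 100)   # points hug the identity line: approximation good
assessCLT(p = 0.01, n = 10)   # approximation clearly breaks down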

Run the simulation from exercise 1, but for different values of p and n. For which of the following is the normal approximation best?

As we have already seen, the CLT also applies to averages of quantitative data. In several previous exercises we have illustrated statistical concepts with the unrealistic situation of having access to the entire population. Instead, we obtain one random sample and need to reach conclusions by analyzing that data. We have 12 measurements for each of two populations.

We think of X as a random sample from the population of all mice on the control diet and Y as a random sample from the population of all mice on the high-fat diet. Now we introduce the concept of a null hypothesis. What is this t-statistic? If we apply the CLT, what is the distribution of this t-statistic? Now we are ready to compute a p-value using the CLT.

What is the probability of observing a quantity as large as the one we computed in exercise 10, when the null hypothesis is true? The CLT provides an approximation for cases in which the sample size is large. As a result, if this approximation is off, so is our p-value.

As described earlier, there is another approach that does not require a large sample size, but rather requires that the distribution of the population is approximately normal. If we are willing to assume this, then it follows that the t-statistic follows a t-distribution. What is the p-value under the t-distribution approximation?

Hint: use the t.test function. With the CLT approximation, we obtained a p-value smaller than 0.05. What best describes the difference? The t-distribution accounts for the variability introduced by estimating the standard error; thus, under the null, large values of the t-statistic are more probable under the t-distribution than under the normal approximation.

We will now demonstrate how to obtain a p-value in practice. We begin by loading experimental data and walking you through the steps used to form a t-statistic and compute a p-value.

We can perform this task with just a few lines of code (go to the end of this section to see them). We start by reading in the data. A first important step is to identify which rows are associated with treatment and control, and to compute the difference in means.

We are asked to report a p-value. What do we do? We learned that diff, referred to as the observed effect size, is a random variable. We also learned that under the null hypothesis, the mean of the distribution of diff is 0. What about the standard error? We also learned that the standard error of this random variable is the population standard deviation divided by the square root of the sample size, sigma / sqrt(N).

We use the sample standard deviation as an estimate of the population standard deviation. In R, we simply use the sd function, and the SE is the sample standard deviation divided by the square root of the sample size. This is the SE of the sample average, but we actually want the SE of diff. We saw how statistical theory tells us that the variance of the difference of two random variables is the sum of their variances, so we compute the variances and take the square root. Statistical theory also tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1.

This ratio is what we call the t-statistic. Once we know the distribution of this random variable, we can easily compute a p-value. As explained in the previous section, the CLT tells us that for large sample sizes, both sample averages (the mean of the treatment sample and the mean of the control sample) are approximately normal. Statistical theory tells us that the difference of two normally distributed random variables is again normal, so the CLT tells us that tstat is approximately normal with mean 0 (the null hypothesis) and SD 1 (we divided by its SE).

So now to calculate a p-value, all we need to do is ask: how often does a normally distributed random variable exceed diff? R has a built-in function, pnorm, to answer this specific question: pnorm(a) gives the probability that a standard normal random variable is smaller than a. To obtain the probability that it is larger than a, we simply use 1 - pnorm(a). We want to know the probability of seeing something as extreme as diff: either smaller (more negative) than -abs(diff) or larger than abs(diff).
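Putting the steps above together, here is a minimal sketch, assuming `treatment` and `control` are the numeric vectors of weights extracted from the dataset (12 values each; the variable names are ours):

# observed effect size
diff <- mean(treatment) - mean(control)
# SE of each sample average is sd/sqrt(N); the variances of the averages add
se <- sqrt(sd(treatment)^2 / length(treatment) +
           sd(control)^2 / length(control))
tstat <- diff / se                      # the t-statistic
# two-sided p-value under the CLT normal approximation
pval <- 2 * (1 - pnorm(abs(tstat)))
pval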

In this case, the p-value is smaller than 0.05. Now there is a problem: the CLT works for large samples, but is 12 large enough? A rule of thumb is that a sample size of 30 is large enough for the CLT, but that is just a rule of thumb.

The p-value we computed is only a valid approximation if the assumptions hold, and they do not seem to hold here. However, there is an option other than using the CLT. As described earlier, statistical theory offers another useful result: if the distribution of the population is normal, then we can work out the exact distribution of the t-statistic without the need for the CLT. For something like weight, we suspect that the population distribution is likely well approximated by the normal, so we can use this approximation.

Furthermore, we can look at qq-plots for the samples (figure: quantile-quantile plots of each sample against the theoretical normal distribution). If we use this approximation, then statistical theory tells us that the distribution of the random variable tstat follows a t-distribution. This is a much more complicated distribution than the normal.

The t-distribution has a location parameter like the normal, and another parameter called degrees of freedom. R has a nice function, t.test, that actually computes everything for us. The p-value is slightly bigger now. This is to be expected, because our CLT approximation treated the denominator of tstat as practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable.
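A sketch of the same comparison using R's built-in t.test, which computes the t-statistic, degrees of freedom, and p-value for us; it assumes the same `treatment` and `control` vectors as above:

qqnorm(control); qqline(control)   # visually check the normality assumption
result <- t.test(treatment, control)
result$p.value                     # p-value under the t-distribution approximation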

The smaller the sample size, the more the denominator varies. It may be confusing that one approximation gave us one p-value and the other gave us a different one, because we expect there to be just one answer. However, this is not uncommon in data analysis. We used different assumptions and different approximations, and therefore we obtained different results. Later, in the power calculation section, we will describe type I and type II errors. As a preview, we will point out that the test based on the CLT approximation is more likely to incorrectly reject the null hypothesis (a false positive), while the t-distribution is more likely to incorrectly accept the null hypothesis (a false negative).

The arguments to t.test are simply the two samples being compared. We have described how to compute p-values, which are ubiquitous in the life sciences. However, we do not recommend reporting p-values as the only statistical summary of your results. The reason is simple: statistical significance does not guarantee scientific significance. With large enough sample sizes, one might detect a statistically significant difference in weight of, say, 1 microgram.

But is this an important finding? Would we say a diet results in higher weight if the increase is less than a fraction of a percent? The problem with reporting only p-values is that you will not provide a very important piece of information: the effect size. Recall that the effect size is the observed difference. Sometimes the effect size is divided by the mean of the control group and so expressed as a percent increase.

A much more attractive alternative is to report confidence intervals. A confidence interval includes information about your estimated effect size and the uncertainty associated with this estimate. Here we use the mice data to illustrate the concept behind confidence intervals. Before we show how to construct a confidence interval for the difference between the two groups, we will show how to construct a confidence interval for the population mean of control female mice.

We start by reading in the data and selecting the appropriate rows. The parameter of interest is the mean of this population, and we are interested in estimating it. In practice, we do not get to see the entire population, so, as we did for p-values, we demonstrate how we can use samples to do this. We know the sample average is a random variable, so it will not be a perfect estimate.

In fact, because in this illustrative example we know the value of the parameter, we can see that the two are not exactly the same. A confidence interval is a statistical way of reporting our finding, the sample average, in a way that explicitly summarizes the variability of our random variable. With a sample size of 30, we will use the CLT. The CLT tells us that the sample average is approximately normal with mean mu and standard error s / sqrt(N). This implies that the event that the sample average falls within two standard errors of mu, i.e. |mean(sample) - mu| <= 2 s / sqrt(N), has a probability of about 95%.

We can construct this interval with R relatively easily, and in this case it happens to cover the true mean. However, if we take another sample, we might not be as lucky.
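Here is a minimal sketch of the construction, assuming `chowPopulation` is the full control population (so we can also look up the true mean); the variable names are ours:

N <- 30
chow <- sample(chowPopulation, N)   # one random sample
se <- sd(chow) / sqrt(N)
Q <- qnorm(1 - 0.05 / 2)            # roughly 1.96
interval <- c(mean(chow) - Q * se, mean(chow) + Q * se)
interval
mean(chowPopulation)                # does the interval cover this value?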

Because we have access to the population data, we can confirm this by taking several new samples and plotting the resulting intervals (figure: in the plot, color denotes whether each interval covered the parameter or not). You can run this repeatedly to see what happens.

We can confirm this with a simulation. The problem is that the confidence interval was based on the CLT approximation, which is not reliable for small samples. This mistake affects us in the calculation of Q, which assumes a normal distribution and uses qnorm; the t-distribution might be more appropriate. All we have to do is re-run the above, but change how we calculate Q, using qt instead of qnorm.
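A sketch of that coverage simulation with a small sample size (N = 5), again assuming `chowPopulation` is available; swapping qnorm for qt compares the two interval constructions:

mu_chow <- mean(chowPopulation)
N <- 5
B <- 10000
coverage <- function(Q) {
  hits <- replicate(B, {
    chow <- sample(chowPopulation, N)
    se <- sd(chow) / sqrt(N)
    abs(mean(chow) - mu_chow) <= Q * se   # TRUE if the interval covers mu
  })
  mean(hits)
}
coverage(qnorm(1 - 0.05 / 2))            # CLT-based: falls short of 95%
coverage(qt(1 - 0.05 / 2, df = N - 1))   # t-based: much closer to 95%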

The confidence intervals are now based on the t-distribution approximation, and they are made bigger as a result. We recommend that, in practice, confidence intervals be reported instead of p-values.

If for some reason you are required to provide p-values, or required that your results be significant at the 0.05 level, confidence intervals still provide this information.

So we can form a confidence interval around the observed difference: if it does not include zero, the t-statistic must be more extreme than 2, which in turn implies that the p-value is smaller than 0.05. The same calculation can be made if we use the t-distribution instead of the CLT (with qt).

Note that the confidence interval for the difference d is provided directly by the t.test function. We have used the example of the effects of two different diets on the weight of mice, where we did not reject the null hypothesis. Did we make a mistake? By not rejecting the null hypothesis, are we saying the diet has no effect? The answer to this question is no. All we can say is that we did not reject the null hypothesis, but this does not necessarily imply that the null is true.

If you are doing scientific research, it is very likely that you will have to do a power calculation at some point. In many cases, it is an ethical obligation as it can help you avoid sacrificing mice unnecessarily or limiting the number of human subjects exposed to potential risk in a study. Here we explain what statistical power means. Whenever we perform a statistical test, we are aware that we may make a mistake.

This is why our p-values are not 0. Under the null, there is always a positive, perhaps very small, but still positive chance that we will reject the null when it is true. If we reject whenever the p-value is below 0.05, we will make this mistake 5% of the time when the null is true. This error is called a type I error by statisticians. A type I error is defined as rejecting the null when we should not.

This is also referred to as a false positive. So why do we then use 0.05? Could we not make the cutoff much smaller? The problem is that by doing so we lose the ability to detect true effects: failing to reject the null when it is false. This is called a type II error or a false negative. The R analysis above shows an example of a false negative: we did not reject the null hypothesis at the 0.05 level even though the null was false. Had we used a p-value cutoff of 0.25, we would not have made this mistake. However, in general, are we comfortable with a type I error rate of 1 in 4?

Usually we are not. Most journals and regulatory agencies insist that results be significant at the 0.01 or 0.05 levels. Part of the goal of this book is to give readers a good understanding of what p-values and confidence intervals are, so that these choices can be judged in an informed way.

Unfortunately, in science, these cut-offs are applied somewhat mindlessly, but that topic is part of a complicated debate. Power is the probability of rejecting the null when the null is false. Power depends on the true effect size; it also depends on the standard error of your estimates, which in turn depends on the sample size and the population standard deviations. Statistical theory gives us formulas to calculate power, and the pwr package performs these calculations for you.

Here we will illustrate the concepts behind power by coding up simulations in R. Suppose our sample size is 12. What is our power with this particular sample size? We will compute this probability by re-running the exercise many times and calculating the proportion of times the null hypothesis is rejected. The simulation is as follows: we take a sample of size N from both the control and treatment groups, perform a t-test comparing the two, and record whether the p-value is less than alpha or not.

We write a function that does this (a sketch appears below). Here is an example of one simulation for a sample size of 12: the estimated power is low, which explains why the t-test was not rejecting when we knew the null was false.
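A minimal sketch of the power simulation, assuming `controlPopulation` and `hfPopulation` hold the two full populations (names are ours):

N <- 12
alpha <- 0.05
reject <- function(N, alpha = 0.05) {
  control <- sample(controlPopulation, N)
  treatment <- sample(hfPopulation, N)
  t.test(treatment, control)$p.value < alpha   # TRUE if we reject
}
B <- 2000
mean(replicate(B, reject(N)))   # proportion of rejections = estimated power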

To guard against false positives at the 0.05 level, we demand strong evidence, and the price we pay is power. We will use the function sapply, which applies a function to each of the elements of a vector, to repeat the above for several sample sizes (a sketch follows). For each of the three simulations, the code returns the proportion of times we reject. Not surprisingly, power increases with N. Similarly, if we change the level alpha at which we reject, power changes: the smaller I want the chance of a type I error to be, the less power I will have.
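Power as a function of sample size, reusing reject() and B from the sketch above; the grid of three sample sizes is our choice, not the book's:

Ns <- c(12, 24, 48)
power <- sapply(Ns, function(N) mean(replicate(B, reject(N))))
power
plot(Ns, power, type = "b", xlab = "sample size N", ylab = "power")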

Another way of saying this is that we trade off between the two types of error (figure: power plotted against the p-value cut-off; note that the x-axis of the plot is on the log scale). To see this clearly, you could create a plot with curves of power versus N, showing several curves in the same plot with color representing the alpha level. Another consequence of what we have learned about power is that p-values are somewhat arbitrary when the null hypothesis is not true and therefore the alternative hypothesis is true (the difference between the population means is not zero).

When the alternative hypothesis is true, we can make a p-value as small as we want simply by increasing the sample size (supposing that we have an infinite population to sample from). We can show this property of p-values by drawing larger and larger samples from our population and calculating p-values. This works because, in our case, we know that the alternative hypothesis is true, since we have access to the populations and can calculate the difference in their means.

First we write a function that returns a p-value for a given sample size N (a sketch follows). We have a limit on the size of the high-fat diet population, but we can see the effect well before we reach it. For each sample size, we will calculate a few p-values; we can do this by repeating each value of N a few times. The actual p-values decrease as we increase the sample size whenever the alternative hypothesis is true.
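A sketch of this demonstration, again assuming the two full populations are available; the grid of sample sizes is our choice and its top must not exceed the population size:

calculatePvalue <- function(N) {
  control <- sample(controlPopulation, N)
  treatment <- sample(hfPopulation, N)
  t.test(treatment, control)$p.value
}
Ns <- rep(seq(10, 200, by = 10), each = 10)   # a few p-values per sample size
pvals <- sapply(Ns, calculatePvalue)
plot(Ns, pvals, log = "y", xlab = "sample size N", ylab = "p-value")
abline(h = c(0.05, 0.01), col = "red", lwd = 2)   # the standard cutoffs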

The standard cutoffs of 0.05 and 0.01 are, in this sense, arbitrary. It is important to remember that p-values do not become more interesting as they become very, very small. Once we have convinced ourselves to reject the null hypothesis at a threshold we find reasonable, having an even smaller p-value just means that we sampled more mice than was necessary. Therefore, a better statistic to report is the effect size with a confidence interval, or some statistic that gives the reader a sense of the change on a meaningful scale.

We can report the effect size as a percent by dividing the difference and the confidence interval by the control population mean. Alternatively, we can divide the difference by the standard deviation of the data; this tells us how many standard deviations the mean of the high-fat diet group is from the control group.
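Both scalings in a short sketch, with the `treatment` and `control` sample names assumed from the earlier sketches:

diff <- mean(treatment) - mean(control)
100 * diff / mean(control)   # percent increase over the control mean
diff / sd(control)           # difference in standard-deviation units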

Exercises. For these exercises we will load the babies dataset from babies.txt. We will use this data to review the concepts behind p-values and then test confidence interval concepts. This is a large dataset (over a thousand cases), and we will pretend that it contains the entire population in which we are interested.

We will study the differences in birth weight between babies born to smoking and non-smoking mothers. Now, we can look up the true population difference in means between smoking and non-smoking birth weights. The population difference of mean birth weights is about 8, and the standard deviations of the nonsmoking and smoking groups can be computed directly from the population. As we did with the mouse weight data, this assessment interactively reviews inference concepts using simulations in R.

We will treat the babies dataset as the full population and draw samples from it to simulate individual experiments. We will then ask whether somebody who only received the random samples would be able to draw correct conclusions about the population. We are interested in testing whether the birth weights of babies born to non-smoking mothers are significantly different from those of babies born to smoking mothers. Compute the t-statistic (call it tval).

The standard procedure is to examine the probability that a t-statistic which actually does follow the null hypothesis would have a larger absolute value than the absolute value of the t-statistic we just observed; this is called a two-sided test. We can compute this by taking one minus the area under the standard normal curve between -abs(tval) and abs(tval).

In R, we can do this by using the pnorm function, which computes the area under a normal curve from negative infinity up to the value given as its first argument. Because of the symmetry of the standard normal distribution, there is a simpler way to calculate the probability that a t-value under the null could have a larger absolute value than tval (see the sketch below).
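Two equivalent ways to compute the two-sided p-value for an observed tval (assumed computed as above):

pval <- 1 - (pnorm(abs(tval)) - pnorm(-abs(tval)))  # one minus the middle area
pval <- 2 * pnorm(-abs(tval))                       # twice the lower tail, by symmetry
pval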

By reporting only p-values, many scientific publications provide an incomplete story of their findings. As we have mentioned, with very large sample sizes, scientifically insignificant differences between two groups can lead to small p-values.

Confidence intervals are more informative, as they include the estimate itself. Our estimate of the difference between babies of smokers and non-smokers is the difference in sample means. Why are the values from exercises 4 and 5 so similar? No matter which way you compute it, the p-value pval is the probability that the null hypothesis could have generated a t-statistic more extreme than what we observed: tval. If the p-value is very small, this means that observing a value more extreme than tval would be very rare if the null hypothesis were true, and would give strong evidence that we should reject the null hypothesis.

We determine how small the p-value needs to be to reject the null by deciding how often we would be willing to mistakenly reject the null hypothesis; rejecting whenever the p-value falls below a cutoff alpha gives a type I error rate of exactly alpha, a fact that is not immediately obvious and requires some probability theory to show. We will see a number of decision rules that we use in order to control the probabilities of other types of errors. Which of the following sentences about a Type I error is not true? In the simulation we have set up here, we know the null hypothesis is false: the true value of the difference in means is actually around 8.

Thus, we are concerned with how often the decision rule outlined in the last section allows us to conclude that the null hypothesis is actually false. In other words, we would like to quantify the Type II error rate of the test, or the probability that we fail to reject the null hypothesis when the alternative hypothesis is true. The alternative hypothesis only states that the difference in means is not zero; it thus does not nail down a specific distribution for the t-value under the alternative.

For this reason, when we study the Type II error rate of a hypothesis testing procedure, we need to assume a particular effect size, or hypothetical size of the difference between population means, that we wish to target. Power is one minus the Type II error rate, or the probability that you will reject the null hypothesis when the alternative hypothesis is true.

There are several aspects of a hypothesis test that affect its power for a particular effect size. This means that for an experiment with fixed parameters (i.e., fixed sample size and effect size), lowering the Type I error rate lowers the power. We can explore this trade-off of power and Type I error concretely using the babies data. Since we have the full population, we know what the true effect size is (about 8).

What is the p-value (use the t.test function)? The p-value is larger than 0.05, so we do not reject. This is a type II error. Which of the following is not a way to decrease this type of error? Set the seed at 1, then use the replicate function to repeat the code used in exercise 9 a total of 10,000 times. What proportion of the time do we reject at the 0.05 level? Repeat the exercise above for sample sizes of 30, 60, 90 and 120.

Computers can be used to generate pseudo-random numbers. For practical purposes, these pseudo-random numbers can be used to imitate random variables from the real world.

This permits us to examine properties of random variables using a computer instead of theoretical or analytical derivations.

One very useful aspect of this concept is that we can create simulated data to test out ideas or competing methods, without actually having to perform laboratory experiments. Simulations can also be used to check theoretical or analytical results.

Also, many of the theoretical results we use in statistics are based on asymptotics: they hold when the sample size goes to infinity. In practice, we never have an infinite number of samples, so we may want to know how well the theory works with our actual sample size. Sometimes we can answer this question analytically, but not always. Simulations are extremely useful in these cases. We will build a function that automatically generates a t-statistic under the null hypothesis for any sample size n.
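A sketch of such a generator. The book builds it by sampling from real population data; here, lacking that data, we draw both groups from a normal population with made-up parameters (mu = 24 and sigma = 3.5 are our assumptions), which still illustrates the mechanics:

ttestgenerator <- function(n, mu = 24, sigma = 3.5) {
  cases <- rnorm(n, mu, sigma)      # both groups drawn under the null
  controls <- rnorm(n, mu, sigma)
  (mean(cases) - mean(controls)) /
    sqrt(var(cases) / n + var(controls) / n)   # the t-statistic
}
ttests <- replicate(1000, ttestgenerator(10))
hist(ttests)                    # a glimpse of its distribution
qqnorm(ttests); abline(0, 1)    # compare to the standard normal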

With 1,000 Monte Carlo simulated occurrences of this random variable, we can now get a glimpse of its distribution. So is the distribution of this t-statistic well approximated by the normal distribution? In the next chapter, we will formally introduce quantile-quantile plots, which provide a useful visual inspection of how well one distribution approximates another.

As we will explain later, if the points fall on the identity line, it means the approximation is a good one (figure: quantile-quantile plot comparing Monte Carlo simulated t-statistics to the theoretical normal distribution). This looks like a very good approximation. For this particular population, a sample size of 10 was large enough to use the CLT approximation. How about a sample size of 3? (Figure: quantile-quantile plot comparing Monte Carlo simulated t-statistics, sample size three, to the theoretical normal distribution.)

Now we see that the large quantiles, referred to by statisticians as the tails, are larger than expected (below the line on the left side of the plot and above the line on the right side of the plot).

In the previous module, we explained that when the sample size is not large enough and the population values follow a normal distribution, then the t-distribution is a better approximation.
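We can check this by comparing the simulated t-statistics (n = 3) to t-distribution quantiles instead of normal ones, reusing ttestgenerator from the sketch above:

ttests <- replicate(1000, ttestgenerator(3))
ps <- (seq_along(ttests) - 0.5) / length(ttests)   # probability grid
qqplot(qt(ps, df = 2 * 3 - 2), ttests,
       xlab = "t-distribution quantiles", ylab = "simulated t-statistics")
abline(0, 1)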

(Figure: quantile-quantile plot comparing the simulated t-statistics, sample size three, to the theoretical t-distribution.) The t-distribution is a much better approximation in this case, but it is still not perfect. This is due to the fact that the original data are not that well approximated by the normal distribution (figure: quantile-quantile plot of the original data against theoretical normal quantiles). The technique we used to motivate random variables and the null distribution was a type of Monte Carlo simulation.

We had access to population data and generated samples at random from it. In practice, we do not have access to the entire population; we used this approach here for educational purposes. When we want to use Monte Carlo simulations in practice, it is much more typical to assume a parametric distribution and generate a population from it, which is called a parametric simulation.

This means that we take parameters estimated from the real data (here, the mean and the standard deviation) and plug them into a model (here, the normal distribution).

In this XSeries, you will gain the tools to analyze and interpret life sciences data. You will learn the basic statistical concepts and R programming skills necessary for analyzing real data. R is free, open-source statistical software and one of the most widely used data analysis platforms among academic statisticians.

Taught by Rafael Irizarry from the Harvard T.H. Chan School of Public Health, who for the past 15 years has focused on the analysis of genomics data, this XSeries is perfect for anyone in the life sciences who wants to learn how to analyze data. Problem sets require coding in the R language to ensure that learners fully grasp and master key concepts.

Data Analysis for Life Sciences

A free PDF copy is available from Leanpub. The book was created using the R markdown language and we make all this code available from GitHub.
