Modern Statistical Thinking for Biologists

Modern statistical thinking for biologists (online 31 Aug - 14 Dec 2023)

This beginners' course will build up your intuition about different types of data, and the best way to analyse them. It follows an innovative didactic approach, designed to match students' natural intuitions about scientific questions.

Interested in this course? Email us at [email protected]

Registration has closed.

A pre-course seminar took place on 30 June 2023. Watch the recording to find out more about the content and format of this course.

A statistics course is one of the most common elements of training for researchers in biology and biomedicine. Despite this, many scientists struggle enormously when they need to analyse their own data. At its worst, data analysis becomes nothing more than a dull exercise in pressing buttons in a statistics package, with a constant nagging doubt as to what the buttons really do and whether they are the right buttons to press.

These difficulties are partly linked to the way many introductory statistics courses are taught, with a focus on memorising lists of tests rather than on conceptual understanding, and with few opportunities to practise on real-life data sets. The problem is compounded by biomedicine’s unhealthy fixation on P-values – a concept that is so unintuitive that it is frequently misunderstood and misapplied.

In this course, we will learn introductory data analysis in a manner that is closer to the way that professional statisticians think about data. We will focus on obtaining a good conceptual understanding of common types of analyses, and will apply them to real data using R. The maths will be kept to an absolute minimum. As a result, the course has no pre-requisites in terms of maths or statistics skills.

Unlike most introductory courses, this one covers not only frequentist tools, like P-values, but also Bayesian methods. Bayesian statistics is a powerful approach to data analysis that is becoming increasingly common in biology and biomedicine. Compared to more traditional approaches, it also more closely follows the way that scientists intuitively think about their research questions. As a result, many students find that Bayesian concepts are easier to learn.

We will learn using a combination of short lectures, group discussions and hands-on activities on real data. There will also be weekly individual assignments. The assignments are a crucial component of the course because you will receive individual written feedback each time, which lets you keep track of your progress. The shortest assignments should take only about 15 minutes, whereas the longest may take up to 2 hours to complete. In addition, we will dedicate one session to a journal club where we will practise reading real scientific papers that use Bayesian methods. Finally, two sessions (one towards the middle of the course and one at the end) have been set aside for group projects, where students will apply everything they have learned to new, unseen data sets.

The course is divided into three large units. In the first, we will learn how to visualise and summarise data so as not to miss important scientific insights. In the second, we will build up our statistical thinking skills through the framework of Bayesian statistics. Finally, in the third unit, we will learn about statistical significance and will apply common statistical tests.

Although this course is open to students and researchers from all the natural and social sciences, we will focus on data sets and types of analyses that are particularly relevant to biology and biomedicine.

After completing this course, you will be able to…

…use data visualisation and clustering techniques to detect important patterns in your data.

…recognise different types of data (e.g. counts or percentages) and navigate the difficulties inherent to the analysis of each type.

…draw inferences based on your sample data. For instance, if in a sample of 50 individuals, 10 were infected with COVID-19 at some point during the last year, then what can you conclude about the prevalence of COVID-19 in the whole population? And how much confidence can you have in this conclusion?

…use regression modelling to study relationships between variables.

…understand what is meant by Bayesian statistics, and how this differs from classical statistics.

…interpret P-values appropriately, and avoid common pitfalls associated with the use of P-values.

…conduct and understand common parametric and non-parametric statistical tests.
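
To give a flavour of the kind of question posed above (10 infected individuals in a sample of 50), here is a minimal sketch of one possible Bayesian answer. It is an illustration only, not the course's own material: the course itself uses R, whereas this sketch uses plain Python, and the flat prior, the grid of candidate prevalence values and the function names `posterior_on_grid` and `credible_interval` are all assumptions made for the example.

```python
# Grid approximation of the posterior for a prevalence (standard library only).
# Assumption: a flat prior, i.e. every prevalence between 0 and 1 is equally
# plausible before seeing the data. Function names are hypothetical.

def posterior_on_grid(infected, sample_size, grid_points=1001):
    """Return (grid, posterior) for a binomial likelihood and a flat prior."""
    grid = [i / (grid_points - 1) for i in range(grid_points)]
    healthy = sample_size - infected
    # Unnormalised posterior = likelihood x prior; the flat prior is a constant,
    # and the binomial coefficient cancels out when we normalise.
    unnormalised = [p**infected * (1 - p)**healthy for p in grid]
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]
    return grid, posterior


def credible_interval(grid, posterior, mass=0.95):
    """Equal-tailed interval: leave (1 - mass) / 2 of the probability in each tail."""
    tail = (1 - mass) / 2
    cumulative = 0.0
    lower = upper = None
    for p, weight in zip(grid, posterior):
        cumulative += weight
        if lower is None and cumulative >= tail:
            lower = p
        if upper is None and cumulative >= 1 - tail:
            upper = p
    return lower, upper


grid, posterior = posterior_on_grid(infected=10, sample_size=50)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
low, high = credible_interval(grid, posterior)
print(f"Posterior mean prevalence: {posterior_mean:.3f}")
print(f"95% credible interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the posterior mean sits close to 11/52 (roughly 0.21), slightly above the raw sample proportion of 0.20, and the width of the interval expresses how much uncertainty a sample of only 50 individuals leaves about the population prevalence.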

Format: weekly 2.5-hour Zoom sessions on Thursdays, from 3pm to 5.30pm Lisbon time.

Pre-requisites: The course requires very basic skills in R. If you have no previous R experience, you should complete this free online R course, or a similar course, prior to the first session. The free course shouldn’t take you more than a few hours to do, and comes with a discussion forum, where you can always ask for help.

15 spots are available on a first-come, first-served basis. The course content is under constant development, so the final syllabus may differ slightly from the one shown here. If you have any questions, don’t hesitate to drop me a line at [email protected].

Course curriculum

1. Welcome!
2. Try out the discussion forum
3. Practical details
4. Installing R
5. Installing RStudio
6. Just to get to know you a bit better...
7. Assignment durations
8. Textbooks
9. TO DO FOR SESSION 1: Download data files

Unit 1.1
1. Data detectives: why are these graphs misleading?
2. Why is data visualisation important?
3. Describing the properties of human genes.
4. Mean and median: what is a typical value for this variable?
5. A few common types of graphs.
6. Standard deviation, variance, interquartile range: how much variability is there around the typical value?
7. EXERCISE: First group activity
8. EXERCISE: Practice summary statistics
9. Unit 1.1 code
10. Recording 31st August 2023
11. Unit 1.1. slides
12. Extra materials 1.1.
13. TO DO FOR UNIT 1.2: download data and install packages
14. Quiz 1.1
15. Assignment 1.1

Unit 1.2
1. Can biomarker profiles be used to subgroup COVID-19 patients?
2. Exercise: Explore data
3. Biomarker abbreviations
4. Hierarchical clustering and heatmaps.
5. Exercise: the effect of the linkage method
6. Principal Component Analysis (PCA): which properties are the most useful for explaining the variability between patients?
7. Unit 1.2. code
8. Unit 1.2. slides
9. TO DO FOR 1.3: download file
10. Extra materials for Unit 1.2.
11. Quiz 1.2
12. Recording 7 Sep 2023
13. Assignment 1.2

Unit 1.3
1. Biological data comes in many flavours.
2. Representing counts through discrete probability distributions.
3. Representing other kinds of variables through continuous probability distributions.
4. Exercise: match each variable with its description!
5. Unit 1.3. slides
6. Unit 1.3. code
7. TO DO FOR UNIT 1.4: download data file
8. Extra materials 1.3
9. Recording 14 Sep 2023
10. Quiz 1.3
11. Assignment 1.3.
12. Quiz 1.3 (2)
13. Recording 21 Sep 2023

Unit 2.1
1. What do we mean when we talk about "estimating a parameter"?
2. What is the true population sex ratio in possums? The Bayesian logic for figuring out which parameter values are likely.
3. EXERCISE: The effect of the prior
4. A bit of history: what is "Bayesian" statistics and why is it usually not taught in introductory courses?
5. TO DO FOR NEXT WEEK: download female heights
6. TO DO FOR NEXT WEEK: install packages
7. Assignment 2.1
8. Unit 2.1. slides
9. Recording 28 Sep 2023
10. Unit 2.1. code
11. Extra materials 2.1.
12. Quiz 2.1.

Unit 2.2
1. How tall is the typical American woman? And how much variability do we expect around that typical value?
2. EXERCISE: Both of our parameters need a prior
3. Thinking about our problem as a model.
4. Markov Chain Monte Carlo: a clever tool for parameter estimation.
5. EXERCISE: explore the posterior distribution
6. Quantifying our uncertainty about the likely parameter values.
7. TO DO: get heights and weights
8. Assignment 2.2
9. Unit 2.2. slides
10. Unit 2.2. code
11. Leave your e-mails!
12. Extra materials 2.2.
13. Quiz 2.2.
14. Assignment 2.2. (2)
15. Recording 5 Oct 2023

Unit 2.3
1. Capturing the relationship between two variables: can we predict a person's weight from their height?
2. Predicting new data based on the estimated parameters.
3. EXERCISE: Different models
4. EXERCISE: Inspecting the results
5. Unit 2.3. code
6. Recording 12 October 2023
7. Unit 2.3. slides
8. TO DO FOR 2.4: download data
9. Quiz 2.3.
10. Assignment 2.3

Unit 2.4
1. What determines the price of legos?
2. Data detectives: what's wrong with this graph?
3. Dealing with numerical and categorical predictor variables.
4. Interactions between predictors
5. EXERCISE: Posterior predictions
6. Unit 2.4. slides
7. Unit 2.4. code
8. Recording 19 Oct 2023
9. TO DO FOR 2.5: download files
10. Extra materials 2.4.
11. Quiz 2.4
12. Assignment 2.4

Unit 2.5
1. Inflammation and schizophrenia (case study of RNA-seq data)
2. Predicting COVID-19 survival (case study of biomarker data)
3. EXERCISE: Modelling gene expression
4. Unit 2.5. slides
5. Recording 26 Oct 2023
6. Unit 2.5. code
7. Extra materials 2.5.
8. Quiz 2.5.
9. Assignment 2.5

Unit 2.6
1. Why build several models for the same problem?
2. Comparing between models.
3. Unit 2.6. slides
4. Extra materials 2.6.
5. Quiz 2.6.
6. Assignment 2.6

Unit 2.7
1. Reading and discussing real research papers that use Bayesian methods.
2. EXERCISE: Journal club
3. Recording 9 Nov 2023
4. Assignment 2.7.

Unit 2.8
1. The students apply the concepts and methods learned to a new data set.
2. "Heart project" files
3. "Travel speed project" files
4. Recording 16 Nov
5. Assignment 2.8.

Unit 3.1
1. What is frequentist statistics?
2. Rock-paper-scissors: a game of luck or a game of skill?
3. Standard error and confidence intervals.
4. P-values, and why they are useful. P-values, and why they are dangerous.
5. Recording 23 November 2023
6. Unit 3.1. slides
7. Unit 3.1. code
8. Extra materials 3.1.
9. Quiz 3.1.
10. Assignment 3.1

Unit 3.2
1. The normal model: normal linear regression, t-test and ANOVA.
2. Download data
3. Frequentist linear regression
4. Non-parametric alternatives.
5. Testing proportions: binomial test, Fisher's Exact Test (FET), chi-squared test.
6. Unit 3.2. slides
7. Recording 30 November 2023
8. TO DO: install package
9. Quiz 3.2.
10. Assignment 3.2.
11. EXERCISE: Practise frequentist linear regression some more
12. Unit 3.2. code
13. Recording 7 Dec 2023 (Part I)
14. Recording 7 Dec 2023 (Part II)
15. Quiz 3.2. (II)
16. TO DO: install package
17. Assignment 3.2. (2)
18. EXERCISE: Central Limit Theorem
19. Temperature data
20. Extra materials 3.2.

Unit 3.3
1. Application of concepts and tools learned to a new dataset
2. Feedback
3. Your thoughts on the course
4. Assignment 3.3.

About this course

€400,00
157 lessons
34.5 hours of video content