4 Learning to use R statistical software for data mining — Putting it all together
Provided by Kenan Fellows Program.

Students will mine data to identify the variables that most significantly impact the selling price of a home.
In Algebra I, students study data sets with one predictor variable and one response variable. In the real world, however, most response variables depend on numerous predictor variables, many of which may have a significant impact on the response. These variables may also differ in the size of their effects, so it is important to identify those effects and use them appropriately to make sound, valid predictions.
In this lesson, students will create their own data set and then use the free R statistical software to mine it, looking for the variables that most significantly impact the selling price of a home. Students will use three methods of variable selection (forward selection, backward selection, and stepwise selection) to determine which variables are most influential.
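The three selection methods named here can be sketched in R with the built-in `step()` function. This is only an illustration under stated assumptions: the data are simulated rather than real listings, the column names (`sqft`, `bedrooms`, `bathrooms`, `age`, `price`) are hypothetical, and `step()` chooses variables by AIC, which may differ from the criterion used in the student worksheet.

```r
# Simulated, purely illustrative data (NOT real home listings).
# All column names are hypothetical stand-ins for what students might collect.
set.seed(1)
n <- 40
homes <- data.frame(
  sqft      = round(runif(n, 900, 3500)),
  bedrooms  = sample(2:5, n, replace = TRUE),
  bathrooms = sample(1:4, n, replace = TRUE),
  age       = sample(0:60, n, replace = TRUE)
)
# Assumed relationship, used only to generate example prices.
homes$price <- 50000 + 110 * homes$sqft + 8000 * homes$bathrooms +
  rnorm(n, mean = 0, sd = 15000)

null_model <- lm(price ~ 1, data = homes)  # no predictors yet
full_model <- lm(price ~ ., data = homes)  # every predictor included

# Range of models the search may visit: from intercept-only up to the full model.
search_range <- list(lower = ~ 1, upper = formula(full_model))

# Forward: start empty, add variables one at a time while AIC improves.
forward <- step(null_model, scope = search_range,
                direction = "forward", trace = 0)

# Backward: start full, drop variables one at a time while AIC improves.
backward <- step(full_model, direction = "backward", trace = 0)

# Stepwise: start empty; at each step a variable may be added or removed.
stepwise <- step(null_model, scope = search_range,
                 direction = "both", trace = 0)

print(formula(forward))
print(formula(backward))
print(formula(stepwise))
```

Setting `trace = 0` hides the step-by-step log. Leaving it at its default prints each candidate model, which may be more useful when students need to see why a variable was added or dropped.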
Learning outcomes
At the end of this lesson, students should be able to use R to:
- Identify influential variables in a multivariate data set using forward selection, backward selection, and stepwise selection
- Develop linear models that can be used to make predictions
Students should also be able to:
- Explain the purpose of data mining
- Understand the basic principles behind the forward, backward, and stepwise selection processes
- Identify some real-world uses of data mining
- Use linear models to make predictions
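As a rough sketch of the last outcome, a linear model with more than one predictor can be fitted with `lm()` and used for prediction with `predict()`. The values below are made up for illustration, and the column names and the example home are hypothetical.

```r
# Made-up values for illustration only; column names are hypothetical.
homes <- data.frame(
  sqft     = c(1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300),
  bedrooms = c(2, 3, 3, 4, 3, 4, 5, 4),
  price    = c(165000, 198000, 229000, 262000, 285000, 320000, 356000, 378000)
)

# Fit price as a linear function of two predictors.
model <- lm(price ~ sqft + bedrooms, data = homes)
print(coef(model))  # intercept and one slope per predictor

# Predict the price of a hypothetical home not in the data set.
new_home <- data.frame(sqft = 2000, bedrooms = 3)
predicted_price <- predict(model, newdata = new_home)
print(predicted_price)
```

The prediction is the intercept plus each coefficient multiplied by the matching value in `new_home`, which students can verify by hand from the printed coefficients.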
Teacher planning
Time Required
This lesson requires one block period or two traditional periods (approximately ninety minutes in total). This is the third lesson in a set of three.
Student handouts
- The Basics of R handout (PDF, 23 KB, 1 page; also available as a Microsoft Word document)
- Putting It All Together: Worksheet 3 (PDF, 31 KB, 3 pages; also available as a Microsoft Word document)
Technology resources
- One computer per student with internet access
- One computer with projector for teacher
- R statistical software
- Microsoft Excel
- MLS Search Engine (I recommend the Fonville-Morisey website for homes in the Triangle of North Carolina.)
- Printing capability
Pre-activities
Prior to beginning this lesson, students should be able to:
- Distinguish between independent and dependent variables
- Create scatter plots and describe correlation
- Use a graphing calculator to find linear regression models for data sets
- Use R statistical software to create scatter plots, to calculate linear models, to determine correlation of variables, and to complete basic data mining tasks
Prior to implementing this lesson, teachers should:
- Become familiar with basic R commands
- Develop data sets that may be used for demonstration purposes, if necessary
- Ensure that R (free software) and Microsoft Excel are installed on student computers
- Understand the basics of variable selection
Activities
Distribute the Putting It All Together worksheet to students and lead them through the overall outline of the assignment. It may be helpful to go through the initial search engine setup with students so they can easily find the information they need to create their own data sets. At this point in the instructional unit, students should be able to use their previous lesson worksheets to guide them through this assignment independently. At the end of the lesson, students should print a copy of their R script (including all commands used) and attach it to their worksheets.
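One possible way to move a student-built data set from Excel into R is to save the spreadsheet as a CSV file and read it with `read.csv()`. The sketch below assumes that workflow, which may differ from the one on the worksheet. It writes a small stand-in file so that it runs on its own, and the rows and column names are made up.

```r
# Stand-in for a file a student exported from Excel as CSV.
# The rows are made up for illustration; column names are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "price,sqft,bedrooms,bathrooms",
  "215000,1650,3,2",
  "289000,2300,4,3",
  "174500,1200,2,1"
), csv_path)

# Read the file into a data frame; the first row supplies the column names.
homes <- read.csv(csv_path)

# Quick checks before any modeling: column types and a summary of the response.
str(homes)
print(summary(homes$price))
```

In class, `csv_path` would instead point to the student's own saved file, for example by using `file.choose()` to pick it interactively.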
Assessment
To assess students’ understanding of the material, you could:
- Collect the lesson packets and check students’ data tables, short answer responses, and R scripts with code
- Ask students to develop their own ideas of other situations in which it would be helpful to identify the most influential variables (other times when data mining could be useful)
- Ask students to describe the usefulness of R
Modifications
This activity can be used in almost any type of classroom setting. It may be helpful to pair students with learning disabilities or English language learners with higher-achieving students. Also, identify helpers — those students who quickly catch on to how the R environment works and can help you keep other students on track. One teacher trying to help thirty students write programming code can be a frustrating experience for everyone involved.
Critical vocabulary
- data mining
- a method commonly used to extract useful information from large sets of data
- forward selection
- a method of variable selection in which no variables are initially considered, but then are added if found to be relevant to a given situation
- backward selection
- a method of variable selection in which all variables are initially considered, but then are eliminated if found to be irrelevant to a given situation
- stepwise selection
- a method of variable selection similar to forward selection, except that variables added at an early step may later be deleted in an attempt to identify the most relevant variables
- analysis of variance
- a technique used to identify the effect of variables on a response
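Analysis of variance can be illustrated in R by comparing two nested linear models with `anova()`. This is a minimal sketch on simulated data with hypothetical column names, not output from real listings. It tests whether adding `sqft` to a model that already contains `bedrooms` explains significantly more of the variation in `price`.

```r
# Simulated, purely illustrative data (NOT real home listings).
set.seed(42)
n <- 30
homes <- data.frame(
  sqft     = round(runif(n, 900, 3500)),
  bedrooms = sample(2:5, n, replace = TRUE)
)
# Assumed relationship, used only to generate example prices.
homes$price <- 60000 + 100 * homes$sqft + rnorm(n, mean = 0, sd = 20000)

smaller <- lm(price ~ bedrooms, data = homes)         # without sqft
larger  <- lm(price ~ bedrooms + sqft, data = homes)  # with sqft

# F-test comparing the two nested models.
comparison <- anova(smaller, larger)
print(comparison)

# A small p-value suggests sqft has a real effect on the response.
p_value <- comparison[["Pr(>F)"]][2]
print(p_value)
```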
Comments
This instructional unit was created through the Kenan Fellows Program as part of its goal of creating innovative curricula for K–12 teachers to use in classrooms across the United States. My work in data mining was supported by my mentors, Dr. Hao Helen Zhang and Dr. Yufeng Liu. Dr. Zhang is a professor of statistics at North Carolina State University in Raleigh, North Carolina. Dr. Liu is a professor of statistics at the University of North Carolina in Chapel Hill, North Carolina.
North Carolina curriculum alignment
Mathematics (2004)
Grade 9–12 — Advanced Placement Statistics
- Goal 4: Algebra - The learner will analyze bivariate data to solve problems.
- Objective 4.01: Analyze bivariate data.
- Recognize and analyze correlation and linearity.
- Determine the least squares regression line.
- Create residual plots and identify outliers and influential points to analyze data.
- Use logarithmic and power transformations to analyze data.
Grade 9–12 — Algebra 1
- Goal 3: Data Analysis and Probability - The learner will collect, organize, and interpret data with matrices and linear models to solve problems.
- Objective 3.03: Create linear models for sets of data to solve problems.
- Interpret constants and coefficients in the context of the data.
- Check the model for goodness-of-fit and use the model, where appropriate, to draw conclusions or make predictions.
Grade 9–12 — Algebra 2
- Goal 2: Algebra - The learner will use relations and functions to solve problems.
- Objective 2.04: Create and use best-fit mathematical models of linear, exponential, and quadratic functions to solve problems involving sets of data.
- Interpret the constants, coefficients, and bases in the context of the data.
- Check the model for goodness-of-fit and use the model, where appropriate, to draw conclusions or make predictions.
Grade 9–12 — Discrete Mathematics
- Goal 1: Number and Operations - The learner will use matrices and graphs to model relationships and solve problems.
- Objective 1.01: Use matrices to model and solve problems.
- Display and interpret data.
- Write and evaluate matrix expressions to solve problems.
Grade 9–12 — Integrated Mathematics 1
- Goal 3: Data Analysis and Probability - The learner will analyze data and apply probability concepts to solve problems.
- Objective 3.03: Create linear and exponential models, for sets of data, to solve problems.
- Interpret the constants, coefficients, and bases in the context of the data.
- Check the model for goodness-of-fit and use the model, where appropriate, to draw conclusions or make predictions.
Grade 9–12 — Integrated Mathematics 2
- Goal 3: Data Analysis and Probability - The learner will collect, organize, and interpret data to solve problems.
- Objective 3.02: Create and use, for sets of data, calculator-generated models of linear, exponential, and quadratic functions to solve problems.
- Interpret the constants, coefficients, and bases in the context of the data.
- Check the model for goodness-of-fit and use the model, where appropriate, to draw conclusions or make predictions.
Grade 9–12 — Integrated Mathematics 4
- Goal 3: Data Analysis and Probability - The learner will analyze data to solve problems.
- Objective 3.02: Create and use calculator-generated models of linear, polynomial, exponential, trigonometric, power, logistic, and logarithmic functions of bivariate data to solve problems.
- Interpret the constants, coefficients, and bases in the context of the data.
- Check models for goodness-of-fit; use the most appropriate model to draw conclusions or make predictions.
Grade 9–12 — Pre-Calculus
- Goal 2: Algebra - The learner will use relations and functions to solve problems.
- Objective 2.03: For sets of data, create and use calculator-generated models of linear, polynomial, exponential, trigonometric, power, logistic, and logarithmic functions.
- Interpret the constants, coefficients, and bases in the context of the data.
- Check models for goodness-of-fit; use the most appropriate model to draw conclusions or make predictions.
Grade 9–12 — Technical Mathematics 2
- Goal 2: Algebra - The learner will use relations and functions to solve problems.
- Objective 2.03: Create, interpret, and analyze best-fit models of linear, exponential, and quadratic functions to solve problems.
- Interpret the constants, coefficients, and bases in the context of the data.
- Check the model for goodness-of-fit and use the model, where appropriate, to draw conclusions or make predictions.




