Create scatter plots and analyze correlations between two variables with regression analysis.
Correlation coefficient (r): 0.9512
Very strong positive correlation. As x increases, y tends to increase.
Regression analysis
Line of best fit: y = 0.9048x + 1.4286
Slope (m): 0.9048
Y-intercept (b): 1.4286
Correlation statistics
Correlation coefficient (r): 0.9512
R-squared (R²): 0.9048
Coefficient of determination: 90.4762%
Descriptive statistics
Number of points (n): 8
Mean of x: 4.5
Mean of y: 5.5
Min x: 1
Max x: 8
Min y: 2
Max y: 9
R² = 0.9048 means that 90.4762% of the variance in y can be explained by the linear relationship with x.
What is a scatter plot?
A scatter plot (also called a scatter graph, scatter diagram, or XY plot) is a type of data visualization that displays the relationship between two numerical variables. Each data point is represented as a dot positioned according to its x and y coordinates. By examining the pattern of dots, you can identify correlations, trends, and outliers in your data.
Scatter plots are fundamental tools in statistics, science, and business analytics. They help answer questions like: Does studying more lead to higher test scores? Is there a relationship between advertising spend and sales? Do taller people tend to weigh more?
How to read a scatter plot
When analyzing a scatter plot, look for these key patterns:
| Pattern | Description | Correlation |
| --- | --- | --- |
| Upward slope (left to right) | As x increases, y increases | Positive |
| Downward slope (left to right) | As x increases, y decreases | Negative |
| No clear pattern | Points scattered randomly | None or weak |
| Tight clustering around a line | Strong linear relationship | Strong |
| Wide spread around trend | Weak linear relationship | Weak |
Positive correlation
When the dots form a pattern that rises from left to right, the variables have a positive correlation. Examples include:
Hours studied vs. exam scores
Exercise frequency vs. fitness level
Years of experience vs. salary
Negative correlation
When the dots form a pattern that falls from left to right, the variables have a negative correlation. Examples include:
Price vs. quantity demanded
Age of car vs. resale value
Distance from city center vs. property price
No correlation
When dots are scattered randomly with no discernible pattern, there may be no linear relationship between the variables.
The correlation coefficient (r)
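The Pearson correlation coefficient, denoted as r, quantifies the strength and direction of a linear relationship between two variables. It ranges from -1 to +1: a value of +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship.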
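As an illustration of how r can be computed, here is a minimal Python sketch that uses the standard deviation-based form of the Pearson formula. The function name `pearson_r` and the sample data are hypothetical and are not the data behind the results shown at the top of this page.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical example data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
print(round(pearson_r(hours, scores), 4))
```

A result close to +1 here would indicate a strong positive linear relationship between the two lists.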
Linear regression
Linear regression finds the straight line that best fits your data points. This line, also called the trend line or line of best fit, is expressed as:
$$y = mx + b$$
Where:
y is the predicted value
x is the independent variable
m is the slope (how much y changes for each unit change in x)
b is the y-intercept (the value of y when x = 0)
Calculating slope and intercept
The least squares method minimizes the sum of squared vertical distances between data points and the line. The formulas are:
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
$$b = \bar{y} - m\bar{x}$$
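The least-squares formulas for the slope and intercept can be sketched in a few lines of Python. The function name `fit_line` and the data are hypothetical; the points were chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denom   # slope
    b = sum_y / n - m * sum_x / n              # intercept: mean(y) - m * mean(x)
    return m, b

# Hypothetical points that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = fit_line(xs, ys)
print(f"y = {m:.4f}x + {b:.4f}")  # prints "y = 2.0000x + 1.0000"
```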
R-squared: coefficient of determination
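R-squared (R²) tells you what proportion of the variance in the dependent variable (y) is explained by the linear relationship with the independent variable (x). In simple linear regression, where there is a single independent variable, it equals the square of the correlation coefficient: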
$$R^2 = r^2$$
For example:
If r = 0.8, then R² = 0.64, meaning 64% of the variance in y is explained by x
If r = -0.9, then R² = 0.81, meaning 81% of the variance in y is explained by x
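To make the link between r and R² concrete, the following sketch computes R² two ways for a small hypothetical dataset: as the square of r, and as 1 minus the ratio of residual variation to total variation around the fitted line. For a least-squares line with one independent variable, the two values should agree up to floating-point rounding.

```python
from math import sqrt

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 3, 5, 4, 6, 7]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Correlation coefficient
r = sxy / sqrt(sxx * syy)

# Least-squares line (deviation form of the slope formula)
m = sxy / sxx
b = mean_y - m * mean_x

# R-squared from the residuals: 1 - SS_res / SS_tot
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
ss_tot = syy
r_squared = 1 - ss_res / ss_tot

print(f"r^2 = {r ** 2:.4f}, 1 - SS_res/SS_tot = {r_squared:.4f}")
```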
Interpreting R-squared
| R² value | Interpretation |
| --- | --- |
| 0.9 - 1.0 | Excellent fit |
| 0.7 - 0.9 | Good fit |
| 0.5 - 0.7 | Moderate fit |
| 0.3 - 0.5 | Weak fit |
| 0 - 0.3 | Poor fit |
How to create an effective scatter plot
Step 1: Prepare your data
Organize your data into pairs of x and y values. Ensure:
Both variables are numerical
Each observation has both an x and y value
Data entry errors are corrected, and any outliers that might skew results are identified and checked
Step 2: Choose your axes
Place the independent variable (the one you think might cause changes) on the x-axis
Place the dependent variable (the one you think might be affected) on the y-axis
Use appropriate scales that show the full range of data without excessive empty space
Step 3: Plot the points
Mark each data point at its corresponding (x, y) position. Use consistent symbols (usually circles) for all points.
Step 4: Add a trend line (optional)
If the data shows a clear linear pattern, add a regression line to visualize the trend. This helps with:
Making predictions
Understanding the rate of change
Communicating the relationship to others
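Steps 3 and 4 can be sketched without any plotting library by drawing on a text grid. This is only a dependency-free illustration: the function name `ascii_scatter`, the grid size, and the data are all hypothetical, and in practice you would normally use a charting tool instead. Data points are drawn as `o` and the least-squares trend line as `.`.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Render points as 'o' and the least-squares line as '.' on a text grid."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_min == x_max or y_min == y_max:
        raise ValueError("both variables need at least two distinct values")

    # Least-squares slope and intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b = mean_y - m * mean_x

    def col(x):
        return round((x - x_min) / (x_max - x_min) * (width - 1))

    def row(y):
        # Row 0 is the top of the grid, so larger y maps to a smaller row index
        return (height - 1) - round((y - y_min) / (y_max - y_min) * (height - 1))

    grid = [[" "] * width for _ in range(height)]

    # Draw the trend line first so that data points overwrite it
    for c in range(width):
        x = x_min + c * (x_max - x_min) / (width - 1)
        r = row(m * x + b)
        if 0 <= r < height:
            grid[r][c] = "."

    for x, y in zip(xs, ys):
        grid[row(y)][col(x)] = "o"

    return "\n".join("".join(line).rstrip() for line in grid)

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3, 3, 5, 6, 6, 8, 9]
print(ascii_scatter(xs, ys))
```

The axes are scaled to the minimum and maximum of the data, which matches the advice in Step 2 to show the full range without excessive empty space.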
Common mistakes to avoid
Correlation does not imply causation
Just because two variables are correlated doesn't mean one causes the other. There could be:
A third variable causing both
Coincidental correlation with no real connection
Reverse causation (y might cause x, not the other way around)
For example, ice cream sales and drowning rates are positively correlated, but ice cream doesn't cause drowning. Both increase in summer due to hot weather.
Ignoring outliers
Outliers can dramatically affect correlation coefficients and regression lines. Always:
Investigate unusual points
Determine if they're errors or legitimate extreme values
Consider their impact on your analysis
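The effect of a single outlier is easy to demonstrate. In this sketch, with hypothetical data and a hypothetical helper named `pearson_r`, eight points lie on a perfect line, and then one y value is replaced with 0 to mimic a data entry error.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumes equal lengths and non-constant data)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfectly linear relationship, y = x + 1
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3, 4, 5, 6, 7, 8, 9]
r_clean = pearson_r(xs, ys)

# Same data with the last y value mistyped as 0
ys_with_outlier = ys[:-1] + [0]
r_outlier = pearson_r(xs, ys_with_outlier)

print(f"without outlier: r = {r_clean:.4f}")
print(f"with outlier:    r = {r_outlier:.4f}")
```

With these numbers, r falls from 1 to roughly 0.23, even though seven of the eight points are unchanged.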
Assuming linearity
Not all relationships are linear. Some patterns are:
Exponential (rapid growth or decay)
Logarithmic (rapid initial change that levels off)
Polynomial (curved relationships)
If your scatter plot shows a curved pattern, linear regression may not be appropriate.
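A short sketch shows why a low r does not rule out a relationship. In the hypothetical data below, y is completely determined by x (y = x²), yet the Pearson coefficient is zero because the relationship is symmetric rather than linear.

```python
from math import sqrt

# Hypothetical data: y depends entirely on x, but not linearly
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)
print(f"r = {r:.4f}")
```

Plotting these points would reveal the U-shaped curve immediately, which is why it helps to look at the scatter plot before relying on r.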
Extrapolating beyond the data
Making predictions outside the range of your data is risky. A linear relationship observed between x = 10 and x = 50 may not hold for x = 100 or x = 200.
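The risk can be illustrated with a made-up process that levels off near 100. The function `true_y` below is purely hypothetical. A straight line is fitted to observations between x = 10 and x = 50 and then used to predict values inside and far outside that range.

```python
from math import exp

def true_y(x):
    # Hypothetical saturating process that levels off near 100
    return 100 * (1 - exp(-x / 20))

# Observations only cover x = 10 to x = 50
xs = [10, 20, 30, 40, 50]
ys = [true_y(x) for x in xs]

# Least-squares line through the observed points
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - m * mean_x

def predict(x):
    return m * x + b

for x in (30, 100, 200):
    print(f"x = {x:3d}: line predicts {predict(x):6.1f}, actual value {true_y(x):6.1f}")
```

Inside the observed range the line stays reasonably close to the actual values, but at x = 200 it predicts a value far above the ceiling of 100 that the process can never exceed.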
Practical applications
Business and marketing
Analyzing the relationship between ad spend and revenue
Studying price sensitivity and demand
Correlating customer satisfaction with retention rates
Science and research
Studying relationships between variables in experiments
Identifying trends in environmental data
Analyzing medical data for treatment effectiveness
Education
Examining factors that affect student performance
Studying the relationship between study time and grades
Analyzing attendance and academic outcomes
Finance
Analyzing stock correlations for portfolio diversification
Studying the relationship between economic indicators
Risk assessment and modeling
Limitations of scatter plots
Only shows two variables - Real-world phenomena often involve multiple factors
Assumes continuous data - Not suitable for categorical variables
Can be misleading with large datasets - Points may overlap, hiding patterns
Doesn't show causation - Only reveals association between variables
Sensitive to outliers - Extreme values can distort the apparent relationship
Tips for better analysis
Always visualize your data before running statistical tests
Look for patterns beyond simple linear relationships
Consider the context and domain knowledge when interpreting results
Report both the correlation coefficient and R-squared for complete information
Be cautious about making predictions, especially outside your data range
Use enough data points (a common rule of thumb is at least 10 to 20) for reliable correlation estimates