What is a scatter plot?

A scatter plot (also called a scatter graph, scatter diagram, or XY plot) is a type of data visualization that displays the relationship between two numerical variables. Each data point is represented as a dot positioned according to its x and y coordinates. By examining the pattern of dots, you can identify correlations, trends, and outliers in your data.

Scatter plots are fundamental tools in statistics, science, and business analytics. They help answer questions like: Does studying more lead to higher test scores? Is there a relationship between advertising spend and sales? Do taller people tend to weigh more?

How to read a scatter plot

When analyzing a scatter plot, look for these key patterns:

Pattern	Description	Correlation
Upward slope (left to right)	As x increases, y increases	Positive
Downward slope (left to right)	As x increases, y decreases	Negative
No clear pattern	Points scattered randomly	None or weak
Tight clustering around a line	Strong linear relationship	Strong
Wide spread around trend	Weak linear relationship	Weak

Positive correlation

When the dots form a pattern that rises from left to right, the variables have a positive correlation. Examples include:

Hours studied vs. exam scores
Exercise frequency vs. fitness level
Years of experience vs. salary

Negative correlation

When the dots form a pattern that falls from left to right, the variables have a negative correlation. Examples include:

Price vs. quantity demanded
Age of car vs. resale value
Distance from city center vs. property price

No correlation

When dots are scattered randomly with no discernible pattern, there may be no linear relationship between the variables.

The correlation coefficient (r)

The Pearson correlation coefficient, denoted as r, quantifies the strength and direction of a linear relationship between two variables. It ranges from -1 to +1 and is defined as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Where:

$x_i$ and $y_i$ are individual data points
$\bar{x}$ and $\bar{y}$ are the means of x and y
$n$ is the number of data points
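
As a minimal sketch, the formula can be translated almost term by term into Python using only the standard library. The function name `pearson_r` and the sample data (hours studied vs. exam score) are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return num / math.sqrt(sxx * syy)

# Illustrative (made-up) data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

The guard against a constant variable matters in practice: if every x (or every y) is the same, the denominator is zero and r is undefined.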

Interpreting correlation strength

r value	Interpretation
0.9 to 1.0	Very strong positive
0.7 to 0.9	Strong positive
0.5 to 0.7	Moderate positive
0.3 to 0.5	Weak positive
-0.3 to 0.3	Very weak or none
-0.5 to -0.3	Weak negative
-0.7 to -0.5	Moderate negative
-0.9 to -0.7	Strong negative
-1.0 to -0.9	Very strong negative
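
The table can be turned into a small lookup helper. This is a sketch only: `describe_r` is a hypothetical name, and because the table's ranges share their endpoints, the code assumes that a boundary value (for example exactly 0.7) belongs to the stronger category.

```python
def describe_r(r):
    """Map a correlation coefficient to the verbal labels in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    # Assumption: boundary values fall into the stronger category.
    if size < 0.3:
        return "very weak or none"
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for value in (0.95, 0.7, -0.6, 0.1):
    print(value, "->", describe_r(value))
```

These cut-offs are conventions rather than hard rules, so treat the labels as a starting point and adjust them to what counts as strong in your field.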

Linear regression and the line of best fit

Linear regression finds the straight line that best fits your data points. This line, also called the trend line or line of best fit, is expressed as:

$$y = mx + b$$

Where:

$y$ is the predicted value
$x$ is the independent variable
$m$ is the slope (how much y changes for each unit change in x)
$b$ is the y-intercept (the value of y when x = 0)

Calculating slope and intercept

The least squares method minimizes the sum of squared vertical distances between data points and the line. The formulas are:

$$\begin{aligned} m &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} \\[0.5em] b &= \bar{y} - m\bar{x} \end{aligned}$$
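
The two formulas can be sketched directly in Python. The function name `fit_line` and the sample data are illustrative assumptions, not part of any particular tool.

```python
def fit_line(xs, ys):
    """Least squares slope m and intercept b for y = mx + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("slope is undefined when all x values are identical")
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = sum_y / n - m * (sum_x / n)  # b = y_bar - m * x_bar
    return m, b

# Illustrative (made-up) data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
m, b = fit_line(hours, scores)
print(f"score = {m:.1f} * hours + {b:.1f}")
```

For this made-up data the sums are n = 5, Σx = 15, Σy = 320, Σxy = 1020 and Σx² = 55, so m = (5100 - 4800) / (275 - 225) = 6 and b = 64 - 6 × 3 = 46.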

R-squared: coefficient of determination

R-squared (R²) tells you what proportion of the variance in the dependent variable (y) is explained by the independent variable (x). For a simple linear regression with one predictor and an intercept, it equals the square of the correlation coefficient:

$$R^2 = r^2$$

For example:

If r = 0.8, then R² = 0.64, meaning 64% of the variance in y is explained by x
If r = -0.9, then R² = 0.81, meaning 81% of the variance in y is explained by x
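
You can check this relationship numerically. The sketch below computes R² in two ways for a least squares line with one predictor: from the residuals of the fitted line (1 minus residual variation over total variation) and by squaring r. The function name and data are illustrative assumptions.

```python
import math

def r_squared_two_ways(xs, ys):
    """Return (R^2 from residuals, r squared) for a least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    # Variation left unexplained by the fitted line.
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    r2_from_fit = 1 - ss_res / syy
    r = sxy / math.sqrt(sxx * syy)
    return r2_from_fit, r ** 2

# Illustrative (made-up) data: hours studied vs. exam score.
r2_fit, r2_corr = r_squared_two_ways([1, 2, 3, 4, 5], [52, 60, 61, 70, 77])
print(f"R^2 from residuals = {r2_fit:.4f}, r^2 = {r2_corr:.4f}")
```

The two numbers agree up to floating-point rounding. The shortcut does not carry over to models with several predictors or without an intercept.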

Interpreting R-squared

R² value	Interpretation
0.9 - 1.0	Excellent fit
0.7 - 0.9	Good fit
0.5 - 0.7	Moderate fit
0.3 - 0.5	Weak fit
0 - 0.3	Poor fit

How to create an effective scatter plot

Step 1: Prepare your data

Organize your data into pairs of x and y values. Ensure:

Both variables are numerical
Each observation has both an x and y value
Data is free of entry errors, and any outliers have been flagged for investigation rather than silently removed
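
A minimal sketch of the first two checks, assuming your raw data arrives as a list of (x, y) rows that may contain text, missing values, or NaN. The helper name `clean_pairs` is hypothetical. Rejected rows are returned rather than discarded so you can review them.

```python
import math

def clean_pairs(rows):
    """Split rows into usable numeric (x, y) pairs and rows needing review."""
    kept, rejected = [], []
    for row in rows:
        try:
            x, y = row                 # fails if the row does not have two values
            x, y = float(x), float(y)  # fails on None or non-numeric text
        except (TypeError, ValueError):
            rejected.append(row)
            continue
        if math.isfinite(x) and math.isfinite(y):
            kept.append((x, y))
        else:
            rejected.append(row)       # NaN or infinity
    return kept, rejected

# Illustrative (made-up) raw input.
raw = [(1, 52), ("2", "60"), (3, None), (4, "n/a"), (5, float("nan"))]
kept, rejected = clean_pairs(raw)
print("usable:", kept)
print("needs review:", rejected)
```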

Step 2: Choose your axes

Place the independent variable (the one you think might cause changes) on the x-axis
Place the dependent variable (the one you think might be affected) on the y-axis
Use appropriate scales that show the full range of data without excessive empty space
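
One simple way to choose a scale is to take the data range and add a small margin on each side. The sketch below assumes a 5% margin, which is an arbitrary illustrative choice, and `axis_limits` is a hypothetical helper name.

```python
def axis_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits covering the data plus a small margin."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: fall back to a margin based on the value itself.
        span = abs(lo) or 1.0
    pad = span * pad_fraction
    return lo - pad, hi + pad

x_low, x_high = axis_limits([10, 20, 50])
print(x_low, x_high)
```

Starting an axis at zero is not always necessary for a scatter plot. What matters is that every point is visible and the pattern is not squeezed into a corner.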

Step 3: Plot the points

Mark each data point at its corresponding (x, y) position. Use consistent symbols (usually circles) for all points.

Step 4: Add a trend line (optional)

If the data shows a clear linear pattern, add a regression line to visualize the trend. This helps with:

Making predictions
Understanding the rate of change
Communicating the relationship to others
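
Steps 3 and 4 can be sketched with Matplotlib, assuming it is installed. The function names, the output file name `scatter.png`, and the sample data are illustrative assumptions. Matplotlib is imported inside the plotting function so that the fitting helper still works without it.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for y = mx + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar

def plot_with_trend(xs, ys, path="scatter.png"):
    """Draw the points as circles, overlay the fitted line, save to a file."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    m, b = fit_line(xs, ys)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, marker="o", label="observations")
    # Draw the line only across the observed x range (no extrapolation).
    x_lo, x_hi = min(xs), max(xs)
    ax.plot([x_lo, x_hi], [m * x_lo + b, m * x_hi + b],
            color="red", label=f"y = {m:.2f}x + {b:.2f}")
    ax.set_xlabel("x (independent variable)")
    ax.set_ylabel("y (dependent variable)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
    return m, b

# Illustrative (made-up) data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
```

Calling `plot_with_trend(hours, scores)` writes the chart to `scatter.png` and returns the fitted slope and intercept.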

Common mistakes to avoid

Correlation does not imply causation

Just because two variables are correlated doesn't mean one causes the other. There could be:

A third variable causing both
Coincidental correlation with no real connection
Reverse causation (y might cause x, not the other way around)

For example, ice cream sales and drowning rates are positively correlated, but ice cream doesn't cause drowning. Both increase in summer due to hot weather.
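
The third-variable effect can be illustrated with a small simulation. Everything here is synthetic: temperature drives both series, and neither series depends on the other. The coefficients and noise levels are arbitrary assumptions. To see what remains once temperature is accounted for, the sketch uses the standard first-order partial correlation formula, a technique not otherwise covered in this article.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: temperature influences both quantities independently.
temperature = [rng.uniform(15, 35) for _ in range(500)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

r_xy = pearson_r(ice_cream, drownings)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(drownings, temperature)

# Partial correlation of x and y, holding z (temperature) fixed.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"raw correlation:            {r_xy:.2f}")
print(f"controlling for temperature: {partial:.2f}")
```

The raw correlation comes out clearly positive, while the partial correlation is close to zero, which is what you would expect when a shared cause is the only link between the two variables.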

Ignoring outliers

Outliers can dramatically affect correlation coefficients and regression lines. Always:

Investigate unusual points
Determine if they're errors or legitimate extreme values
Consider their impact on your analysis
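
To see how much a single point can matter, compare r with and without it. The data below is made up: ten points on a perfect line plus one extreme value.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Ten made-up points lying exactly on y = 2x.
xs = list(range(1, 11))
ys = [2 * x for x in xs]
r_clean = pearson_r(xs, ys)

# The same data with one extreme point added at (30, 0).
r_with_outlier = pearson_r(xs + [30], ys + [0])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_with_outlier:.2f}")
```

One point out of eleven is enough to turn a perfect positive correlation into a slightly negative one, which is why unusual points deserve a closer look before you report a coefficient.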

Assuming linearity

Not all relationships are linear. Some patterns are:

Exponential (rapid growth or decay)
Logarithmic (rapid initial change that levels off)
Polynomial (curved relationships)

If your scatter plot shows a curved pattern, linear regression may not be appropriate.
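
A quick illustration of why r alone can mislead: in the made-up data below, y is completely determined by x (y = x²), yet the Pearson coefficient is zero because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect U-shaped relationship, symmetric around x = 0.
xs = list(range(-5, 6))
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.2f}")
```

The falling left half and the rising right half cancel out. On a scatter plot the curve is obvious at a glance, which is the main reason to plot before you compute.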

Extrapolating beyond the data

Making predictions outside the range of your data is risky. A linear relationship observed between x = 10 and x = 50 may not hold for x = 100 or x = 200.
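
The risk can be made concrete with a sketch. Assume, purely for illustration, that the true relationship levels off (the hypothetical `true_curve` below) but you only observe it between x = 10 and x = 50, where it looks roughly linear.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for y = mx + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar

def true_curve(x):
    # Hypothetical saturating relationship, invented for this example.
    return 100 * x / (x + 50)

# Observations only cover x = 10 to x = 50.
xs = [10, 20, 30, 40, 50]
ys = [true_curve(x) for x in xs]
m, b = fit_line(xs, ys)

for x in (30, 100, 200):
    predicted = m * x + b
    actual = true_curve(x)
    print(f"x = {x:>3}: line predicts {predicted:6.1f}, actual {actual:6.1f}")
```

Inside the observed range the line stays within a couple of units of the curve. At x = 200 it overshoots badly, because nothing in the observed data revealed that the curve flattens.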

Practical applications

Business and marketing

Analyzing the relationship between ad spend and revenue
Studying price sensitivity and demand
Correlating customer satisfaction with retention rates

Science and research

Studying relationships between variables in experiments
Identifying trends in environmental data
Analyzing medical data for treatment effectiveness

Education

Examining factors that affect student performance
Studying the relationship between study time and grades
Analyzing attendance and academic outcomes

Finance

Analyzing stock correlations for portfolio diversification
Studying the relationship between economic indicators
Risk assessment and modeling

Limitations of scatter plots

Only shows two variables - Real-world phenomena often involve multiple factors
Assumes continuous data - Not suitable for categorical variables
Can be misleading with large datasets - Points may overlap, hiding patterns
Doesn't show causation - Only reveals association between variables
Sensitive to outliers - Extreme values can distort the apparent relationship

Tips for better analysis

Always visualize your data before running statistical tests
Look for patterns beyond simple linear relationships
Consider the context and domain knowledge when interpreting results
Report both the correlation coefficient and R-squared for complete information
Be cautious about making predictions, especially outside your data range
Use enough data points (as a rough rule of thumb, at least 10-20) for reliable correlation estimates
