Scatter Plot Maker

Create scatter plots and analyze correlations between two variables with regression analysis.

Correlation coefficient (r): 0.9512

Very strong positive correlation

As x increases, y tends to increase.

Regression analysis

Line of best fit: y = 0.9048x + 1.4286
Slope (m): 0.9048
Y-intercept (b): 1.4286

Correlation statistics

Correlation coefficient (r): 0.9512
R-squared (R²): 0.9048
Coefficient of determination: 90.4762%

Descriptive statistics

Number of points (n): 8
Mean of x: 4.5
Mean of y: 5.5
Min x: 1
Max x: 8
Min y: 2
Max y: 9

R² = 0.9048 means that 90.4762% of the variance in y can be explained by the linear relationship with x.

What is a scatter plot?

A scatter plot (also called a scatter graph, scatter diagram, or XY plot) is a type of data visualization that displays the relationship between two numerical variables. Each data point is represented as a dot positioned according to its x and y coordinates. By examining the pattern of dots, you can identify correlations, trends, and outliers in your data.

Scatter plots are fundamental tools in statistics, science, and business analytics. They help answer questions like: Does studying more lead to higher test scores? Is there a relationship between advertising spend and sales? Do taller people tend to weigh more?

How to read a scatter plot

When analyzing a scatter plot, look for these key patterns:

| Pattern | Description | Correlation |
| --- | --- | --- |
| Upward slope (left to right) | As x increases, y increases | Positive |
| Downward slope (left to right) | As x increases, y decreases | Negative |
| No clear pattern | Points scattered randomly | None or weak |
| Tight clustering around a line | Strong linear relationship | Strong |
| Wide spread around trend | Weak linear relationship | Weak |

Positive correlation

When the dots form a pattern that rises from left to right, the variables have a positive correlation. Examples include:

  • Hours studied vs. exam scores
  • Exercise frequency vs. fitness level
  • Years of experience vs. salary

Negative correlation

When the dots form a pattern that falls from left to right, the variables have a negative correlation. Examples include:

  • Price vs. quantity demanded
  • Age of car vs. resale value
  • Distance from city center vs. property price

No correlation

When dots are scattered randomly with no discernible pattern, there may be no linear relationship between the variables.

The correlation coefficient (r)

The Pearson correlation coefficient, denoted r, quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship) and is calculated as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Where:

  • $x_i$ and $y_i$ are individual data points
  • $\bar{x}$ and $\bar{y}$ are the means of x and y
  • $n$ is the number of data points
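As a rough illustration, the formula can be computed directly in Python. The function name `pearson_r` and the sample data are assumptions made for this sketch: the eight points were chosen because they reproduce the summary figures quoted at the top of this page (n = 8, r ≈ 0.9512), and they are not necessarily the data the tool itself used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired sequences x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of deviations from the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Illustrative data (an assumption, see the note above)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 4, 5, 7, 8, 9]
print(round(pearson_r(x, y), 4))  # 0.9512
```

Note that the sketch divides by zero if either variable is constant, because r is undefined in that case.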

Interpreting correlation strength

| r value | Interpretation |
| --- | --- |
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.5 to 0.7 | Moderate positive |
| 0.3 to 0.5 | Weak positive |
| -0.3 to 0.3 | Very weak or none |
| -0.5 to -0.3 | Weak negative |
| -0.7 to -0.5 | Moderate negative |
| -0.9 to -0.7 | Strong negative |
| -1.0 to -0.9 | Very strong negative |
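The table can be turned into a small lookup, sketched below. The function name `describe_correlation` is hypothetical, and because the ranges in the table share their endpoints, this sketch assumes that a boundary value such as 0.7 belongs to the stronger category.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the labels used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size < 0.3:
        return "Very weak or none"
    # Assumption: shared boundaries (0.3, 0.5, 0.7, 0.9) go to the stronger label
    if size >= 0.9:
        strength = "Very strong"
    elif size >= 0.7:
        strength = "Strong"
    elif size >= 0.5:
        strength = "Moderate"
    else:
        strength = "Weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

print(describe_correlation(0.9512))  # Very strong positive
print(describe_correlation(-0.6))    # Moderate negative
```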

Linear regression and the line of best fit

Linear regression finds the straight line that best fits your data points. This line, also called the trend line or line of best fit, is expressed as:

$$y = mx + b$$

Where:

  • $y$ is the predicted value
  • $x$ is the independent variable
  • $m$ is the slope (how much y changes for each unit change in x)
  • $b$ is the y-intercept (the value of y when x = 0)

Calculating slope and intercept

The least squares method minimizes the sum of squared vertical distances between data points and the line. The formulas are:

$$\begin{aligned} m &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} \\[0.5em] b &= \bar{y} - m\bar{x} \end{aligned}$$
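A minimal sketch of these two formulas in Python follows. The name `least_squares_line` and the sample data are assumptions for illustration; the points were picked so that the result matches the line of best fit shown at the top of this page.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least squares line through the points."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: m = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b = mean(y) - m * mean(x)
    b = sum_y / n - m * sum_x / n
    return m, b

# Illustrative data (an assumption, not necessarily the tool's own input)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 4, 5, 7, 8, 9]
m, b = least_squares_line(x, y)
print(f"y = {m:.4f}x + {b:.4f}")  # y = 0.9048x + 1.4286
```

The denominator is zero when every x value is identical, in which case no unique line exists and the sketch raises `ZeroDivisionError`.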

R-squared: coefficient of determination

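R-squared (R²) is the proportion of the variance in the dependent variable (y) that is explained by the independent variable (x), usually quoted as a percentage. For a straight-line fit with a single x variable and an intercept, as described here, it is simply the square of the correlation coefficient: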

$$R^2 = r^2$$

For example:

  • If r = 0.8, then R² = 0.64, meaning 64% of the variance in y is explained by x
  • If r = -0.9, then R² = 0.81, meaning 81% of the variance in y is explained by x
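To see why "variance explained" and r² are the same number for a straight-line fit, the sketch below computes R² from the residuals (1 minus the ratio of unexplained to total variation) and compares it with r². The data are illustrative values assumed for this example, chosen to match the summary figures at the top of the page.

```python
# Illustrative data (an assumption)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 4, 5, 7, 8, 9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares line (deviation form, equivalent to the raw-sum formulas)
m = s_xy / s_xx
b = y_bar - m * x_bar
r = s_xy / (s_xx * s_yy) ** 0.5

# Unexplained variation: squared vertical distances from the fitted line
ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
# Total variation of y around its mean
ss_tot = s_yy
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 4), round(r ** 2, 4))  # 0.9048 0.9048
```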

Interpreting R-squared

| R² value | Interpretation |
| --- | --- |
| 0.9 - 1.0 | Excellent fit |
| 0.7 - 0.9 | Good fit |
| 0.5 - 0.7 | Moderate fit |
| 0.3 - 0.5 | Weak fit |
| 0 - 0.3 | Poor fit |

How to create an effective scatter plot

Step 1: Prepare your data

Organize your data into pairs of x and y values. Ensure:

  • Both variables are numerical
  • Each observation has both an x and y value
  • Data has been checked for entry errors, and any outliers that might skew results have been investigated

Step 2: Choose your axes

  • Place the independent variable (the one you think might cause changes) on the x-axis
  • Place the dependent variable (the one you think might be affected) on the y-axis
  • Use appropriate scales that show the full range of data without excessive empty space
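One possible way to pick a scale that shows the full range without much empty space is to pad the data range by a small fraction on each side. The helper name `padded_limits` and the 5% default are assumptions of this sketch, not settings taken from the tool.

```python
def padded_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits that extend slightly past the data range."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # All values identical: fall back to a span based on the value itself
        span = abs(low) or 1.0
    pad = span * pad_fraction
    return low - pad, high + pad

x_low, x_high = padded_limits([1, 2, 3, 4, 5, 6, 7, 8])
print(round(x_low, 2), round(x_high, 2))  # 0.65 8.35
```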

Step 3: Plot the points

Mark each data point at its corresponding (x, y) position. Use consistent symbols (usually circles) for all points.
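Plotting amounts to mapping each (x, y) pair onto a position on the drawing surface. The text-only sketch below illustrates that mapping on a character grid so that it runs without any plotting library; the function name `ascii_scatter`, the grid size, and the data are all assumptions, and a real chart would normally be drawn with a plotting tool instead.

```python
def ascii_scatter(x, y, width=40, height=12):
    """Render paired data as a crude text scatter plot (one 'o' per point)."""
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    if x_min == x_max or y_min == y_max:
        raise ValueError("each variable needs at least two distinct values")
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each coordinate to a column / row index on the grid
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        # Row 0 is the top of the text block, so flip the y direction
        grid[height - 1 - row][col] = "o"
    return "\n".join("".join(line).rstrip() for line in grid)

# Illustrative data (an assumption)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 4, 5, 7, 8, 9]
plot = ascii_scatter(x, y)
print(plot)
```

Points that land in the same grid cell overwrite each other, which is a small-scale version of the overlap problem that affects dense scatter plots.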

Step 4: Add a trend line (optional)

If the data shows a clear linear pattern, add a regression line to visualize the trend. This helps with:

  • Making predictions
  • Understanding the rate of change
  • Communicating the relationship to others

Common mistakes to avoid

Correlation does not imply causation

Just because two variables are correlated doesn't mean one causes the other. There could be:

  • A third variable causing both
  • Coincidental correlation with no real connection
  • Reverse causation (y might cause x, not the other way around)

For example, ice cream sales and drowning rates are positively correlated, but ice cream doesn't cause drowning. Both increase in summer due to hot weather.

Ignoring outliers

Outliers can dramatically affect correlation coefficients and regression lines. Always:

  • Investigate unusual points
  • Determine if they're errors or legitimate extreme values
  • Consider their impact on your analysis

Assuming linearity

Not all relationships are linear. Some patterns are:

  • Exponential (rapid growth or decay)
  • Logarithmic (rapid initial change that levels off)
  • Polynomial (curved relationships)

If your scatter plot shows a curved pattern, linear regression may not be appropriate.

Extrapolating beyond the data

Making predictions outside the range of your data is risky. A linear relationship observed between x = 10 and x = 50 may not hold for x = 100 or x = 200.
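One simple safeguard is to have the prediction step report when the requested x lies outside the observed range. The function name `predict` and its return convention are assumptions of this sketch; the slope, intercept, and x range are the rounded values shown in the results at the top of this page.

```python
def predict(x_new, m, b, x_min, x_max):
    """Return (predicted y, True if x_new is outside the observed x range)."""
    y_hat = m * x_new + b
    is_extrapolation = not (x_min <= x_new <= x_max)
    return y_hat, is_extrapolation

# Fitted line and observed x range from the example results (rounded values)
m, b = 0.9048, 1.4286
x_min, x_max = 1, 8

for x_new in (5, 20):
    y_hat, outside = predict(x_new, m, b, x_min, x_max)
    note = " (extrapolation, treat with caution)" if outside else ""
    print(f"x = {x_new}: predicted y = {y_hat:.4f}{note}")
```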

Practical applications

Business and marketing

  • Analyzing the relationship between ad spend and revenue
  • Studying price sensitivity and demand
  • Correlating customer satisfaction with retention rates

Science and research

  • Studying relationships between variables in experiments
  • Identifying trends in environmental data
  • Analyzing medical data for treatment effectiveness

Education

  • Examining factors that affect student performance
  • Studying the relationship between study time and grades
  • Analyzing attendance and academic outcomes

Finance

  • Analyzing stock correlations for portfolio diversification
  • Studying the relationship between economic indicators
  • Risk assessment and modeling

Limitations of scatter plots

  1. Only shows two variables - Real-world phenomena often involve multiple factors
  2. Assumes numerical data - Not suitable for categorical variables
  3. Can be misleading with large datasets - Points may overlap, hiding patterns
  4. Doesn't show causation - Only reveals association between variables
  5. Sensitive to outliers - Extreme values can distort the apparent relationship

Tips for better analysis

  • Always visualize your data before running statistical tests
  • Look for patterns beyond simple linear relationships
  • Consider the context and domain knowledge when interpreting results
  • Report both the correlation coefficient and R-squared for complete information
  • Be cautious about making predictions, especially outside your data range
  • Use enough data points (at least 10-20) for reliable correlation estimates