Regression analysis
A statistical method that estimates how a result (dependent variable) changes with one or more factors (independent variables). It quantifies drivers, tests their significance, and supports forecasting, producing evidence-based lessons and recommendations at project or phase close.
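As a minimal sketch of the idea (using Python with the statsmodels library and made-up closeout figures, not data from this text), a single-driver regression estimates how many days of slippage each additional approved change tends to add:

```python
# Minimal sketch: estimate how schedule slippage changes with one driver.
# The numbers below are hypothetical closeout data, purely for illustration.
import numpy as np
import statsmodels.api as sm

change_requests = np.array([2, 4, 5, 7, 8, 10, 12, 15], dtype=float)  # driver
slippage_days = np.array([3, 6, 9, 13, 15, 20, 24, 31], dtype=float)  # outcome

X = sm.add_constant(change_requests)   # adds the intercept term
fit = sm.OLS(slippage_days, X).fit()

print(fit.params)    # intercept and slope: expected days added per extra change
print(fit.pvalues)   # significance of each coefficient
print(fit.rsquared)  # share of the variation in slippage the driver explains
```

The slope is the "expected change in the outcome for a one-unit change in the driver" referred to throughout this section.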
Key Points
- Used to measure the strength and direction of relationships between outcomes (e.g., schedule or cost variance) and potential drivers (e.g., changes, defects, staffing levels).
- Supports closeout by turning performance data into actionable lessons, benefits insights, and governance recommendations.
- Common forms include simple linear, multiple linear, and logistic regression; choose based on the outcome type (see the sketch after this list).
- Relies on clean, sufficient historical data and checks for assumptions such as linearity, independence, and constant variance.
- Results are documented with coefficients, significance tests, and goodness-of-fit metrics to inform future plans and baselines.
- Complements other analyses (Pareto charts, control charts) by quantifying impact rather than just describing frequency or dispersion.
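A short sketch of that choice, again on hypothetical data and using statsmodels (an assumption; any statistics package works): a linear model fits a continuous outcome such as slippage days, while a logistic model fits a pass/fail outcome such as whether a milestone was met.

```python
# Sketch: match the model form to the outcome type (hypothetical data).
import numpy as np
import statsmodels.api as sm

rework_hours = np.array([30, 40, 60, 80, 90, 120, 140, 150], dtype=float)
X = sm.add_constant(rework_hours)

# Continuous outcome (schedule slippage in days): linear regression.
slippage_days = np.array([2, 4, 5, 9, 10, 16, 19, 22], dtype=float)
linear_fit = sm.OLS(slippage_days, X).fit()

# Binary outcome (milestone met = 1, missed = 0): logistic regression.
milestone_met = np.array([1, 1, 1, 0, 1, 0, 1, 0])
logit_fit = sm.Logit(milestone_met, X).fit(disp=0)

print(linear_fit.params)  # slippage days added per extra rework hour
print(logit_fit.params)   # change in log-odds of meeting the milestone per hour
```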
Purpose of Analysis
- Quantify which factors most influenced cost, schedule, quality, or benefits realization.
- Estimate the expected change in the outcome for a one-unit change in each driver.
- Create defensible recommendations for process improvements and standards.
- Provide predictive insight for future projects and portfolio planning.
- Strengthen closeout reporting with statistically supported findings.
Method Steps
- Frame the question: define the outcome to explain (e.g., final schedule slippage in days) and the decision to support.
- Select candidate drivers based on logic and availability (e.g., number of change requests, team turnover, defect rework hours).
- Collect and clean data: handle missing values, remove obvious outliers, and align units and timeframes.
- Explore data visually with scatterplots and correlations to check direction and rough linearity.
- Fit an appropriate model (e.g., multiple linear regression for continuous outcomes; logistic for pass/fail outcomes).
- Evaluate fit and assumptions using R²/adjusted R², RMSE, p-values, residual plots, and multicollinearity diagnostics (e.g., VIF); a worked sketch of the fit-and-evaluate steps follows this list.
- Refine the model: remove weak or collinear predictors, consider transformations or interaction terms where justified.
- Interpret results in business terms and validate with subject matter experts.
- Document findings, limitations, and practical recommendations in closing reports and lessons learned.
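The sketch below walks through the fit-and-evaluate steps on simulated closeout data (the project count, column names, and coefficients are all assumptions for illustration); it uses the statsmodels formula interface plus a VIF check.

```python
# Sketch: multiple linear regression of schedule slippage on three candidate
# drivers, followed by fit, significance, multicollinearity, and residual checks.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 24  # e.g., 24 comparable past projects or work packages
data = pd.DataFrame({
    "changes":      rng.integers(0, 12, n),
    "rework_hours": rng.integers(20, 200, n),
    "team_size":    rng.integers(4, 12, n),
})
# Hypothetical "true" relationship plus noise, for illustration only.
data["slippage_days"] = (2.0 + 2.0 * data["changes"] + 0.05 * data["rework_hours"]
                         - 0.7 * data["team_size"] + rng.normal(0, 3, n))

model = smf.ols("slippage_days ~ changes + rework_hours + team_size", data=data).fit()
print(model.summary())  # coefficients, p-values, R²/adjusted R²

# Multicollinearity check: VIF per predictor (rule of thumb: above ~5-10 is a concern).
X = model.model.exog
for idx, name in enumerate(model.model.exog_names):
    if name != "Intercept":
        print(name, round(variance_inflation_factor(X, idx), 2))

# Residuals vs fitted values should show no obvious pattern (plot to confirm).
residuals = model.resid
fitted = model.fittedvalues
```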
Inputs Needed
- Project performance data: CPI, SPI, cost variance, schedule variance, milestone dates, rework hours, defect counts.
- Change log details: number, size, and timing of approved changes.
- Resource metrics: staffing levels, turnover, overtime, skill mix.
- Quality metrics: test coverage, defect severity distribution, escape rate.
- Risk outcomes: realized risks, mitigation actions, residual impacts.
- External/context data: vendor lead times, market events, tool availability.
- Baseline plans and acceptance criteria to anchor comparisons.
- Data dictionary and measurement definitions to ensure consistency, typically one row per project or phase (as sketched after this list).
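A minimal sketch of how these inputs might be assembled for modeling; the column names and values are illustrative assumptions, not a prescribed schema.

```python
# Sketch: one row per closed project or phase, with inputs aligned to a shared
# data dictionary (hypothetical columns and figures).
import pandas as pd

closeout_data = pd.DataFrame({
    "project_id":        ["P-101", "P-102", "P-103"],
    "cpi":               [0.92, 1.05, 0.88],
    "spi":               [0.85, 1.01, 0.90],
    "approved_changes":  [9, 3, 12],
    "rework_hours":      [140, 40, 210],
    "turnover_pct":      [12.0, 4.0, 18.0],
    "schedule_var_days": [18, -2, 25],   # the outcome to be explained
})
print(closeout_data.dtypes)  # confirm units and types before modeling
```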
Outputs Produced
- Model equation with coefficients and interpretation of each driver (the sketch after this list shows how these outputs are pulled from a fitted model).
- Statistical significance indicators (p-values, confidence intervals).
- Goodness-of-fit metrics (R²/adjusted R², RMSE, AIC/BIC as applicable).
- Residual and sensitivity analyses, including prediction intervals.
- Visuals: scatterplots with fitted lines, coefficient charts, residual plots.
- Closeout report content and lessons learned entries that specify what to repeat, avoid, or control better.
- Recommendations for standards, estimating factors, and risk thresholds in future work.
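As a sketch of extracting these outputs from a fitted model (hypothetical single-driver data; the statsmodels result methods shown are one common way to obtain them):

```python
# Sketch: pull coefficients, confidence intervals, fit metrics, and a
# prediction interval from a fitted regression (illustrative data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
changes = rng.integers(0, 12, 20).astype(float)
slippage = 1.5 + 2.2 * changes + rng.normal(0, 2.5, 20)

fit = sm.OLS(slippage, sm.add_constant(changes)).fit()

print(fit.params)                          # model equation coefficients
print(fit.conf_int(alpha=0.05))            # 95% confidence intervals per coefficient
print(fit.rsquared_adj, fit.aic, fit.bic)  # goodness-of-fit metrics

# Prediction interval for a future case, e.g., a project expecting 8 changes.
new_case = sm.add_constant(np.array([8.0]), has_constant="add")
pred = fit.get_prediction(new_case)
print(pred.summary_frame(alpha=0.05))  # mean CI plus obs_ci_* prediction interval
```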
Interpretation Tips
- Translate coefficients into practical units (e.g., each approved change added 2.3 days on average).
- Use adjusted R² to compare models with different numbers of predictors (illustrated after this list).
- Check multicollinearity; high VIF values indicate unstable coefficient estimates.
- Focus on effect sizes and confidence intervals, not just p-values.
- Use prediction intervals to convey uncertainty for future projects.
- Do not infer causation solely from statistical association; apply domain logic and timing evidence.
- Avoid extrapolating beyond the data range used to fit the model.
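A small illustration of the adjusted R² tip, with simulated data and an intentionally irrelevant candidate driver (all names and values are assumptions):

```python
# Sketch: compare a simpler and a fuller model using adjusted R², which
# penalizes extra predictors, rather than raw R² (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 30
df = pd.DataFrame({
    "changes":      rng.integers(0, 12, n),
    "rework_hours": rng.integers(20, 200, n),
    "noise_factor": rng.normal(0, 1, n),   # an irrelevant candidate driver
})
df["slippage_days"] = (2.0 + 2.0 * df["changes"] + 0.05 * df["rework_hours"]
                       + rng.normal(0, 3, n))

small = smf.ols("slippage_days ~ changes + rework_hours", data=df).fit()
full = smf.ols("slippage_days ~ changes + rework_hours + noise_factor", data=df).fit()

# Raw R² never decreases when a predictor is added; adjusted R² can, which is
# why it is the fairer basis for comparing these two models.
print(round(small.rsquared, 3), round(small.rsquared_adj, 3))
print(round(full.rsquared, 3), round(full.rsquared_adj, 3))
```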
Example
A software project finished 18 days late. The team modeled schedule slippage (days) using predictors: approved change requests, defect rework hours, and average team size.
- Model summary: Slippage = 1.6 + 2.1×(Changes) + 0.04×(ReworkHours) − 0.8×(AvgTeamSize) (a quick numeric check follows below).
- Fit: adjusted R² = 0.71; all coefficients significant at p < 0.05.
- Interpretation: each additional approved change added ~2.1 days, while adding one team member reduced slippage by ~0.8 days, within the studied range.
- Action: tighten change control late in the project and plan buffer for high-defect modules.
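Plugging assumed driver values into the equation above gives a quick sanity check; the specific inputs below are illustrative and not given in the example.

```python
# Worked check of the example equation, using assumed driver values
# (7 changes, 180 rework hours, average team of 6) that are not from the text.
def predicted_slippage(changes, rework_hours, avg_team_size):
    return 1.6 + 2.1 * changes + 0.04 * rework_hours - 0.8 * avg_team_size

print(predicted_slippage(7, 180, 6))  # 1.6 + 14.7 + 7.2 - 4.8 = 18.7 days
```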
Pitfalls
- Poor data quality or inconsistent measurement undermining results.
- Overfitting with too many predictors for a small number of observations.
- Omitted variable bias from leaving out relevant drivers.
- Multicollinearity causing unstable or counterintuitive coefficients.
- Assumption violations (nonlinearity, heteroscedasticity, autocorrelation) leading to misleading inferences; quick diagnostic checks are sketched after this list.
- Misinterpreting correlation as causation or cherry-picking results to fit a narrative.
- Ignoring timing and context, such as late-stage changes having different impacts than early ones.
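As a sketch of screening for two of these violations (hypothetical data; the Breusch-Pagan and Durbin-Watson diagnostics from statsmodels are one common choice, and the thresholds in the comments are rules of thumb):

```python
# Sketch: quick checks for heteroscedasticity and residual autocorrelation
# on a hypothetical fitted model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(11)
changes = rng.integers(0, 12, 25).astype(float)
slippage = 2.0 + 2.0 * changes + rng.normal(0, 3, 25)

X = sm.add_constant(changes)
fit = sm.OLS(slippage, X).fit()

# Heteroscedasticity (non-constant variance): a small p-value is a warning sign.
bp_stat, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", round(bp_pvalue, 3))

# Autocorrelation in residuals: a statistic far from ~2 is a warning sign.
print("Durbin-Watson:", round(durbin_watson(fit.resid), 2))
```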
PMP Example Question
During project closure, the PM wants to determine which factors most contributed to a 12% schedule overrun and to create evidence-based lessons learned. Which action best applies regression analysis?
- A. Create a Pareto chart of delay causes from the issue log.
- B. Model schedule overrun as the dependent variable with number of approved changes, defect rework hours, and staffing levels as predictors.
- C. Compute earned schedule to recalculate SPI(t) at completion.
- D. Run a Monte Carlo simulation of remaining schedule risk.
Correct Answer: B — Model schedule overrun with relevant predictors.
Explanation: Regression quantifies how specific factors relate to the overrun and tests significance. Pareto ranks frequency, earned schedule measures performance, and Monte Carlo simulates uncertainty rather than estimating driver impacts.