
Backtesting the NAI: What 14 Years of Data Told Us About Our Own Assumptions


We recently published how the Nimble Attainability Index works — nine metrics, five data sources, one weighted score. What we didn't share was the question that kept us up at night: do the weights actually predict anything?

The NAI weights — income-price gap at 20%, supply pressure at 15%, and so on — were derived from first principles. We reasoned that the gap between what people can afford and what's being sold should matter most. We cited established benchmarks (NAR's balanced market definition, FHA qualifying ratios). The logic was sound.

But logic isn't evidence. So we built a backtesting engine, pointed it at 14 years of historical market data, and asked: when the NAI said a market was buyer-favorable, did buyers actually get favorable outcomes?

The answer was humbling.

The Setup

What We Had

Our PostgreSQL database already contained:

  • 738 weekly mortgage rate observations from FRED, spanning January 2012 to February 2026
  • Monthly Redfin market data for eight cities in California's Central Valley — median prices, days on market, inventory, sale-to-list ratios, price drops, and homes sold. Most cities had 156-169 months of history.

Combined, that's roughly 1,600 city-month data points. Enough to test whether a scoring model has signal.

What We Built

A Buyer Outcome Score (BOS). This is the "ground truth" — a composite measure of whether conditions actually favored buyers in a given month. It uses five realized metrics:

  1. Days on market — Were buyers able to take their time? (60+ days = strong buyer market, <15 = seller's market)
  2. Sale-to-list ratio — Did buyers get discounts? (Below 0.95 = buyers negotiating well)
  3. Price drops — Were sellers cutting prices? (30%+ of listings with cuts = buyer leverage)
  4. Months of supply — Was there inventory to choose from? (6+ months = buyer's market)
  5. Year-over-year price change — Were prices stabilizing or falling? (Negative = improving affordability)

Each scores 0-100, averaged into a single BOS. Higher = better for buyers.
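
As a sketch of that averaging (the exact production thresholds may differ; the endpoints below follow the ranges listed above, e.g. 60+ DOM = strong buyer market, 6+ months of supply = buyer's market):

```python
def scale(value, worst, best):
    """Linearly map a raw metric onto 0-100: `worst` scores 0, `best` scores 100,
    values outside the range are clamped."""
    t = (value - worst) / (best - worst)
    return max(0.0, min(100.0, t * 100.0))

def buyer_outcome_score(dom, sale_to_list, price_drop_pct, months_supply, yoy_change):
    """Buyer Outcome Score: equal-weighted average of five 0-100 sub-scores.
    Thresholds here are illustrative endpoints from the article's ranges."""
    subs = [
        scale(dom, 15, 60),               # more days on market favors buyers
        scale(sale_to_list, 1.00, 0.95),  # sales below list favor buyers
        scale(price_drop_pct, 10, 30),    # more listings with cuts favor buyers
        scale(months_supply, 2, 6),       # more inventory favors buyers
        scale(yoy_change, 5, -5),         # falling prices favor buyers
    ]
    # Production requires at least 3 of 5 components; this sketch assumes all five.
    return sum(subs) / len(subs)
```

A month with 60 days on market, a 0.95 sale-to-list ratio, 30% of listings cut, 6 months of supply, and prices down 5% year-over-year scores a perfect 100 for buyers.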

A backfill engine. For each city-month, we retroactively computed what the NAI would have scored using historical data and a given weight configuration. This lets us test any set of weights against the full history.
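
The backfill itself reduces to a weighted sum of the nine component scores for each city-month. A minimal sketch, using the original weights from the table later in this article (component names are shorthand, matching the sensitivity analysis):

```python
# Original NAI weights; the nine components sum to 1.0.
ORIGINAL_WEIGHTS = {
    "income_gap": 0.20, "supply_pressure": 0.15, "fha_headroom": 0.10,
    "absorption": 0.10, "rent_price": 0.10, "pipeline": 0.10,
    "market_health": 0.10, "price_momentum": 0.10, "fha_utilization": 0.05,
}

def nai_score(component_scores, weights):
    """Weighted NAI for one city-month. `component_scores` maps each component
    name to its 0-100 score computed from that month's historical data."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(weights[k] * component_scores[k] for k in weights)
```

Because the weights are a parameter rather than a constant, replaying the full history under a candidate configuration is just one pass over the stored component scores.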

A correlation engine. We computed the Pearson correlation between NAI at time T and BOS at time T+3 months (a 3-month predictive lead). If the NAI is useful, high scores at time T should precede favorable buyer outcomes at T+3.
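
Since the stack hand-rolls Pearson's r (see the appendix), the lead-lag alignment can be sketched as follows, assuming each city's NAI and BOS are month-ordered lists with `None` for missing months:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, no numpy/scipy required."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def lead_correlation(nai, bos, lead_months=3, min_pairs=12):
    """Correlate NAI at month T with BOS at month T + lead_months.
    Returns None when fewer than `min_pairs` paired observations exist."""
    pairs = [(nai[t], bos[t + lead_months])
             for t in range(len(nai) - lead_months)
             if nai[t] is not None and bos[t + lead_months] is not None]
    if len(pairs) < min_pairs:
        return None
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

Per-city correlations are then averaged into the overall figure reported below.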

The Key Question

Does a high NAI today predict a better market for buyers three months from now?

The Results

Overall: Weak Signal

| Metric | Value |
|--------|-------|
| Overall correlation (3-month lead) | 0.053 |
| Overall correlation (6-month lead) | -0.091 |

A correlation of 0.053 is essentially noise. For context, you'd want to see at least 0.3 for "moderate" predictive power and 0.5+ for "strong." The NAI, as originally weighted, is no better than chance at predicting which markets will favor buyers three months later.

The 6-month lead was slightly negative, meaning the NAI was directionally wrong at longer horizons.

Per-City: Uneven Performance

| City | Data | 3-Month Correlation | 6-Month Correlation |
|------|------|---------------------|---------------------|
| Turlock | 156 months | 0.248 | 0.195 |
| Stockton | 169 months | 0.111 | -0.103 |
| Modesto | 307 months | 0.101 | 0.014 |
| Patterson | 156 months | 0.099 | -0.069 |
| Sacramento | 338 months | 0.031 | -0.128 |
| Manteca | 169 months | 0.002 | -0.244 |
| Tracy | 169 months | -0.070 | -0.255 |
| Ripon | 167 months | -0.094 | -0.138 |

Turlock stood out with 0.248 correlation — moderate and arguably useful. Modesto and Stockton showed weak positive signal. But Tracy, Ripon, and Manteca showed negative correlation. The NAI was pointing the wrong direction for those markets.

Why the Weights Fail: Static vs. Dynamic

The sensitivity analysis told the real story. We tested what happens when you shift each weight by ±5%:

| Weight Change | Resulting Correlation |
|---------------|-----------------------|
| income_gap -5% | 0.117 (+0.064) |
| absorption +5% | 0.087 (+0.034) |
| market_health +5% | 0.087 (+0.034) |
| supply_pressure +5% | 0.083 (+0.030) |
| price_momentum +5% | 0.081 (+0.028) |
| rent_price -5% | 0.081 (+0.028) |
| fha_headroom -5% | 0.069 (+0.016) |
| income_gap +5% | -0.000 (-0.053) |

The pattern is clear: reducing the weight on static affordability metrics (income_gap, fha_headroom) and increasing the weight on dynamic market metrics (absorption, market_health, price_momentum, supply_pressure) improves predictive power.

This makes intuitive sense once you see it:

  • Income-price gap changes slowly. Median income shifts by 2-3% per year. At 20% weight, the NAI is anchored to a value that barely moves, while actual buyer outcomes swing with inventory cycles, rate changes, and market sentiment.
  • Absorption rate and market health change monthly. They capture the momentum that actually determines whether the next 90 days will favor buyers.
  • Price momentum is the derivative of affordability — it tells you which direction the market is heading, which matters more for a 3-month prediction than where it currently sits.

The original weights treated the NAI like a snapshot tool. The backtest revealed it needs to be a momentum tool.

What This Means

For Our Model

The current weights are defensible for describing affordability at a point in time — "this market is more affordable than that one." The income-price gap genuinely is the most important factor for answering "can people afford homes here?"

But they're poor at predicting whether conditions will improve. For that, you need to weight the dynamic signals more heavily. A market with a large income-price gap but rising inventory, increasing days on market, and falling sale-to-list ratios is about to get more attainable — and the current weights undercount those signals.

For Anyone Building Scoring Models

Three lessons generalize beyond housing:

1. Expert intuition is a hypothesis, not a conclusion. The original weights felt right. They were grounded in established frameworks and sound reasoning. But "sounds right" and "predicts outcomes" are different claims. The only way to know is to test against data.

2. Static variables dominate descriptive models; dynamic variables dominate predictive models. If you're asking "what is the state of things right now," weight the structural factors. If you're asking "what will happen next," weight the rates of change. Most scoring systems conflate these two questions.

3. Correlation varies by market. Turlock at 0.248 and Tracy at -0.070 use identical metrics, identical weights, identical scoring functions. The difference is the market itself. Some markets are well-characterized by these nine metrics. Others have dynamics (local employment shifts, specific development patterns, commuter corridor effects) that the model doesn't capture. Acknowledging where your model fails is as valuable as celebrating where it succeeds.

What We're Doing About It

Grid Search Results

Rather than manually adjusting weights based on the sensitivity analysis, we built an automated grid search. It tests weight combinations at 5% resolution, constrained to sum to 1.0, and measures which configuration maximizes the average predictive correlation across all 8 cities.
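
A simplified sketch of that search (the production version also varies market_health and price_momentum over wider ranges; here only the two biggest movers are searched, with the remainder split evenly across the other seven components):

```python
def candidate_weights(step=0.05):
    """Yield candidate weight sets: income_gap and supply_pressure vary at 5%
    resolution, the other seven components split the remainder equally."""
    others = ["fha_headroom", "absorption", "rent_price", "fha_utilization",
              "pipeline", "market_health", "price_momentum"]
    for ig_steps in range(1, 7):        # income_gap: 5% .. 30%
        for sp_steps in range(1, 8):    # supply_pressure: 5% .. 35%
            ig, sp = ig_steps * step, sp_steps * step
            rest = 1.0 - ig - sp
            if rest <= 0:
                continue
            w = {"income_gap": ig, "supply_pressure": sp}
            w.update({k: rest / len(others) for k in others})
            yield w

def grid_search(score_fn):
    """Return the candidate maximizing score_fn, e.g. the average 3-month
    lead correlation across all cities under those weights."""
    best, best_score = None, float("-inf")
    for w in candidate_weights():
        s = score_fn(w)
        if s > best_score:
            best, best_score = w, s
    return best, best_score
```

The scoring function is the expensive part: each candidate requires a full backfill and correlation pass over the 1,631 city-months.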

The result confirmed what the sensitivity analysis hinted at:

| Component | Original Weight | Optimized Weight | Change |
|-----------|-----------------|------------------|--------|
| Income-Price Gap | 20% | 5% | -15% |
| Supply Pressure | 15% | 30% | +15% |
| FHA Headroom | 10% | 9% | -1% |
| Absorption | 10% | 9% | -1% |
| Rent-to-Price | 10% | 9% | -1% |
| FHA Utilization | 5% | 9% | +4% |
| Pipeline | 10% | 9% | -1% |
| Market Health | 10% | 10% | — |
| Price Momentum | 10% | 10% | — |

| Metric | Original | Optimized |
|--------|----------|-----------|
| Predictive correlation | 0.053 | 0.317 |
| Candidates tested | — | 83 |
| Data points | 1,631 | 1,631 |

The correlation jumped from 0.053 to 0.317 — a 6x improvement, moving from "noise" to "moderate predictive signal."

The biggest move: income_gap dropped from 20% to 5% and supply_pressure doubled from 15% to 30%. The model is saying: for predicting what happens next in a market, inventory dynamics matter far more than the current affordability snapshot.

This makes domain sense. Income-to-price ratios are structural — they shift slowly over years. But months of supply can swing from 2 to 6 in a single quarter as listings accumulate or get absorbed. That volatility is exactly what drives the short-term buyer outcomes we're measuring.

The remaining seven components converged to roughly equal weight (9-10% each), suggesting they contribute useful but undifferentiated signal when supply pressure is properly weighted.

Governance: Weights Never Auto-Update

The grid search produces a proposal, not a deployment. The flow:

  1. Quarterly: system runs backtest and grid search
  2. System proposes new weights with correlation evidence
  3. Human reviews: Is the improvement meaningful? Is it overfitting? Does it make domain sense?
  4. Human activates new weights (or keeps current ones)
  5. Old weights preserved in version history for audit

We store every weight configuration with its validation score, the number of data points used, and the lead time tested. If a future weight set underperforms, we can roll back to any previous version.
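
One hypothetical shape for a stored weight version (field names here are illustrative, not the actual schema):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class WeightVersion:
    """One row of the weight version history. Immutable so that an audited
    configuration can never be edited in place, only superseded."""
    weights: dict            # component name -> weight, summing to 1.0
    validation_corr: float   # average predictive correlation from the backtest
    data_points: int         # paired observations used in validation
    lead_months: int         # predictive lead tested (e.g. 3)
    activated: bool = False  # flipped to True only by a human reviewer
    created: date = field(default_factory=date.today)
```

Rolling back is then a matter of activating an earlier row rather than re-deriving anything.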

What We Won't Do

We won't chase perfect correlation. The 0.317 result lands squarely in the "moderate" range, and we're comfortable there. The BOS itself is a simplified composite. Real buyer outcomes depend on individual circumstances (credit score, down payment, specific property features) that no market-level index can capture. We'd rather have an honest 0.3 than an overfit 0.7.

We won't overfit to the backtest period. 2012-2026 includes the post-crisis recovery, a historic bull market, a rate shock, and a period of elevated uncertainty. A weight set optimized for this specific history might not generalize. We're watching for configurations that show consistent signal across sub-periods, not just the highest aggregate number.

The Broader Point

Most indices in housing are published with fixed methodologies and never questioned. The NAR Housing Affordability Index has used the same formula for decades. The NAHB/Wells Fargo Housing Opportunity Index weights all markets equally. These are useful tools, but they don't learn.

We think the right approach is to be transparent about what works and what doesn't, test assumptions against data, and update when the evidence warrants it. The NAI's initial weights were wrong — or at least, less right than they could be. That's not a failure of the system. It's the system working as designed.

The goal was never to get the weights right on the first try. The goal was to build infrastructure that tells us when they're wrong and what better looks like.


Technical Appendix

For those who want to reproduce or adapt this approach:

Correlation method: Pearson's r between NAI(T) and BOS(T + lead_months), computed per city then averaged. We tested 3-month and 6-month leads.

Data requirements: Monthly observations with median sale price, days on market, sale-to-list ratio, price drops percentage, months of supply, and year-over-year price change. Minimum 12 paired observations per city for correlation to be computed.

BOS components: Equal-weighted average of 5 sub-scores (DOM, sale-to-list, price drops, months of supply, YoY price change), each scaled 0-100 with thresholds calibrated to NAR and industry benchmarks. Requires at least 3 of 5 components to produce a score.

Grid search constraints: All 9 weights must be positive (minimum 1%), must sum to 1.0, tested at 5% resolution. Major weights (income_gap, supply_pressure, market_health, price_momentum) searched over wider ranges; minor weights distributed among remaining components.

Tools: Python, PostgreSQL, no ML libraries. Pearson correlation implemented directly (no scipy/numpy dependency). Entire backtest runs in under 5 minutes on commodity hardware.

