The Weekend Our Pipeline Processed The Same Data 47 Times 


It was Monday morning when our analytics team started noticing something wrong. Customer transaction volumes had apparently skyrocketed over the weekend, up 4,700% according to the dashboards. Impossible numbers were everywhere. Our fraud detection system was flagging thousands of suspicious patterns. The executive team was already asking questions.

I pulled up the logs with a sinking feeling. Sure enough, our production data pipeline had processed Saturday's data not once, not twice, but 47 times between Saturday and Monday morning.

The Investigation 

The first clue was in the Airflow DAG (directed acyclic graph) history. Every few hours over the weekend, a task had failed, triggered a retry and then succeeded. Normal behavior, except that each "successful" retry processed the same date's data again and again.

I started digging through our PySpark job logs. The execution timestamps told the story: Saturday at 2 p.m., Saturday at 5 p.m., Saturday at 8 p.m., Sunday morning, Sunday afternoon. Each run showed the same execution date in the logs but was reprocessing Saturday's transactions. The pipeline was supposed to be idempotent. We'd spent weeks building retry logic specifically to handle transient failures gracefully. Yet here we were with 47 copies of the same data sitting in our Snowflake warehouse.

The Root Cause 

Our retry logic had a subtle but critical bug. Here's what was supposed to happen (a minimal sketch of this setup follows the list):

  1. Task fails (network timeout, temporary error, etc.)
  2. Airflow triggers retry.
  3. Task re-runs with the same execution date.
  4. Data gets reprocessed; deduplication logic handles it.

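Something like this minimal Airflow sketch captures the intent; the DAG and task names are illustrative, not our real ones. The key point is that retries are configured on the operator, and every retry of a task instance receives the same execution date, so a retry can only re-attempt the day it was originally given.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_transactions(ds, **context):
    # 'ds' is the execution date Airflow passes in (YYYY-MM-DD).
    # Every retry of this task instance gets the same value, so a retry
    # can only reprocess the day it originally attempted.
    print(f"processing transactions for {ds}")

with DAG(
    dag_id="daily_transactions",              # illustrative name
    start_date=datetime(2024, 11, 1),
    schedule="@daily",                        # Airflow 2.4+ argument name
    catchup=False,
    default_args={
        "retries": 3,                         # transient failures get retried...
        "retry_delay": timedelta(minutes=30), # ...after a short wait
    },
) as dag:
    PythonOperator(task_id="process_day", python_callable=process_transactions)
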
What was really happening:

  1. Task fails processing Saturday’s data.
  2. Our "smart" fallback logic kicked in: "If the current date fails, process the last successful date instead."
  3. Task succeeded, processing Saturday's data again.
  4. Next scheduled run: "Process Sunday's data, but if that fails, fall back to Saturday…"
  5. Sunday's processing failed (different issue).
  6. Fallback processed Saturday again.
  7. Repeat 47 times.

The fallback logic seemed reasonable when we wrote it. "Always deliver something" felt safer than "fail completely." We didn't realize we'd created a loop where temporary failures would cause us to repeatedly process old data.

The Debugging Process 

Finding the bug took longer than it should have because the pipeline showed "success" in Airflow. Every task completed with a green checkmark. The data was landing in Snowflake. From a workflow perspective, everything looked fine.

The breakthrough came when I compared the execution dates in our logs with the actual data dates in the processed files. They didn't match. Tasks marked "execution_date=2024-11-10" were processing data from "data date=2024-11-09".
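
In retrospect, a check along these lines would have surfaced the mismatch in minutes. This is only a sketch against assumed table and column names (analytics.transactions, execution_date, data_date), not our actual schema:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Group landed rows by the execution date that wrote them and the date inside
# the data itself, then keep only the combinations where the two disagree.
(spark.table("analytics.transactions")
    .groupBy("execution_date", "data_date")
    .count()
    .filter(F.col("execution_date") != F.col("data_date"))
    .orderBy("execution_date")
    .show(truncate=False))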

Once I saw that discrepancy, the fallback logic became obvious. I found the code:

try:
    data = load_data(execution_date)
except DataNotAvailableError:
    logger.warning(f"Data for {execution_date} not available, using previous date")
    data = load_data(previous_successful_date)


This seemed defensive. But it violated a critical principle: The execution date and the data date must always match.

The Fix

The solution had 3 parts:

  1. Remove the fallback logic entirely. If data isn't available for the execution date, the task should fail. Period.

No clever workarounds.
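
The loading step shrank to roughly this; it reuses the names from the snippet above, and the only change is that a missing day now fails the task instead of substituting another date:

try:
    data = load_data(execution_date)
except DataNotAvailableError:
    logger.error(f"Data for {execution_date} not available; failing the task")
    raise  # let Airflow mark the task failed and retry on its own schedule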

  2. Make idempotency explicit. We added a merge operation in Snowflake using the execution date as part of the deduplication key:

MERGE INTO target_table t
USING source_data s
ON t.transaction_id = s.transaction_id
   AND t.execution_date = s.execution_date
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...

  3. Add execution date validation. Every stage of the pipeline now validates that it's processing the correct date:

def validate_execution_date(data, expected_date):
    actual_date = data['date'].max()
    if actual_date != expected_date:
        raise ExecutionDateMismatchError(
            f"Expected {expected_date}, got {actual_date}"
        )

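For context, a stage wires the check in roughly like this; load_data comes from the earlier snippet, and write_to_snowflake is a stand-in name for our sink step, not a real function:

data = load_data(execution_date)
validate_execution_date(data, execution_date)  # raises on any mismatch
write_to_snowflake(data, execution_date)       # stand-in for the write step
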
The Recovery 

Cleaning up 47 copies of the same data wasn't trivial. We couldn't just delete everything and reprocess. We had 46 duplicate copies mixed in with legitimate data from Sunday and Monday.

We ended up writing a cleanup script that identified duplicates by transaction ID and execution date, keeping only the first occurrence of each transaction for Saturday. It took six hours to run and required careful validation afterward.
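
The core of the cleanup looked roughly like the sketch below. It uses assumed table and column names, and the real script also had to leave Sunday's and Monday's rows untouched and validate row counts before and after:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

transactions = spark.table("analytics.transactions")  # assumed table name

# For Saturday's rows, keep only the earliest-loaded copy of each transaction.
first_copy = Window.partitionBy("transaction_id").orderBy(F.col("load_timestamp").asc())

saturday_deduped = (
    transactions
    .filter(F.col("data_date") == "2024-11-09")            # Saturday's data only
    .withColumn("row_num", F.row_number().over(first_copy))
    .filter(F.col("row_num") == 1)                          # first occurrence wins
    .drop("row_num")
)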

What I Learned 

  • Idempotency requires discipline. It's not enough to say "our pipeline is idempotent." Every retry, every fallback, every "clever" piece of error handling needs to preserve the guarantee: same input → same output.
  • Test with weekend data. Our tests all ran with consecutive weekdays. Saturday and Sunday have different data patterns, lower volumes, different transaction types, different timing. If we'd tested with weekend data, we would have caught this immediately.
  • Fail loudly on data mismatches. The execution date and the data date should always match. When they don't, something is deeply wrong. Failing fast would have prevented 46 unnecessary retries.
  • Monitor successful runs too. We had alerts for failures, but we weren't monitoring whether successful runs were processing the correct data. We've since added data quality checks that validate the date range of processed data (roughly the kind of check sketched after this list).
  • Beware of defensive code. The fallback logic seemed reasonable: always deliver something rather than nothing. But in data pipelines, delivering the wrong data is often worse than delivering nothing at all.

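That date-range check is roughly the following shape, again as a sketch with the same assumed table and column names as the earlier snippets:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def check_date_range(execution_date: str) -> None:
    # Data quality check on what actually landed: the rows written by a run
    # should cover exactly that run's date, nothing earlier and nothing later.
    landed = (spark.table("analytics.transactions")   # assumed table name
        .filter(F.col("execution_date") == execution_date)
        .agg(F.min("data_date").alias("min_d"), F.max("data_date").alias("max_d"))
        .first())
    if landed.min_d != execution_date or landed.max_d != execution_date:
        raise ValueError(
            f"Run {execution_date} wrote data dated {landed.min_d} to {landed.max_d}"
        )
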
The Aftermath 

Three months later, our pipeline has better monitoring, clearer error messages and, ironically, simpler retry logic. We removed the "clever" fallbacks. Tasks either succeed with the correct data or fail explicitly.

The incident cost us a weekend of manual cleanup and some uncomfortable conversations with stakeholders about data quality. But it taught the whole team a valuable lesson about the difference between "working" code and reliable production systems.

If your retry logic includes phrases like "fall back to" or "use previous data instead," take a closer look. You might be one failed task away from your own 47x incident.
