# Kaggle Competition: Helios Corn Futures Climate Challenge

## Overview

* The notebook creates many derived features from climate risk columns: scaled variants, flaggers (IQR-based), production-weighted features, category-level severity measures, rolling-window exposure-weighted severity (30/60/120 days) and shock indicators.
* Feature selection picks top correlated climate-derived features per futures column (TOP\_K=5), resulting in 16 unique climate features selected in this run.
* CFCS is a custom composite metric prioritizing average magnitude of significant correlations, then max correlation, then proportion of significant correlations.
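
Written out, the composite computed by the scoring code further down this page looks as follows. The symbols $$S_{\text{avg}}$$, $$S_{\text{max}}$$, $$S_{\text{cnt}}$$, $$N$$ and $$r$$ are shorthand introduced for this note, not the host's notation; $$r$$ ranges over the valid (non-null) monthly correlations and $$N$$ is their count.

$$
\mathrm{CFCS} = 0.5\,S_{\text{avg}} + 0.3\,S_{\text{max}} + 0.2\,S_{\text{cnt}}
$$

$$
S_{\text{avg}} = \min\bigl(100,\ 100 \cdot \operatorname{mean}\{|r| : |r| \ge 0.5\}\bigr), \quad
S_{\text{max}} = \min\bigl(100,\ 100 \cdot \max |r|\bigr), \quad
S_{\text{cnt}} = 100 \cdot \frac{\#\{|r| \ge 0.5\}}{N}
$$

If no correlation reaches the 0.5 threshold, $$S_{\text{avg}}$$ is taken as 0.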

## Epilogue

This competition is about engineering a compact set of highly correlated features from the host model’s outputs, along with country market shares and commodity futures prices.

Though I may still grieve over missing the deadline, I managed to recover the evaluation scoring function. That discovery alone moved me decisively forward. The evaluator scores you relative to the entire feature set you submit: if your submission contains one strong, highly correlated feature but is padded with many weaker or redundant features, your overall score drops sharply.

This immediately reframed the task: the goal was not to *maximize* the number of signals I could extract, but to *curate* a small portfolio of features whose marginal contributions were consistently positive. In other words, feature engineering here is less about creativity and more about restraint: eliminating anything that dilutes the set’s average utility.
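
To make the dilution concrete, here is a minimal sketch that applies the weights from the scoring function quoted later on this page (0.5 / 0.3 / 0.2, significance threshold 0.5) to two made-up sets of correlations. The `cfcs` helper and the numbers are illustrative assumptions; the real evaluator first computes correlations per country and month.

```python
import numpy as np

def cfcs(corrs, threshold=0.5):
    """Simplified CFCS on a flat list of correlations (hypothetical helper)."""
    abs_corrs = np.abs(np.asarray(corrs, dtype=float))
    sig = abs_corrs[abs_corrs >= threshold]
    avg_sig_score = min(100.0, sig.mean() * 100) if sig.size else 0.0
    max_corr_score = min(100.0, abs_corrs.max() * 100)
    sig_count_score = sig.size / abs_corrs.size * 100
    return 0.5 * avg_sig_score + 0.3 * max_corr_score + 0.2 * sig_count_score

# Three strong correlations on their own ...
compact = [0.8, 0.6, 0.55]
# ... versus the same three padded with 97 weak ones.
padded = compact + [0.1] * 97

print(round(cfcs(compact), 2))  # about 76.5
print(round(cfcs(padded), 2))   # about 57.1
```

In this toy case the average and the maximum are untouched by the padding; the whole drop comes from the share of significant correlations falling from 100% to 3%.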

With limited time on my hands, my plan was as follows:

1. Locating plausible features

   1. Recall all strategies used in feature engineering. Inference methods include correlation analysis, causal relations, decision trees, temporal states, conditional probability and Bayes' theorem. Condition and transform the features with every kind of activation function known to mankind, specifically ones not limited to grade 3 mathematics, or more specifically $$\sum_{i=0}^{N} a_i x_i$$. My passion never stopped at polynomials, and I am cursed to yearn for Taylor or Maclaurin series. What is there to lose in powering the coefficient weights until NumPy's floating-point precision turns against you?
   2. With Python, I programmed some utility functions to profile insights and drafted termination criteria for the search and ranking algorithms to surface the candidate feature pool.
   3. Iterate and invoke these methods in an `OrderedDict` for as long as time permits. The key goal here was not to measure myself by what my past self once deemed impossible.

   Even though my Jupyter notebook produced 400-500 features, the final set was constrained by design: 5 per target, then deduplicated.
2. Selecting the top features
   1. The simplest strategy is to compute the (averaged) pairwise correlations of the independent features against the dependent ones, e.g. temporal features against market futures prices. The immediate thought was to select the top K features by absolute correlation. This strategy is flawed in that its ranking is purely marginal: it ignores redundancy, so the highest-ranked features can be essentially duplicates of one another rather than mutually informative.
   2. I doubted the first steps so much that I may have taken leave of my senses when deciding on this step. With a few hours left, I allowed myself one more trial. Out of all methods, my foolish self chose the least autonomous option: hypothesis testing, that is, ranking features via F-scores, p-values, and t-scores.
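
The redundancy problem from step 2.1 can be illustrated with a greedy variant of top-K selection that skips candidates too correlated with features already chosen. This is a sketch of one possible remedy, not what the notebook below does; `greedy_select`, the `max_redundancy` cutoff, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def greedy_select(X, y, k=3, max_redundancy=0.9):
    """Pick up to k columns of X by |corr with y|, skipping near-duplicates."""
    n_features = X.shape[1]
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    chosen = []
    for j in np.argsort(-relevance):  # most relevant first
        redundant = any(
            abs(np.corrcoef(X[:, j], X[:, c])[0, 1]) > max_redundancy
            for c in chosen
        )
        if not redundant:
            chosen.append(int(j))
        if len(chosen) == k:
            break
    return chosen

# Synthetic example: column 1 is a noisy copy of column 0, column 2 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
other = rng.normal(size=500)
y = base + 0.5 * other + rng.normal(scale=0.1, size=500)
X = np.column_stack([base, base + rng.normal(scale=0.01, size=500), other])

selected = greedy_select(X, y, k=2)
print(selected)
```

A plain top-2 by absolute correlation would return columns 0 and 1, which carry the same information; the greedy pass keeps only one of them and then picks column 2.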

I concluded with one feature vector with a correlation of approximately 0.8 against the second-month corn future. The applied transformation combined weighted scores of the Price Market Shares Ratio with the insights aggregated over the past rolling 30/60/120 days. Additionally, the data samples suggest that the host model's outputs are fragile under temporal shifts. Assuming no technical debt or unintended side effects, it is likely that these features must be weighted and compounded over time.

By the competition's evaluation scheme, this single outlier was not enough to raise the average significant correlation. It is apparent that the strongest feature candidates were rarely raw climate counts but weighted transforms conditioned on different time intervals or harvest periods.

At this stage, it would be foolish of me to mention **bootstrap sampling methods.** By the end of the first stage, Monte Carlo should already be my ally.

I may have recognized the right tool. My failure was not insight, but execution. A debt I will repay in the next iteration.
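
For the record, here is a minimal sketch of what a bootstrap check on a single correlation might look like: resample rows with replacement and read a percentile interval off the resampled correlations. The helper `bootstrap_corr_ci` and the synthetic data are assumptions for illustration. Note that it resamples rows independently, which ignores autocorrelation; for daily series a block bootstrap is probably the better fit.

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample row indices with replacement
        stats[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic pair with a true correlation of 0.8 (illustrative only).
rng = np.random.default_rng(42)
x = rng.normal(size=300)
y = x + 0.75 * rng.normal(size=300)

lo, hi = bootstrap_corr_ci(x, y)
print(f"95% CI for corr(x, y): [{lo:.3f}, {hi:.3f}]")
```

A feature whose interval straddles the 0.5 significance threshold is one whose contribution to the score could plausibly vanish on a different sample.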

## Introduction

Helios Corn Futures Climate Challenge - Submission Sample

* Objective: build better signals for commodity futures behaviour (prices, returns, volatility, term structure)
* Evaluation Metric: Climate-Futures Correlation Score (CFCS)

## Workspace Configuration

```python
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from pathlib import Path
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("Libraries loaded successfully!")

INPUT_DIR = Path("/kaggle/input")
INPUT_PATH = INPUT_DIR / "forecasting-the-future-the-helios-corn-climate-challenge"
OUTPUT_DIR = Path("/kaggle/working")
OUTPUT_PATH = OUTPUT_DIR / "submission.csv"

df = pd.read_csv(INPUT_PATH / "corn_climate_risk_futures_daily_master.csv")
market_share_df = pd.read_csv(INPUT_PATH / "corn_regional_market_share.csv")

print(f"Main dataset shape: {df.shape}")
print(f"Date range: {df['date_on'].min()} to {df['date_on'].max()}")
print(f"Countries: {df['country_name'].nunique()}")
print(f"Regions: {df['region_name'].nunique()}")

# Parse dates, then identify climate risk and futures columns
df['date_on'] = pd.to_datetime(df['date_on'])
climate_cols = [c for c in df.columns if c.startswith('climate_risk_')]
futures_cols = [c for c in df.columns if c.startswith('futures_')]

print(f"Climate risk columns ({len(climate_cols)}):")
for col in climate_cols:
    print(f"  - {col}")

print(f"\nFutures columns ({len(futures_cols)}):")
for col in futures_cols[:10]:  # Show first 10
    print(f"  - {col}")
if len(futures_cols) > 30:
    print(f"  ... and {len(futures_cols) - 10} more")

df.head(5)
```

<details>

<summary>Console output &#x26; sample dataframe (click to expand)</summary>

```
Libraries loaded successfully!
Main dataset shape: (320661, 41)
Date range: 2016-01-01 to 2025-12-15
Countries: 11
Regions: 89
Climate risk columns (12):
  - climate_risk_cnt_locations_heat_stress_risk_low
  - climate_risk_cnt_locations_heat_stress_risk_medium
  - climate_risk_cnt_locations_heat_stress_risk_high
  - climate_risk_cnt_locations_unseasonably_cold_risk_low
  - climate_risk_cnt_locations_unseasonably_cold_risk_medium
  - climate_risk_cnt_locations_unseasonably_cold_risk_high
  - climate_risk_cnt_locations_excess_precip_risk_low
  - climate_risk_cnt_locations_excess_precip_risk_medium
  - climate_risk_cnt_locations_excess_precip_risk_high
  - climate_risk_cnt_locations_drought_risk_low
  - climate_risk_cnt_locations_drought_risk_medium
  - climate_risk_cnt_locations_drought_risk_high

Futures columns (17):
  - futures_close_ZC_1
  - futures_close_ZC_2
  - futures_close_ZW_1
  - futures_close_ZS_1
  - futures_zc1_ret_pct
  - futures_zc1_ret_log
  - futures_zc_term_spread
  - futures_zc_term_ratio
  - futures_zc1_ma_20
  - futures_zc1_ma_60
... (dataframe head printed)
```

</details>

## Utilities

```python
from dataclasses import dataclass, field

@dataclass(slots=True)
class Encoder:
    labels: list = field(init=True, repr=False)
    tag2idx: dict = field(init=False)
    idx2tag: dict = field(init=False)

    def __post_init__(self):
        self.labels = sorted([i.strip() if isinstance(i, str) else "Unknown" for i in self.labels])
        self.tag2idx = {label: idx for idx, label in enumerate(self.labels)}
        self.idx2tag = {idx: label for idx, label in enumerate(self.labels)}

class ClimateLabels:
    """Climate risk signals output by the host model, not the actual observed values during the event."""

    heat_stress = ['climate_risk_cnt_locations_heat_stress_risk_low','climate_risk_cnt_locations_heat_stress_risk_medium','climate_risk_cnt_locations_heat_stress_risk_high']
    cold_stress = ['climate_risk_cnt_locations_unseasonably_cold_risk_low', 'climate_risk_cnt_locations_unseasonably_cold_risk_medium','climate_risk_cnt_locations_unseasonably_cold_risk_high']
    precip_stress = ['climate_risk_cnt_locations_excess_precip_risk_low','climate_risk_cnt_locations_excess_precip_risk_medium',
    'climate_risk_cnt_locations_excess_precip_risk_high']
    drought_stress = ['climate_risk_cnt_locations_drought_risk_low', 'climate_risk_cnt_locations_drought_risk_medium','climate_risk_cnt_locations_drought_risk_high']
    columns = heat_stress + cold_stress + precip_stress + drought_stress
    # helper labels
    extreme_signals = [
        'climate_risk_cnt_locations_heat_stress_risk_high',
        'climate_risk_cnt_locations_unseasonably_cold_risk_high',
        'climate_risk_cnt_locations_excess_precip_risk_high',
        'climate_risk_cnt_locations_drought_risk_high',
    ]
    medium_signals = [
        'climate_risk_cnt_locations_heat_stress_risk_medium',
        'climate_risk_cnt_locations_unseasonably_cold_risk_medium',
        'climate_risk_cnt_locations_excess_precip_risk_medium',
        'climate_risk_cnt_locations_drought_risk_medium',
    ]
    low_signals = [
        'climate_risk_cnt_locations_heat_stress_risk_low',
        'climate_risk_cnt_locations_unseasonably_cold_risk_low',
        'climate_risk_cnt_locations_excess_precip_risk_low',
        'climate_risk_cnt_locations_drought_risk_low',
    ]
    categories = ["heat_stress", "unseasonably_cold", "excess_precip", "drought"]

# Commodity Futures Pricing Signals
# C=Corn, W=Wheat, S=Soybean
# 1=FRONT MONTH FUTURES, 2=2ND MONTH FUTURES
# Closing price in the front month wrt commodity type
class FutureLabels:
    market_share = ["percent_country_production"]
    front_month_prices = ['futures_close_ZC_1', 'futures_close_ZW_1', 'futures_close_ZS_1']
    # Closing price in the second month wrt commodity type
    second_month_prices = ['futures_close_ZC_2']
    # Daily percentage / logs return for corn front-month
    daily_returns = ['futures_zc1_ret_pct', 'futures_zc1_ret_log']
    # Price diff / ratio of 2nd to front months
    spread_returns = ['futures_zc_term_spread', 'futures_zc_term_ratio']
    # Moving Averages wrt suffix days
    ma_measures = ['futures_zc1_ma_20', 'futures_zc1_ma_60', 'futures_zc1_ma_120']
    # Volatility wrt suffix days
    vol_measures = ['futures_zc1_vol_20', 'futures_zc1_vol_60']
    # quick ref
    measures = ma_measures + vol_measures
    close_prices = front_month_prices + second_month_prices
    columns = close_prices + daily_returns + spread_returns + ma_measures + vol_measures + market_share
    # extra
    cross_relations = [
        'futures_zw_zc_spread',
        'futures_zc_zw_ratio',
        'futures_zs_zc_spread',
        'futures_zc_zs_ratio'
    ]

class MetaLabels:
    identifiers = [
        'ID',
        'crop_name',
        'country_name',
        'country_code',
        'region_name',
        'region_id',
    ]
    temporal = ["harvest_period", "growing_season_year", "date_on"]
    columns = identifiers + temporal
    extra = ['date_on_year', 'date_on_month', 'date_on_year_month']

class ConfigLabels:
    identifiers = ["country_name", "region_name", "harvest_period"]
    futures = FutureLabels.measures
    climate = ClimateLabels
    meta = ["percent_country_production"]
    dt = ["date_on"]
    columns = identifiers + futures + climate.columns + meta
    x = ["harvest_period"] + climate.columns
    y = futures
    submission = ["date_on", "country_name", "region_name"]
    # new columns added
    temperature_risks = ['heat_stress', 'unseasonably_cold']
    precipitation_risks = ['excess_precip', 'drought']

class Labels:
    submission = FutureLabels.columns + FutureLabels.cross_relations + MetaLabels.columns + MetaLabels.extra

def create_submission(data: pd.DataFrame, submission_cols: list, output_file_path: Path = OUTPUT_PATH):
    # Work on a copy so the caller's `date_on` column keeps its datetime dtype
    out = data[submission_cols].copy()
    if "date_on" in out.columns:
        out["date_on"] = pd.to_datetime(out["date_on"]).dt.strftime('%Y-%m-%d')
    out.to_csv(output_file_path, index=False)
    print(f"Submitted file to: {output_file_path}")
```

## Data Preparation

```python
# Create a working copy

print("Preparing Dataset: Merging Market shares with main ... ")

merged_daily_df = df.copy()
merged_daily_df["code"] = merged_daily_df["country_name"] + "_" + merged_daily_df["region_name"]
country_region_encoder = Encoder(merged_daily_df.code.unique().tolist())
harvest_encoder = Encoder(merged_daily_df.harvest_period.unique().tolist())
merged_daily_df["code"] = merged_daily_df["code"].map(country_region_encoder.tag2idx)
merged_daily_df["harvest_period"] = merged_daily_df["harvest_period"].map(harvest_encoder.tag2idx)
merged_daily_df['day_of_year'] = merged_daily_df['date_on'].dt.dayofyear
merged_daily_df['quarter'] = merged_daily_df['date_on'].dt.quarter

print(f"Added basic futures\nDataset shape: {merged_daily_df.shape}")

merged_daily_df = merged_daily_df.merge(
    market_share_df[['region_id', 'percent_country_production']],
    on='region_id',
    how='left'
)

# merged_daily_df['percent_country_production'] = merged_daily_df['percent_country_production'].dropna()
median = merged_daily_df["percent_country_production"].median()
merged_daily_df["percent_country_production"] = merged_daily_df["percent_country_production"].fillna(median)
merged_daily_df["percent_country_production"] = merged_daily_df["percent_country_production"] / 100.0
merged_daily_df = merged_daily_df.sort_values(["code", "date_on"], ascending=False)

print(f"Merged with market share data\nProduction share range: {merged_daily_df['percent_country_production'].min():.1f}% to {merged_daily_df['percent_country_production'].max():.1f}%")
print(f"\n{'>'*10} Current Total {len(merged_daily_df.columns)} new features {'>'*10}")
```

<details>

<summary>Console output (click to expand)</summary>

```
Preparing Dataset: Merging Market shares with main ... 
Added basic futures
Dataset shape: (320661, 44)
Merged with market share data
Production share range: 0.0% to 0.7%

>>>>>>>>>> Current Total 45 new features >>>>>>>>>>
```

</details>
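
A left merge like the one above can silently duplicate rows if the lookup table repeats a key, and it leaves `NaN` wherever a region has no share. The sketch below shows one way to guard against both, using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames `daily` and `shares` are made up for this example; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for the daily table and the market share lookup (hypothetical values).
daily = pd.DataFrame({"region_id": [1, 1, 2, 3], "risk": [5, 6, 7, 8]})
shares = pd.DataFrame({"region_id": [1, 2], "percent_country_production": [40.0, 10.0]})

merged = daily.merge(
    shares,
    on="region_id",
    how="left",
    validate="many_to_one",  # raises if region_id is duplicated in `shares`
    indicator=True,          # adds a `_merge` column marking unmatched rows
)
n_missing = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns="_merge")

# Same imputation as in the notebook: fill with the median, then convert percent to fraction.
median_share = merged["percent_country_production"].median()
merged["percent_country_production"] = (
    merged["percent_country_production"].fillna(median_share) / 100.0
)

print(f"rows without a market share: {n_missing}")
print(merged)
```

Knowing how many rows were imputed helps judge how much the production-weighted features downstream depend on the median fill.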

### Flaggers & Tickers

```python
from typing import Callable

LOWER_Q = 0.25
HIGHER_Q = 0.75
MUTATE_SUFFIX = "_scaled_"

def standardize(ds, col: str):
    # Min-max scaling to [0, 1] (despite the name, this is not z-score standardization)
    max_val = ds[col].max()
    min_val = ds[col].min()
    output = (ds[col] - min_val) / (max_val - min_val)
    assert output.max() == 1.0 and output.min() == 0.0
    return output

def mutate_climate_features(df: pd.DataFrame, scaled_fn: Callable = standardize):
    merged_daily_df = df.copy()

    for col in ClimateLabels.columns:
        # Reuse the shared suffix so the flagger step can find these columns
        col_label = f"{col}{MUTATE_SUFFIX}"
        merged_daily_df[col_label] = scaled_fn(merged_daily_df, col=col)

    print(f"+ Added {len(ClimateLabels.columns)} Features of mutated / scaled / standardized datasets")

    return merged_daily_df

def create_conditional_climate_flaggers(ds: pd.DataFrame, scaled_suffix: str = MUTATE_SUFFIX):

    df = ds.copy()
    climate_risk_cols = [col for col in df.columns if str(col).startswith("climate_risk_cnt_locations_") and str(col).endswith(scaled_suffix)]

    for col in climate_risk_cols:
        df[f"{col}_flagger_"] = df[col].between(df[col].quantile(LOWER_Q), df[col].quantile(HIGHER_Q), inclusive="both").astype(int)

    print(f"+ Added {len(climate_risk_cols)} Flaggers from standardized features")
    return df

def create_weighted_climate_features(ds: pd.DataFrame, weight_col: str = "percent_country_production"):
    df = ds.copy()
    climate_risk_cols = [col for col in df.columns if str(col).startswith("climate_risk_cnt_locations_") and not str(col).endswith("_")]

    c = 0
    for risk_col in climate_risk_cols:
        df[f"{risk_col}_{weight_col}_weighted_"] = df[risk_col] * df[weight_col]
        df[f"{risk_col}_{weight_col}_weighted_sq_"] = df[risk_col] * (df[weight_col] ** 2)
        c+=2

    print(f"+ Added {c} weighted climate features")

    return df


def add_category_severity_features(df: pd.DataFrame, eps: float = 1e-9) -> pd.DataFrame:
    df = df.copy()

    mapping = {
        "heat_stress": ClimateLabels.heat_stress,
        "unseasonably_cold": ClimateLabels.cold_stress,
        "excess_precip": ClimateLabels.precip_stress,
        "drought": ClimateLabels.drought_stress,
    }

    for cat, cols in mapping.items():
        low, med, high = cols
        total = df[low] + df[med] + df[high]
        df[f"climate_{cat}_severity_"] = (df[med] + 2.0 * df[high]) / (total + eps)
        df[f"climate_{cat}_high_share_"] = df[high] / (total.replace(0, np.nan))
        df[f"climate_{cat}_total_cnt_"] = total

    return df

def add_group_dynamics(
    df: pd.DataFrame,
    cols: list[str],
    group_cols=("harvest_period", "country_name", "region_name"),
    time_col="date_on",
    windows=(30, 60, 120),
) -> pd.DataFrame:
    df = df.sort_values([*group_cols, time_col]).copy()

    for c in cols:
        g = df.groupby(list(group_cols))[c]
        df[f"{c}_diff1"] = g.diff()
        thr = df[f"{c}_diff1"].quantile(0.95)
        df[f"{c}_shock_up"] = (df[f"{c}_diff1"] > thr).astype(int)

        for w in windows:
            df[f"{c}_roll{w}_sum_"] = g.transform(lambda s: s.rolling(w, min_periods=1).sum())
            df[f"{c}_ewm{w}_"] = g.transform(lambda s: s.ewm(span=w, adjust=False).mean())

    return df

# Create Flaggers
merged_daily_df = mutate_climate_features(merged_daily_df, standardize)
merged_daily_df = create_conditional_climate_flaggers(merged_daily_df)
merged_daily_df = create_weighted_climate_features(merged_daily_df)
merged_daily_df = add_category_severity_features(merged_daily_df)
merged_daily_df = add_group_dynamics(merged_daily_df, cols=ClimateLabels.medium_signals)

print(f"\n{'>'*10} Current Total {len(merged_daily_df.columns)} new features {'>'*10}")
```

<details>

<summary>Console output (click to expand)</summary>

```
+ Added 12 Features of mutated / scaled / standardized datasets
+ Added 12 Flaggers from standardized features
+ Added 24 weighted climate features

>>>>>>>>>> Current Total 137 new features >>>>>>>>>>
```

</details>

### Temporal Features & Risk Momentum & Deltas

* Risk Momentums
* Cross Country Relations
* Price Indicators and volatility

```python
import numpy as np

tds = [30, 60, 120]
grp = ["harvest_period", "code"]
time_col = "date_on"  # change if needed
weight_col = FutureLabels.market_share[0]  # "percent_country_production"
eps = 1e-9

# rolling depends on row order
merged_daily_df = merged_daily_df.sort_values(grp + [time_col]).copy()

for idx, window in enumerate(reversed(tds), start=1):
    wgt = idx / 2

    # keep track of per-category exposure-weighted severity cols for a window-level composite
    window_cat_scores = []

    for risk_col in ClimateLabels.categories:
        level_cols = {
            "low":    f"climate_risk_cnt_locations_{risk_col}_risk_low",
            "medium": f"climate_risk_cnt_locations_{risk_col}_risk_medium",
            "high":   f"climate_risk_cnt_locations_{risk_col}_risk_high",
        }

        created = {}

        # 1) weighted rolling features for each level
        for level, climate_col in level_cols.items():
            if climate_col not in merged_daily_df.columns:
                raise KeyError(f"Missing column: {climate_col}")

            new_col = f"climate_{risk_col}_roll{window}_{level}_sum_"
            merged_daily_df[new_col] = (
                merged_daily_df
                .groupby(grp)[climate_col]
                .transform(lambda s: wgt * s * s.rolling(window, min_periods=1).sum())
                .fillna(0)
            )
            created[level] = new_col

        # 2) severity on the *rolled-weighted* features (dimensionless, robust)
        sev_col = f"climate_{risk_col}_roll{window}_severity_"
        low_c, med_c, high_c = created["low"], created["medium"], created["high"]
        total = merged_daily_df[low_c] + merged_daily_df[med_c] + merged_daily_df[high_c]

        merged_daily_df[sev_col] = (merged_daily_df[med_c] + 2.0 * merged_daily_df[high_c]) / (total + eps)

        # 3) exposure-weighted severity (production importance)
        sev_x_prod = f"{sev_col}x_prod_"
        merged_daily_df[sev_x_prod] = merged_daily_df[sev_col] * merged_daily_df[weight_col]
        window_cat_scores.append(sev_x_prod)

        # 4) dynamics: diff + shock
        diff_col = f"{sev_x_prod}diff1_"
        shock_col = f"{sev_x_prod}shock_up_"

        merged_daily_df[diff_col] = merged_daily_df.groupby(grp)[sev_x_prod].diff()
        thr = merged_daily_df[diff_col].quantile(0.95)
        merged_daily_df[shock_col] = (merged_daily_df[diff_col] > thr).astype(int)

        print(f"[{risk_col}] severity + exposure-weighted severity")
        print(merged_daily_df[[sev_col, sev_x_prod, diff_col, shock_col]].describe())

    # 5) window-level composite score across categories (mean exposure-weighted severity)
    comp_col = f"climate_roll{window}_risk_score_x_prod_"
    merged_daily_df[comp_col] = merged_daily_df[window_cat_scores].mean(axis=1)

    # optional: z-score composite per dataset (comment out if you don't want standardization)
    mu = merged_daily_df[comp_col].mean()
    sd = merged_daily_df[comp_col].std(ddof=0)
    merged_daily_df[f"{comp_col}z_"] = (merged_daily_df[comp_col] - mu) / (sd + 1e-12)

    print(f"\n[COMPOSITE] window={window} -> {comp_col}")
    print(merged_daily_df[[comp_col, f"{comp_col}z_"]].describe())

print(f"\n{'>'*10} Current Total {len(merged_daily_df.columns)} columns {'>'*10}")
```

<details>

<summary>Large console output of severity / composite stats (click to expand)</summary>

(Output trimmed in the original; the script prints descriptive statistics for each category and window, and final counts)

```
>>>>>>>>>> Current Total 227 columns >>>>>>>>>>
```

</details>
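
The grouped rolling pattern used above (`groupby(...).transform(lambda s: s.rolling(...).sum())`) can be checked on a toy frame to confirm that windows restart at each group boundary. The frame `toy` and its values are hypothetical; a window of 2 rows is used so the result is easy to verify by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "code": ["A", "A", "A", "B", "B", "B"],
    "date_on": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "risk": [1, 2, 3, 10, 20, 30],
})

# Rolling windows depend on row order, so sort within each group first.
toy = toy.sort_values(["code", "date_on"])

toy["risk_roll2_sum"] = (
    toy.groupby("code")["risk"]
    .transform(lambda s: s.rolling(2, min_periods=1).sum())
)

print(toy)
# risk_roll2_sum: A -> 1, 3, 5 and B -> 10, 30, 50 (group B does not inherit A's last row)
```

An integer window counts rows, not calendar days, so `rolling(30)` equals 30 days only when each group has exactly one row per day with no gaps.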

## Feature Selection

```python
TOP_K = 5
fts = pd.DataFrame()
all_climate_features = [i for i in merged_daily_df.columns if i.startswith("climate")]

for future_col in FutureLabels.columns:
    if future_col != "percent_country_production":
        corrs = merged_daily_df[all_climate_features].corrwith(merged_daily_df[future_col]).abs().sort_values(ascending=False).head(TOP_K)
        corrs_df = corrs.reset_index().rename(columns={"index": "feature", 0: "corr"})
        corrs_df["returns"] = future_col
        fts = pd.concat([fts, corrs_df], axis=0)

report_df = fts.sort_values("corr", ascending=False)
final_climate_cols = list(fts.feature.unique())

print(f"\n{'>'*10} Selected {fts.feature.nunique()} columns {'>'*10}")
```

<details>

<summary>Console output (click to expand)</summary>

```
>>>>>>>>>> Selected 16 columns >>>>>>>>>>
```

</details>

## Evaluation Test

```python
def compute_monthly_climate_futures_correlations(df):

    # Dynamic detection
    climate_cols = [c for c in df.columns if c.startswith("climate_risk_")]
    futures_cols = [c for c in df.columns if c.startswith("futures_")]

    # Remove future data
    max_valid_date = df["date_on"].max()
    df = df[df["date_on"] <= max_valid_date]

    results = []
    # Loop by commodity + month
    for comm in df["crop_name"].unique():
        df_comm = df[df["crop_name"] == comm]

        for country in sorted(df_comm["country_name"].unique()):
            df_country = df_comm[df_comm["country_name"] == country]

            for month in sorted(df_country["date_on_month"].unique()):
                df_month = df_country[df_country["date_on_month"] == month]

                for clim in climate_cols:
                    for fut in futures_cols:

                        if df_month[clim].std() > 0 and df_month[fut].std() > 0:
                            corr = df_month[[clim, fut]].corr().iloc[0, 1]
                        else:
                            corr = None

                        results.append({
                            "crop_name": comm,
                            "country_name": country,
                            "month": month,
                            "climate_variable": clim,
                            "futures_variable": fut,
                            "correlation": corr
                        })

    results_df = pd.DataFrame(results)
    results_df['correlation'] = results_df['correlation']

    return results_df

def calculate_cfcs_score(correlations_df):
    """
    Calculate the Climate-Futures Correlation Score (CFCS) for leaderboard ranking.

    CFCS = (0.5 × Avg_Sig_Corr_Score) + (0.3 × Max_Corr_Score) + (0.2 × Sig_Count_Score)

    Focus on significant correlations (≥ |0.5|) only for average calculation.
    """
    # Remove null correlations
    valid_corrs = correlations_df["correlation"].dropna()

    if len(valid_corrs) == 0:
        return {'cfcs_score': 0.0, 'error': 'No valid correlations'}

    # Calculate base metrics
    abs_corrs = valid_corrs.abs()
    max_abs_corr = abs_corrs.max()
    significant_mask = abs_corrs >= 0.5
    significant_corrs = abs_corrs[significant_mask]
    significant_count = len(significant_corrs)
    total_count = len(valid_corrs)

    # Calculate component scores - ONLY average significant correlations
    if significant_count > 0:
        avg_sig_corr = significant_corrs.mean()
        avg_sig_score = min(100, avg_sig_corr * 100)  # Cap at 100 when avg sig reaches 1.0
    else:
        avg_sig_corr = 0.0
        avg_sig_score = 0.0

    max_corr_score = min(100, max_abs_corr * 100)  # Cap at 100 when max reaches 1.0
    sig_count_score = (significant_count / total_count) * 100  # Percentage

    # Composite score: Focus more on quality of significant correlations
    cfcs = (0.5 * avg_sig_score) + (0.3 * max_corr_score) + (0.2 * sig_count_score)

    return {
        'cfcs_score': round(cfcs, 2),
        'avg_significant_correlation': round(avg_sig_corr, 4),
        'max_abs_correlation': round(max_abs_corr, 4),
        'significant_correlations_pct': round(sig_count_score, 2),
        'avg_sig_score': round(avg_sig_score, 2),
        'max_corr_score': round(max_corr_score, 2),
        'sig_count_score': round(sig_count_score, 2),
        'total_correlations': total_count,
        'significant_correlations': significant_count
    }

def run_scoring(merged_daily_df: pd.DataFrame):
    monthly_corr_df = compute_monthly_climate_futures_correlations(merged_daily_df)
    score_results = calculate_cfcs_score(monthly_corr_df)

    print("=== CLIMATE-FUTURES CORRELATION SCORE (CFCS) ===")
    print(f"Final CFCS Score: {score_results['cfcs_score']}\n")
    print("Component Breakdown:")
    print(f"  Average Significant |Correlation|: {score_results['avg_significant_correlation']:.4f} → Score: {score_results['avg_sig_score']}")
    print(f"  Maximum |Correlation|: {score_results['max_abs_correlation']:.4f} → Score: {score_results['max_corr_score']}")
    print(f"  Significant Correlations: {score_results['significant_correlations']}/{score_results['total_correlations']} ({score_results['significant_correlations_pct']:.1f}%) → Score: {score_results['sig_count_score']}\n")
    print("Score Calculation:")
    print(f"  CFCS = (0.5 × {score_results['avg_sig_score']}) + (0.3 × {score_results['max_corr_score']}) + (0.2 × {score_results['sig_count_score']})")
    print(f"  CFCS = {0.5 * score_results['avg_sig_score']:.1f} + {0.3 * score_results['max_corr_score']:.1f} + {0.2 * score_results['sig_count_score']:.1f} = {score_results['cfcs_score']}")
    print("Key Insight: This metric focuses on the QUALITY of significant correlations rather than being diluted by weak signals.")
```

## Submission & Scoring

```python
# final = submission.merge(right=merged_daily_df, on=MetaLabels.identifiers + ["date_on"] + MetaLabels.extra, how="left")

# merged_daily_df["date_on"] = pd.to_datetime(merged_daily_df['date_on']).dt.strftime('%Y-%m-%d').info()
# merged_daily_df.to_csv(OUTPUT_PATH, index=False)
final_cols = list(final_climate_cols) + Labels.submission
# run_scoring(merged_daily_df[final_cols])
create_submission(data=merged_daily_df, submission_cols=final_climate_cols)
```

<details>

<summary>Final scoring output (click to expand)</summary>

```
=== CLIMATE-FUTURES CORRELATION SCORE (CFCS) ===
Final CFCS Score: 51.53

Component Breakdown:
  Average Significant |Correlation|: 0.5721 → Score: 57.21
  Maximum |Correlation|: 0.7600 → Score: 76.0
  Significant Correlations: 111/17289 (0.6%) → Score: 0.64

Score Calculation:
  CFCS = (0.5 × 57.21) + (0.3 × 76.0) + (0.2 × 0.64)
  CFCS = 28.6 + 22.8 + 0.1 = 51.53
Key Insight: This metric focuses on the QUALITY of significant correlations rather than being diluted by weak signals.
```

</details>


