Codeaza Technologies · Data Engineering

Rulebook — S3 Data Store: Technical Specification

Observed bucket layout, the storage contract the application currently writes, concrete defects, and a target file/folder hierarchy with naming spec, access model, lifecycle, and a copy-level migration runbook.

Bucket: stock-rulebook-prod
Top-level prefixes: 86
Objects: ~13.4k
Audience: Technical Lead / App team
● Observed Contract

The per-dataset pattern the app actually writes

Every exchange/regulator prefix follows the same 4-folder shape. This is the implicit contract; it is just not enforced or centralised.

# Observed under every exchange prefix (e.g. fee_cboe_us_options/, fee_nyse/)
<source>/
├── annotations/{YYYY-MM-DD}/{uuid32}.json      # per-day annotation blobs
├── differences/{YYYY-MM-DD}/latest_annotated_*.pdf
│                              recent_annotated_*.pdf
│                              *_changes.html
├── parsed/                                     # parsed source docs
└── processed_chunks/                           # chunked text for embeddings/RAG
Takeaway: The data model is sound: date-partitioned and typed by stage (annotations / differences / parsed / processed_chunks). The problem is purely where it lives (one public bucket) and how the top-level key is named (uncontrolled), not the per-dataset shape.
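The observed shape can be expressed as a small parser, which is handy for inventorying the bucket before migration. This is a minimal sketch, not the application's code: `ObservedKey` and `parse_observed_key` are hypothetical names, and the date segment is treated as optional because the listing above only shows dates under annotations/ and differences/.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# The four stage folders observed under every source prefix.
STAGES = ("annotations", "differences", "parsed", "processed_chunks")


class ObservedKey(NamedTuple):
    source: str
    stage: str
    day: Optional[date]   # None when the key carries no date segment
    filename: str


# <source>/<stage>/[<YYYY-MM-DD>/]<filename>
_KEY_RE = re.compile(
    r"^(?P<source>[^/]+)/(?P<stage>[^/]+)/"
    r"(?:(?P<day>\d{4}-\d{2}-\d{2})/)?(?P<filename>.+)$"
)


def parse_observed_key(key: str) -> Optional[ObservedKey]:
    """Split an observed key into its parts, or return None if it does not fit."""
    m = _KEY_RE.match(key)
    if m is None or m["stage"] not in STAGES:
        return None
    day = date.fromisoformat(m["day"]) if m["day"] else None
    return ObservedKey(m["source"], m["stage"], day, m["filename"])
```

Keys that return None (logs, FTP uploads, leading-slash keys) are the ones that fall outside the per-dataset contract and need a separate disposition.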
● Current State

Actual top-level tree of stock-rulebook-prod

86 prefixes, one bucket, public-read. Each entry below is annotated with a disposition: keep (→ data bucket), move, or evict/fix.

stock-rulebook-prod/  (PUBLIC read on /*)
│
├── ── market & regulatory DATA (~74 prefixes) ──            keep → rulebook-prod-data
│   ├── fee_nyse/  fee_nyse_american/  fee_nyse_arca/  fee_nyse_national/ …
│   ├── fee_cboe_us_options/  fee_cboe_bzx_equities/  fee_cboe_europe/ … (10x cboe)
│   ├── fee_miax_options/  fee_miax_pearl_equities/ … (5x miax)
│   ├── fee_euronext/  fee_six_swiss_exchange/  fee_xetra_german/  fee_aquis/ …
│   ├── fee_turquoise/  fee_turquoise_tghe/  fee_turquoise_tghl/
│   ├── nasdaq/  nasdaq_equity_fix/  nasdaq_options_fix/  nasdaqbx/  nasdaq_texas/
│   ├── nasdaq_gemx/  nasdaqgemx/   nasdaq_ise/  nasdaqise/      ◄ DUPLICATES
│   ├── nasdaq_mrx/  nasdaqmrx/     nasdaq_phlx/ nasdaqphlx/     ◄ DUPLICATES
│   ├── nyse/  nyse_equity_fix/  london_stock_exchange/  fee_london_stock_exchange/
│   ├── bloomberg/  bloomberg_mtf/  eurex/  jane_street/  lch_equity/  market_assess_mtf/
│   ├── fee_wierner_boerse/  fee_warshaw_exchange/                       ◄ TYPOS baked into keys
│   ├── fca_handbook/  finra/  sec/  reg_nfa/                            # regulators
│   └── fee-collections/  fee_extraction/  expected-outputs/             # pipeline working data
│
├── assets/avatars/avatar_1.png                  move → rulebook-prod-assets (CDN/OAC)
├── logs/{YYYY-MM-DD}/*_scraper_logs.log         move → rulebook-prod-logs
├── api-logs/%2024-%17-%0/API_LOGS.*             move + FIX broken date formatting
├── FTP_users/culvert_capital/*.xlsx          EVICT — client trade data in PUBLIC bucket
├── Testing_FTP/ · demo/ · evals/ · feedback-tasks/  move → non-prod bucket
├── stock-rulebook-staging/                    staging data inside prod → staging bucket
├── monthly_mail.env · weekly_mail.env         secrets (blocked; rotate + move to Secrets Mgr)
└── V_1 · "/" (leading-slash keys)             junk / path bug — investigate + remove
● Defects

Concrete defects found (with real evidence)

# | Defect | Evidence | Severity
D1 | Secrets in a public bucket | monthly_mail.env, weekly_mail.env (EMAIL+PASSWORD) anonymously downloadable | CRITICAL
D2 | Client trade data in a public bucket | FTP_users/culvert_capital/Culvert Capital Trades - RBC Final.xlsx | CRITICAL
D3 | Duplicate prefixes for one source (two code paths) | nasdaq_gemx/ & nasdaqgemx/ both hold full annotations/differences/parsed/processed_chunks (different dates) | HIGH
D4 | Broken date formatting in keys | api-logs/%2024-%17-%0/: unsubstituted strftime placeholders (literal % in path) | HIGH
D5 | Leading-slash keys | a top-level / prefix → //… keys from bad path join | MED
D6 | Inconsistent prefix convention | fee_london_stock_exchange/ vs london_stock_exchange/; fee_ on some sources, bare on others | MED
D7 | Casing inconsistency | FTP_users/, Testing_FTP/ vs lowercase-snake everywhere else | LOW
D8 | Typos frozen into keys | fee_wierner_boerse (should be wiener), fee_warshaw_exchange (should be warsaw), london_stcok in a log filename | LOW
D9 | Spaces in object keys | Culvert Capital Trades - RBC Final.xlsx (breaks URLs/tooling) | LOW
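The mechanically detectable defects (D4, D5, D7, D9) can be found by scanning a key listing. The sketch below is an illustration only: `key_defects` is a hypothetical helper, and the remaining defects (secrets, client data, duplicates, typos) need human review rather than a string check.

```python
from typing import List


def key_defects(key: str) -> List[str]:
    """Return the defect IDs from the table that a single object key exhibits."""
    found = []
    if "%" in key:                          # D4: literal % left in the path
        found.append("D4")
    if key.startswith("/") or "//" in key:  # D5: leading slash / empty segment
        found.append("D5")
    if key != key.lower():                  # D7: uppercase characters
        found.append("D7")
    if " " in key:                          # D9: spaces
        found.append("D9")
    return found
```

Running this over a full listing of the bucket gives a per-defect count that can be tracked down to zero as the migration proceeds.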
● Dedup

Variant → canonical mapping

Resolve each variant to a single canonical slug, copy objects, verify, delete the old prefix. Confirmed duplicates first; review-required pairs flagged.

Variants found | Canonical slug | Action
nasdaq_gemx/ + nasdaqgemx/ | nasdaq-gemx | merge (confirmed dup)
nasdaq_ise/ + nasdaqise/ | nasdaq-ise | merge (confirmed dup)
nasdaq_mrx/ + nasdaqmrx/ | nasdaq-mrx | merge (confirmed dup)
nasdaq_phlx/ + nasdaqphlx/ | nasdaq-phlx | merge (confirmed dup)
fee_london_stock_exchange/ + london_stock_exchange/ | london-stock-exchange | review then merge
fee_nyse*/ vs nyse/ / nyse_equity_fix/ | nyse, nyse-equity-fix | likely distinct feeds; confirm
fee_wierner_boerse/ | wiener-boerse | rename (typo)
fee_warshaw_exchange/ | warsaw-exchange | rename (typo)
Merge rule: when both variants hold the same date, keep the newer object and log the collision — don't blindly overwrite. Most are date-disjoint (e.g. nasdaqgemx has 2025-11, nasdaq_gemx has 2026-05), so the merge is mostly additive.
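The merge rule can be sketched as a planning step that runs before any copy. This assumes a collision means the same relative key (stage/date/filename) exists under both variants and that "newer" is judged by each object's LastModified timestamp; both are interpretations, and `plan_merge` and `Listing` are hypothetical names.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Relative key (e.g. "parsed/doc.pdf") -> LastModified of that object.
Listing = Dict[str, datetime]


def plan_merge(a: Listing, b: Listing) -> Tuple[Dict[str, str], List[str]]:
    """Decide which variant ("a" or "b") supplies each relative key.

    Returns (winners, collisions). Collisions are the keys present in both
    variants; they should be logged, and the newer object wins.
    """
    winners: Dict[str, str] = {}
    collisions: List[str] = []
    for rel in sorted(set(a) | set(b)):
        if rel in a and rel in b:
            collisions.append(rel)
            winners[rel] = "a" if a[rel] >= b[rel] else "b"
        else:
            winners[rel] = "a" if rel in a else "b"
    return winners, collisions
```

Because the plan is computed from listings alone, it can be reviewed (and the collision list signed off) before a single object is copied.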
● Target

Proposed file/folder hierarchy

One bucket per purpose; a single canonical key scheme inside the data bucket. Only the assets bucket is web-exposed (via CloudFront OAC — the bucket itself stays private).

rulebook-prod-data/            PRIVATE · versioned · IA@90d
├── exchanges/
│   └── {exchange-slug}/            # e.g. nasdaq-gemx, cboe-us-options, nyse-arca
│       ├── annotations/{YYYY}/{MM}/{DD}/{uuid}.json
│       ├── differences/{YYYY}/{MM}/{DD}/{file}
│       ├── parsed/{YYYY}/{MM}/{DD}/{file}
│       └── processed-chunks/{YYYY}/{MM}/{DD}/{file}
├── regulators/
│   └── {body-slug}/{annotations|differences|parsed}/{YYYY}/{MM}/{DD}/…   # finra, sec, fca-handbook, reg-nfa
└── pipeline/
    ├── fee-collections/ …
    ├── fee-extraction/ …
    └── expected-outputs/           # test fixtures → consider non-prod

rulebook-prod-assets/          PRIVATE bucket, served ONLY via CloudFront OAC
└── avatars/ · img/ · static/

rulebook-prod-logs/            PRIVATE · expire 30–90d
├── app/{YYYY}/{MM}/{DD}/…           # was logs/
└── api/{YYYY}/{MM}/{DD}/…           # was api-logs/ (with FIXED date formatting)

rulebook-config/               PRIVATE — or migrate to AWS Secrets Manager
└── mail/{monthly,weekly}.env       # after password rotation

rulebook-ftp/                  PRIVATE · per-user IAM/SFTP scope
└── {client-slug}/                  # was FTP_users/ (client trade data, locked down)

rulebook-nonprod/              PRIVATE · aggressive expiry
└── demo/ · evals/ · feedback-tasks/ · testing-ftp/
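To illustrate how an old key maps onto the proposed hierarchy, the sketch below rewrites an observed exchange key into its target form. It is an assumption-laden example: `migrate_key` is a hypothetical helper, `CANONICAL` contains only the confirmed merges and typo renames from the variant table, and keys without a date segment return None because the target layout requires a date that the old key does not carry.

```python
import re
from typing import Optional

# Old top-level prefix -> canonical slug (confirmed entries only).
CANONICAL = {
    "nasdaq_gemx": "nasdaq-gemx", "nasdaqgemx": "nasdaq-gemx",
    "nasdaq_ise": "nasdaq-ise", "nasdaqise": "nasdaq-ise",
    "nasdaq_mrx": "nasdaq-mrx", "nasdaqmrx": "nasdaq-mrx",
    "nasdaq_phlx": "nasdaq-phlx", "nasdaqphlx": "nasdaq-phlx",
    "fee_wierner_boerse": "wiener-boerse",
    "fee_warshaw_exchange": "warsaw-exchange",
}

# Old stage folder -> new stage folder (only processed_chunks changes).
STAGE_RENAMES = {
    "annotations": "annotations",
    "differences": "differences",
    "parsed": "parsed",
    "processed_chunks": "processed-chunks",
}

_DATE_SEG = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def migrate_key(old_key: str) -> Optional[str]:
    """Map <source>/<stage>/<YYYY-MM-DD>/<file> to the target exchanges/ layout."""
    parts = old_key.split("/")
    if len(parts) < 4:
        return None
    source, stage, day, rest = parts[0], parts[1], parts[2], parts[3:]
    m = _DATE_SEG.match(day)
    if source not in CANONICAL or stage not in STAGE_RENAMES or m is None:
        return None
    yyyy, mm, dd = m.groups()
    return "/".join(
        ["exchanges", CANONICAL[source], STAGE_RENAMES[stage], yyyy, mm, dd, *rest]
    )
```

File names are passed through unchanged here, so the result should still go through the same key validation as any new write.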
● Naming Spec

Canonical key contract (enforce in one place)

# key_builder.py: the ONLY way the app may construct an S3 key
import re
from datetime import date

STAGES = {"annotations", "differences", "parsed", "processed-chunks"}

def slug(name: str) -> str:
    # "NASDAQ GEMX" / "nasdaq_gemx"  ->  "nasdaq-gemx"
    # A glued variant such as "nasdaqgemx" has no separator to convert, so it
    # must be resolved through the variant -> canonical mapping before slugging.
    return re.sub(r'[^a-z0-9]+', '-', name.strip().lower()).strip('-')

def data_key(source: str, stage: str, d: date, filename: str) -> str:
    # Raise instead of assert: assert statements are stripped under `python -O`.
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    key = f"exchanges/{slug(source)}/{stage}/{d:%Y/%m/%d}/{filename}"
    return validate(key)

def validate(key: str) -> str:
    checks = {
        "leading slash (D5)": key.startswith('/'),
        "empty segment (D5)": '//' in key,
        "unsubstituted strftime (D4)": '%' in key,
        "space (D9)": ' ' in key,
        "uppercase (D7)": key != key.lower(),
    }
    failed = [name for name, bad in checks.items() if bad]
    if failed:
        raise ValueError(f"invalid key {key!r}: {', '.join(failed)}")
    return key

Rules

  • Lowercase, hyphen-separated slugs — one function, one source of truth.
  • Date as /YYYY/MM/DD/ path segments (clean lifecycle + prefix listing).
  • Validate on every write — fail loud, never silently create a malformed key.

Enforcement

  • Remove every ad-hoc f"{x}/" string concat at call sites.
  • Add a unit test asserting the 4 stages + the validate() guards.
  • Optional: a bucket policy denying keys with %/uppercase as a backstop.
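The unit test mentioned under Enforcement could look roughly like the sketch below. A compact stand-in for key_builder.py is inlined so the block runs on its own; in the codebase the tests would import the real module instead. The `rejects` helper is hypothetical and accepts either AssertionError or ValueError, since the choice of exception is an implementation detail of the builder.

```python
import re
from datetime import date

# --- stand-in for key_builder.py (assumption: mirrors the spec above) ---
STAGES = {"annotations", "differences", "parsed", "processed-chunks"}


def slug(name: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")


def validate(key: str) -> str:
    bad = (key.startswith("/") or "//" in key or "%" in key
           or " " in key or key != key.lower())
    if bad:
        raise ValueError(f"invalid key: {key!r}")
    return key


def data_key(source: str, stage: str, d: date, filename: str) -> str:
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return validate(f"exchanges/{slug(source)}/{stage}/{d:%Y/%m/%d}/{filename}")


# --- tests (pytest-style functions using plain asserts) ---
def rejects(fn, *args) -> bool:
    """True if calling fn(*args) fails loudly."""
    try:
        fn(*args)
    except (AssertionError, ValueError):
        return True
    return False


def test_all_four_stages_build_valid_keys():
    for stage in STAGES:
        key = data_key("NASDAQ GEMX", stage, date(2026, 5, 2), "a.json")
        assert key == f"exchanges/nasdaq-gemx/{stage}/2026/05/02/a.json"


def test_unknown_stage_is_rejected():
    # The old underscore spelling must not slip through.
    assert rejects(data_key, "nyse", "processed_chunks", date(2026, 5, 2), "a.json")


def test_validate_guards():
    for bad in ["/leading", "a//b", "api-logs/%Y-%m-%d/x.log", "a b.xlsx", "FTP_users/x"]:
        assert rejects(validate, bad)
    assert validate("exchanges/nyse/parsed/2026/05/02/a.json")
```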
● Access & Lifecycle

Per-bucket access model + lifecycle

Bucket | Public? | Who writes | Who reads | Lifecycle
rulebook-prod-data | no | app task-role (PutObject) | app + devs (GetObject) | versioning; noncurrent→IA@90d
rulebook-prod-assets | CDN only | app | public via CloudFront OAC | (not specified)
rulebook-prod-logs | no | app | devs/observability | expire @30–90d
rulebook-config | no | admin only | app task-role | versioning
rulebook-ftp | no | SFTP per-user | per-user scoped | policy-based
rulebook-nonprod | no | app (non-prod) | devs | expire @30d
Net effect: Exactly one public surface (assets, via CDN; the bucket itself stays private). Secrets and client data structurally cannot be public. Each concern has its own write principal, so a bug in one path can't expose another's data.
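The lifecycle column can be written down as data. The sketch below builds rule dictionaries in the shape accepted by the `LifecycleConfiguration` parameter of boto3's `put_bucket_lifecycle_configuration`; it only constructs the dictionaries and makes no AWS calls. The helper names and rule IDs are hypothetical, and the 90-day log expiry is an assumed pick from the 30–90d range in the table.

```python
from typing import Any, Dict


def noncurrent_to_ia(days: int) -> Dict[str, Any]:
    """Move noncurrent object versions to STANDARD_IA after `days` days."""
    return {
        "ID": f"noncurrent-to-ia-{days}d",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix = whole bucket
        "NoncurrentVersionTransitions": [
            {"NoncurrentDays": days, "StorageClass": "STANDARD_IA"}
        ],
    }


def expire_after(days: int) -> Dict[str, Any]:
    """Delete current objects `days` days after creation."""
    return {
        "ID": f"expire-{days}d",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Expiration": {"Days": days},
    }


# Bucket name -> lifecycle configuration, following the table above.
LIFECYCLE = {
    "rulebook-prod-data": {"Rules": [noncurrent_to_ia(90)]},
    "rulebook-prod-logs": {"Rules": [expire_after(90)]},  # assumption: upper bound of 30-90d
    "rulebook-nonprod": {"Rules": [expire_after(30)]},
}
```

Keeping these as plain data makes them easy to diff in review and to translate into whichever IaC tool provisions the buckets.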
● Migration Runbook

Zero-downtime migration

  1. Done: public *.env blocked. Now: rotate the leaked mail password and remediate the Culvert credentials/data exposure (D1, D2).
  2. Provision buckets (-data, -assets, -logs, -config, -ftp, -nonprod) — private, versioned, lifecycle as above. (IaC, not ClickOps.)
  3. Ship key_builder.py and route ALL writes through it → app starts writing canonical keys to the new buckets. Stops new mess immediately.
  4. Backfill + dedup per the variant map:
    aws s3 cp s3://stock-rulebook-prod/nasdaqgemx/  \
              s3://rulebook-prod-data/exchanges/nasdaq-gemx/ --recursive
    aws s3 cp s3://stock-rulebook-prod/nasdaq_gemx/ \
              s3://rulebook-prod-data/exchanges/nasdaq-gemx/ --recursive
    # repeat per source; old bucket kept READ-ONLY during transition
  5. Move assets → CloudFront OAC; repoint the CDN at rulebook-prod-assets; make the old prod bucket private.
  6. Evict demo/evals/test/FTP/staging to their buckets; delete junk (V_1, leading-slash keys, broken % log folders).
  7. Cut over reads to the new buckets; verify object counts/checksums; then decommission stock-rulebook-prod.
Owner split: App team — key_builder.py, write repoint, dedup job. DevOps — bucket IaC, CloudFront OAC, lifecycle, SFTP scoping. Sequencing: writes first (stop the bleed), backfill at your own pace, cut over reads last — no big-bang.
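The "verify object counts/checksums" check in step 7 can be sketched as a comparison of two manifests. Everything named here is an assumption for illustration: `Manifest`, `verify_migration`, and the `map_key` callback are hypothetical, and the checksum is whatever the inventory provides (an S3 ETag is not a plain MD5 for multipart uploads, so a dedicated checksum is safer).

```python
from typing import Callable, Dict, List, Optional, Tuple

# Object key -> (size in bytes, checksum string)
Manifest = Dict[str, Tuple[int, str]]


def verify_migration(
    old: Manifest,
    new: Manifest,
    map_key: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Compare the old bucket's manifest with the new one.

    map_key turns an old key into its new key, or returns None for keys that
    are deliberately not migrated (junk, evicted data).
    """
    report: Dict[str, List[str]] = {"missing": [], "mismatched": [], "unmapped": []}
    for old_key, meta in sorted(old.items()):
        new_key = map_key(old_key)
        if new_key is None:
            report["unmapped"].append(old_key)
        elif new_key not in new:
            report["missing"].append(old_key)
        elif new[new_key] != meta:
            report["mismatched"].append(old_key)
    return report
```

An empty "missing" list is the gate for decommissioning; "mismatched" entries should correspond to logged merge collisions, and "unmapped" entries to keys that were intentionally dropped.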
Codeaza Technologies · Rulebook S3 Data Store — Technical Specification · Account 058264491029 · 30 Jun 2026 · Confidential — internal