Observed bucket layout, the storage contract the application currently writes, concrete defects, and a target file/folder hierarchy with naming spec, access model, lifecycle, and a copy-level migration runbook.
Every exchange/regulator prefix follows the same 4-folder shape. This is the implicit contract; it is just not enforced or centralised.

    # Observed under every exchange prefix (e.g. fee_cboe_us_options/, fee_nyse/)
    <source>/
    ├── annotations/{YYYY-MM-DD}/{uuid32}.json     # per-day annotation blobs
    ├── differences/{YYYY-MM-DD}/latest_annotated_*.pdf
    │                            recent_annotated_*.pdf
    │                            *_changes.html
    ├── parsed/ …                                  # parsed source docs
    └── processed_chunks/ …                        # chunked text for embeddings/RAG

stock-rulebook-prod: 86 prefixes, one bucket, public-read. Annotated below with disposition: keep → data, move, evict/fix.

    stock-rulebook-prod/   (PUBLIC read on /*)
    │
    ├── ── market & regulatory DATA (~74 prefixes) ──        keep → rulebook-prod-data
    │   ├── fee_nyse/ fee_nyse_american/ fee_nyse_arca/ fee_nyse_national/ …
    │   ├── fee_cboe_us_options/ fee_cboe_bzx_equities/ fee_cboe_europe/ …   (10x cboe)
    │   ├── fee_miax_options/ fee_miax_pearl_equities/ …   (5x miax)
    │   ├── fee_euronext/ fee_six_swiss_exchange/ fee_xetra_german/ fee_aquis/ …
    │   ├── fee_turquoise/ fee_turquoise_tghe/ fee_turquoise_tghl/
    │   ├── nasdaq/ nasdaq_equity_fix/ nasdaq_options_fix/ nasdaqbx/ nasdaq_texas/
    │   ├── nasdaq_gemx/ nasdaqgemx/ nasdaq_ise/ nasdaqise/      ◄ DUPLICATES
    │   ├── nasdaq_mrx/ nasdaqmrx/ nasdaq_phlx/ nasdaqphlx/      ◄ DUPLICATES
    │   ├── nyse/ nyse_equity_fix/ london_stock_exchange/ fee_london_stock_exchange/
    │   ├── bloomberg/ bloomberg_mtf/ eurex/ jane_street/ lch_equity/ market_assess_mtf/
    │   ├── fee_wierner_boerse/ fee_warshaw_exchange/            ◄ TYPOS baked into keys
    │   ├── fca_handbook/ finra/ sec/ reg_nfa/                   # regulators
    │   └── fee-collections/ fee_extraction/ expected-outputs/   # pipeline working data
    │
    ├── assets/avatars/avatar_1.png                      move → rulebook-prod-assets (CDN/OAC)
    ├── logs/{YYYY-MM-DD}/*_scraper_logs.log             move → rulebook-prod-logs
    ├── api-logs/%2024-%17-%0/API_LOGS.*                 move + FIX broken date formatting
    ├── FTP_users/culvert_capital/*.xlsx                 EVICT: client trade data in PUBLIC bucket
    ├── Testing_FTP/ · demo/ · evals/ · feedback-tasks/  move → non-prod bucket
    ├── stock-rulebook-staging/                          staging data inside prod → staging bucket
    ├── monthly_mail.env · weekly_mail.env               secrets (blocked; rotate + move to Secrets Mgr)
    └── V_1 · "/" (leading-slash keys)                   junk / path bug: investigate + remove

| # | Defect | Evidence | Severity |
|---|---|---|---|
| D1 | Secrets in a public bucket | monthly_mail.env, weekly_mail.env (EMAIL+PASSWORD) anonymously downloadable | CRITICAL |
| D2 | Client trade data in a public bucket | FTP_users/culvert_capital/Culvert Capital Trades - RBC Final.xlsx | CRITICAL |
| D3 | Duplicate prefixes for one source (two code paths) | nasdaq_gemx/ & nasdaqgemx/ both hold full annotations/differences/parsed/processed_chunks (different dates) | HIGH |
| D4 | Broken date formatting in keys | api-logs/%2024-%17-%0/ — unsubstituted strftime placeholders (literal % in path) | HIGH |
| D5 | Leading-slash keys | a top-level / prefix → //… keys from bad path join | MED |
| D6 | Inconsistent prefix convention | fee_london_stock_exchange/ vs london_stock_exchange/; fee_ on some sources, bare on others | MED |
| D7 | Casing inconsistency | FTP_users/, Testing_FTP/ vs lowercase-snake everywhere else | LOW |
| D8 | Typos frozen into keys | fee_wierner_boerse (wiener), fee_warshaw_exchange (warsaw), london_stcok in a log filename | LOW |
| D9 | Spaces in object keys | Culvert Capital Trades - RBC Final.xlsx — breaks URLs/tooling | LOW |
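
Several of the defects above (D4, D5, D7, D9) can be detected from the key string alone, so an inventory listing can be scanned mechanically. The following is a minimal sketch, not the audit tooling actually used; `key_defects` is a hypothetical helper name, and it flags uppercase anywhere in the key as D7, which is stricter than the prefix-level casing issue the table describes.

```python
from typing import List


def key_defects(key: str) -> List[str]:
    """Return the defect IDs (from the table above) that a raw S3 key exhibits."""
    found = []
    if "%" in key:                             # D4: unsubstituted strftime placeholder
        found.append("D4")
    if key.startswith("/") or "//" in key:     # D5: leading slash / empty path segment
        found.append("D5")
    if key != key.lower():                     # D7: uppercase characters in the key
        found.append("D7")
    if " " in key:                             # D9: spaces in the key
        found.append("D9")
    return found


if __name__ == "__main__":
    for k in (
        "api-logs/%2024-%17-%0/API_LOGS.txt",  # hypothetical file extension
        "FTP_users/culvert_capital/Culvert Capital Trades - RBC Final.xlsx",
        "fee_nyse/annotations/2025-11-03/a.json",
    ):
        print(k, key_defects(k))
```

Defects that depend on meaning rather than syntax (D1, D2, D3, D6, D8) still need a human pass or an explicit lookup table.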
Resolve each variant to a single canonical slug, copy objects, verify, delete the old prefix. Confirmed duplicates first; review-required pairs flagged.
| Variants found | Canonical slug | Action |
|---|---|---|
| nasdaq_gemx/ + nasdaqgemx/ | nasdaq-gemx | merge (confirmed dup) |
| nasdaq_ise/ + nasdaqise/ | nasdaq-ise | merge (confirmed dup) |
| nasdaq_mrx/ + nasdaqmrx/ | nasdaq-mrx | merge (confirmed dup) |
| nasdaq_phlx/ + nasdaqphlx/ | nasdaq-phlx | merge (confirmed dup) |
| fee_london_stock_exchange/ + london_stock_exchange/ | london-stock-exchange | review then merge |
| fee_nyse*/ vs nyse/ / nyse_equity_fix/ | nyse, nyse-equity-fix | likely distinct feeds; confirm |
| fee_wierner_boerse/ | wiener-boerse | rename (typo) |
| fee_warshaw_exchange/ | warsaw-exchange | rename (typo) |

The duplicate prefixes hold different dates (nasdaqgemx has 2025-11, nasdaq_gemx has 2026-05), so the merge is mostly additive.

One bucket per purpose; a single canonical key scheme inside the data bucket. Only the assets bucket is web-exposed, and only via CloudFront OAC (the bucket itself stays private).

    rulebook-prod-data/                 PRIVATE · versioned · IA@90d
    ├── exchanges/
    │   └── {exchange-slug}/            # e.g. nasdaq-gemx, cboe-us-options, nyse-arca
    │       ├── annotations/{YYYY}/{MM}/{DD}/{uuid}.json
    │       ├── differences/{YYYY}/{MM}/{DD}/{file}
    │       ├── parsed/{YYYY}/{MM}/{DD}/{file}
    │       └── processed-chunks/{YYYY}/{MM}/{DD}/{file}
    ├── regulators/
    │   └── {body-slug}/{annotations|differences|parsed}/{YYYY}/{MM}/{DD}/…   # finra, sec, fca-handbook, reg-nfa
    └── pipeline/
        ├── fee-collections/ …
        ├── fee-extraction/ …
        └── expected-outputs/ …         # test fixtures → consider non-prod

    rulebook-prod-assets/               PRIVATE bucket, served ONLY via CloudFront OAC
    └── avatars/ · img/ · static/

    rulebook-prod-logs/                 PRIVATE · expire 30–90d
    ├── app/{YYYY}/{MM}/{DD}/…          # was logs/
    └── api/{YYYY}/{MM}/{DD}/…          # was api-logs/ (with FIXED date formatting)

    rulebook-config/                    PRIVATE, or migrate to AWS Secrets Manager
    └── mail/{monthly,weekly}.env       # after password rotation

    rulebook-ftp/                       PRIVATE · per-user IAM/SFTP scope
    └── {client-slug}/ …                # was FTP_users/; client trade data, locked down

    rulebook-nonprod/                   PRIVATE · aggressive expiry
    └── demo/ · evals/ · feedback-tasks/ · testing-ftp/


    # key_builder.py: the ONLY way the app may construct an S3 key
    import re
    from datetime import date

    def slug(name: str) -> str:
        # "NASDAQ GEMX" / "nasdaq_gemx" -> "nasdaq-gemx"
        # ("nasdaqgemx" has no separator to split on, so it stays "nasdaqgemx"
        #  and needs an explicit alias to reach the canonical slug)
        return re.sub(r'[^a-z0-9]+', '-', name.strip().lower()).strip('-')

    def data_key(source: str, stage: str, d: date, filename: str) -> str:
        assert stage in {"annotations", "differences", "parsed", "processed-chunks"}
        key = f"exchanges/{slug(source)}/{stage}/{d:%Y/%m/%d}/{filename}"
        return validate(key)

    def validate(key: str) -> str:
        assert not key.startswith('/')   # D5: no leading slash
        assert '//' not in key           # D5: no empty segments
        assert '%' not in key            # D4: no unsubstituted strftime
        assert ' ' not in key            # D9: no spaces
        assert key == key.lower()        # D7: lowercase only
        return key

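
The regex in `slug` only replaces separators that already exist, so a run-together legacy name such as `nasdaqgemx` comes out unchanged rather than as `nasdaq-gemx`, and the typo prefixes keep their typos. One way to close that gap is an explicit alias table consulted after slugging. This is a sketch under assumptions: `LEGACY_ALIASES` and `canonical_slug` are hypothetical names, and the entries are limited to the confirmed merges and renames from the canonical-slug table (pairs marked for review are deliberately left out).

```python
import re

# Legacy spellings that slugging alone cannot normalise.
# Keys are the slugged legacy prefix; values are the canonical slug.
LEGACY_ALIASES = {
    "nasdaqgemx": "nasdaq-gemx",
    "nasdaqise": "nasdaq-ise",
    "nasdaqmrx": "nasdaq-mrx",
    "nasdaqphlx": "nasdaq-phlx",
    "fee-wierner-boerse": "wiener-boerse",
    "fee-warshaw-exchange": "warsaw-exchange",
}


def slug(name: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")


def canonical_slug(name: str) -> str:
    """Slug the name, then map known legacy spellings to their canonical slug."""
    s = slug(name)
    return LEGACY_ALIASES.get(s, s)


if __name__ == "__main__":
    for raw in ("NASDAQ GEMX", "nasdaq_gemx", "nasdaqgemx", "fee_wierner_boerse/"):
        print(raw, "->", canonical_slug(raw))
```

Keeping the table explicit, rather than stripping `fee_` automatically, avoids silently merging pairs such as `fee_nyse*/` and `nyse/` that may be distinct feeds.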
- Dates as /YYYY/MM/DD/ path segments (clean lifecycle + prefix listing).
- No f"{x}/" string concat at call sites.
- validate() rejects leading slashes, empty segments, spaces, and %/uppercase as a backstop.

| Bucket | Public? | Who writes | Who reads | Lifecycle |
|---|---|---|---|---|
| rulebook-prod-data | no | app task-role (PutObject) | app + devs (GetObject) | versioning; noncurrent→IA@90d |
| rulebook-prod-assets | CDN only | app | public via CloudFront OAC | none |
| rulebook-prod-logs | no | app | devs/observability | expire @30–90d |
| rulebook-config | no | admin only | app task-role | versioning |
| rulebook-ftp | no | SFTP per-user | per-user scoped | policy-based |
| rulebook-nonprod | no | app (non-prod) | devs | expire @30d |
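
To pin down what the Lifecycle column means in practice, the rules can be written out in the shape of an S3 lifecycle configuration (the structure boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`). This is only an illustrative sketch: the real definition belongs in IaC, `lifecycle_config` and the rule IDs are hypothetical names, and the 90-day log expiry is an assumed pick from the 30–90d range.

```python
from typing import Optional


def lifecycle_config(noncurrent_ia_days: Optional[int] = None,
                     expire_days: Optional[int] = None) -> dict:
    """Build a bucket-wide lifecycle configuration dict (no AWS calls are made)."""
    rules = []
    if noncurrent_ia_days is not None:
        # Move noncurrent object versions to Infrequent Access after N days.
        rules.append({
            "ID": "noncurrent-to-ia",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = whole bucket
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": noncurrent_ia_days, "StorageClass": "STANDARD_IA"}
            ],
        })
    if expire_days is not None:
        # Delete current objects N days after creation.
        rules.append({
            "ID": "expire-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Expiration": {"Days": expire_days},
        })
    return {"Rules": rules}


LIFECYCLE = {
    "rulebook-prod-data": lifecycle_config(noncurrent_ia_days=90),
    "rulebook-prod-logs": lifecycle_config(expire_days=90),  # assumption: upper end of 30-90d
    "rulebook-nonprod": lifecycle_config(expire_days=30),
}
```

The noncurrent-version rule only has an effect once versioning is enabled on the bucket, which matches the "versioning" entry for rulebook-prod-data.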

1. Already done: *.env blocked. Now: rotate the leaked mail password + the Culvert credentials/data exposure (D1, D2).
2. Create the new buckets (-data, -assets, -logs, -config, -ftp, -nonprod): private, versioned, lifecycle as above. (IaC, not ClickOps.)
3. Add key_builder.py and route ALL writes through it → app starts writing canonical keys to the new buckets. Stops new mess immediately.
4. Backfill existing objects by copying each legacy prefix to its canonical location:

       aws s3 cp s3://stock-rulebook-prod/nasdaqgemx/ \
           s3://rulebook-prod-data/exchanges/nasdaq-gemx/ --recursive
       aws s3 cp s3://stock-rulebook-prod/nasdaq_gemx/ \
           s3://rulebook-prod-data/exchanges/nasdaq-gemx/ --recursive
       # repeat per source; old bucket kept READ-ONLY during transition
       # note: --recursive preserves the legacy suffix ({YYYY-MM-DD}/, processed_chunks/),
       # so copied keys still need rewriting to the canonical scheme

5. Cut over reads to the new buckets, with assets served from rulebook-prod-assets; make the old prod bucket private.
6. Remove the junk (V_1, leading-slash keys, broken % log folders).
7. Retire stock-rulebook-prod.

Ownership: app developers take key_builder.py, write repoint, dedup job; DevOps takes bucket IaC, CloudFront OAC, lifecycle, SFTP scoping. Sequencing: writes first (stop the bleed), backfill at your own pace, cut over reads last; no big-bang.
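
A recursive `aws s3 cp` keeps everything after the source prefix unchanged, so backfilled objects would still carry the legacy `{YYYY-MM-DD}` folder and the `processed_chunks` stage name instead of the target `{YYYY}/{MM}/{DD}` and `processed-chunks`. A per-object key rewrite for the backfill could look like the sketch below. It is an assumption-laden illustration: `rewrite_key` is a hypothetical helper, it covers only the `exchanges/` branch, and it assumes every stage uses a date folder as its third segment, which the observed layout confirms only for annotations/ and differences/. Keys that do not fit are returned as `None` for manual review rather than guessed.

```python
import re
from typing import Optional

_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")

# Legacy stage folder -> canonical stage folder.
_STAGES = {
    "annotations": "annotations",
    "differences": "differences",
    "parsed": "parsed",
    "processed_chunks": "processed-chunks",
}


def rewrite_key(old_key: str, canonical_slug: str) -> Optional[str]:
    """Map <legacy-prefix>/<stage>/<YYYY-MM-DD>/<rest> to the canonical key.

    Returns None when the key does not match that shape.
    """
    parts = old_key.split("/")
    if len(parts) < 4:
        return None
    _legacy_prefix, stage, day, *rest = parts
    match = _DATE.match(day)
    if stage not in _STAGES or match is None or not all(rest):
        return None
    yyyy, mm, dd = match.groups()
    return "/".join(["exchanges", canonical_slug, _STAGES[stage], yyyy, mm, dd, *rest])


if __name__ == "__main__":
    print(rewrite_key("nasdaqgemx/annotations/2025-11-03/abc.json", "nasdaq-gemx"))
    print(rewrite_key("nasdaq_gemx/parsed/doc.pdf", "nasdaq-gemx"))  # None: no date folder
```

Because both legacy prefixes map to the same destination, comparing the rewritten keys from each side before copying shows whether any object would be overwritten, which is the check behind the "mostly additive" merge.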