7 Data Sources
Automated multi-source AI & tech intelligence for your agents
Overview
Info Pipeline aggregates 7 data sources across English and Chinese tech platforms every day. Each item passes through keyword filtering, relevance scoring, and deduplication before being serialized into a unified JSON schema, ready for your AI researcher agents to consume.
Capabilities
Covers GitHub, Hacker News, Reddit, YouTube, Product Hunt, X/Twitter, and 6 Chinese tech platforms in a single run.
Each item is scored against a configurable keyword list. Only high-signal content passes: no noise.
Cross-source deduplication ensures the same story from multiple platforms appears only once in your report.
All 7 collectors output the same JSON schema (title, url, source, score, summary, tags) for easy downstream processing.
Everything in config.yaml: keywords, per-source limits, score thresholds, enabled platforms. Change behavior without touching code.
Pipeline outputs structured Markdown reports consumed directly by AI researcher agents for further analysis.
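As a rough illustration of that configuration surface, a config.yaml might look like the sketch below. All key names and values here are assumptions for illustration, not the project's actual file:

```yaml
# Hypothetical config.yaml sketch -- key names are illustrative assumptions
keywords:
  - llm
  - ai agent
  - rag
score_threshold: 60      # items scoring below this are dropped
sources:
  github:
    enabled: true
    limit: 30            # per-source item cap
  hackernews:
    enabled: true
    limit: 50
```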
Architecture
1. Collect: Each collector runs independently and fetches raw items from its platform using the parameters in config.yaml.
2. Filter & Score: Items are matched against global keywords and scored for relevance. Low-signal content is discarded before it reaches storage.
3. Deduplicate: URL-based and title-similarity deduplication removes duplicates across sources; the same story won't appear twice.
4. Unify Schema: All surviving items are normalized into a single JSON structure: title, url, source, score, published_at, summary, tags.
5. Report: A structured Markdown report is written to reports/, directly consumable by AI researcher agents for further analysis.
CLI Usage
# Run all 7 sources
python main.py
# Run a single source
python main.py --source github
python main.py --source hackernews
python main.py --source reddit
# List available sources
python main.py --list
Output Schema (per item)
{
"title": "...",
"url": "https://...",
"source": "github",
"score": 85,
"published_at": "2026-02-24T...",
"summary": "...",
"tags": ["llm", "open-source"]
}
Data Sources
English mainstream tech + Chinese ecosystem, both covered in a single pipeline run.
Hot repositories by topic (LLM, AI agent, RAG, MCP, diffusion) filtered by stars and recency.
Top stories filtered by score threshold: surface what the tech community is talking about today.
Multi-subreddit coverage: r/LocalLLaMA, r/MachineLearning, r/artificial, r/ChatGPT, r/ClaudeAI and more.
Latest uploads from top AI channels: Karpathy, Yannic Kilcher, Two Minute Papers, 3Blue1Brown, Fireship.
Daily new products in AI, Developer Tools, and Productivity, filtered by votes and topic.
Keyword search for AI/LLM discussions from the English-language tech community.
Zhihu · 36Kr · Juejin · SSPai · InfoQ · Bilibili tech, via trends-hub MCP integration.
Coverage
Every content type that matters for AI & tech research, all in one daily run.
GitHub Repos
Stars, forks, topics, recency
HN Discussions
Score, comments, domain
Reddit Posts
Multi-subreddit, upvote filtered
YouTube Videos
Selected AI creator channels
Product Launches
Daily PH feed, vote threshold
Tweets
Keyword search, recent timeline
Chinese Tech News
Zhihu / 36Kr / Juejin / Bilibili / InfoQ / SSPai
Coming Soon
Be the first to know when Info Pipeline launches as a managed OpenClaw plugin.