merge: V0.3.8 (#213)

* rss and search study

* add entry py and env files for Windows (#202)

* add Windows-compatible entry py and env files

* rename the entry py files to distinguish between operating systems

* update the V0.3.7 Windows entry py and remove personal information from the windows.env file

* bug fix: wechat-py

* feature: v0.3.8 (rss search engine, focuspoint as task)

* v0.3.8 preview

* v0.3.8 release

---------

Signed-off-by: bigbrother666 <96130569+bigbrother666sh@users.noreply.github.com>
Co-authored-by: c469591 <74401447+c469591@users.noreply.github.com>
bigbrother666 2025-01-24 19:52:16 +08:00 committed by GitHub
parent 7534c520f7
commit b1ef7a23d1
34 changed files with 947 additions and 508 deletions

.gitignore

@ -11,3 +11,4 @@ pb/pocketbase
core/docker_dir/
core/work_dir/
test/webpage_samples/
weixin_mp/work_dir/

CHANGELOG.md

@ -1,3 +1,22 @@
# V0.3.8
- 增加对 RSS 信源的支持
add support for RSS source
- 支持为关注点指定信源,并且可以为每个关注点增加搜索引擎作为信源
support to specify source for each focus point, and add search engine as source
- 进一步优化信息提取策略(每次只处理一个关注点)
Further optimized information extraction strategy (processing one focus point at a time)
- 优化入口逻辑,简化并合并启动方案 (感谢 @c469591 贡献windows版本启动脚本)
Optimized entry logic, simplified and merged startup solutions (thanks to @c469591 for contributing Windows startup script)
# V0.3.7
- 新增通过wxbot方案获取微信公众号订阅消息信源(不是很优雅,但已是目前能找到的最佳方案)
@ -16,41 +35,59 @@
Provided a custom extractor interface to allow users to customize according to actual needs.
- bug 修复以及其他改进(crawl4ai浏览器生命周期管理,异步 llm wrapper 等)
- bug 修复以及其他改进(crawl4ai浏览器生命周期管理,异步 llm wrapper 等)(感谢 @tusik 贡献异步 llm wrapper)
Bug fixes and other improvements (crawl4ai browser lifecycle management, asynchronous llm wrapper, etc.)
Thanks to @tusik for contributing the asynchronous LLM wrapper
# V0.3.6
- 改用 Crawl4ai 作为底层爬虫框架其实Crawl4ai 和 Crawlee 的获取效果差别不大,二者也都是基于 Playwright ,但 Crawl4ai 的 html2markdown 功能很实用而这对llm 信息提取作用很大,另外 Crawl4ai 的架构也更加符合我的思路;
Switched to Crawl4ai as the underlying web crawling framework. Although Crawl4ai and Crawlee both rely on Playwright with similar fetching results, Crawl4ai's html2markdown feature is quite practical for LLM information extraction. Additionally, Crawl4ai's architecture better aligns with my design philosophy.
- 在 Crawl4ai 的 html2markdown 基础上,增加了 deep scraper进一步把页面的独立链接与正文进行区分便于后一步 llm 的精准提取。由于html2markdown和deep scraper已经将原始网页数据做了很好的清理极大降低了llm所受的干扰和误导保证了最终结果的质量同时也减少了不必要的 token 消耗;
Built upon Crawl4ai's html2markdown, we added a deep scraper to further differentiate standalone links from the main content, facilitating more precise LLM extraction. The preprocessing done by html2markdown and deep scraper significantly cleans up raw web data, minimizing interference and misleading information for LLMs, ensuring higher quality outcomes while reducing unnecessary token consumption.
*列表页面和文章页面的区分是所有爬虫类项目都头痛的地方,尤其是现代网页往往习惯在文章页面的侧边栏和底部增加大量推荐阅读,使得二者几乎不存在文本统计上的特征差异。*
*这一块我本来想用视觉大模型进行 layout 分析,但最终实现起来发现获取不受干扰的网页截图是一件会极大增加程序复杂度并降低处理效率的事情……*
*Distinguishing between list pages and article pages is a common challenge in web scraping projects, especially when modern webpages often include extensive recommended readings in sidebars and footers of articles, making it difficult to differentiate them through text statistics.*
*Initially, I considered using large visual models for layout analysis, but found that obtaining undistorted webpage screenshots greatly increases program complexity and reduces processing efficiency...*
- 重构了提取策略、llm 的 prompt 等;
Restructured extraction strategies and LLM prompts;
*有关 prompt 我想说的是,我理解好的 prompt 是清晰的工作流指导,每一步都足够明确,明确到很难犯错。但我不太相信过于复杂的 prompt 的价值,这个很难评估,如果你有更好的方案,欢迎提供 PR*
*Regarding prompts, I believe that a good prompt serves as clear workflow guidance, with each step being explicit enough to minimize errors. However, I am skeptical about the value of overly complex prompts, which are hard to evaluate. If you have better solutions, feel free to submit a PR.*
- 引入视觉大模型,自动在提取前对高权重(目前由 Crawl4ai 评估权重)图片进行识别,并补充相关信息到页面文本中;
Introduced large visual models to automatically recognize high-weight images (currently evaluated by Crawl4ai) before extraction and append relevant information to the page text;
- 继续减少 requirement.txt 的依赖项,目前不需要 json_repair了实践中也发现让 llm 按 json 格式生成,还是会明显增加处理时间和失败率,因此我现在采用更简单的方式,同时增加对处理结果的后处理)
Continued to reduce dependencies in requirement.txt; json_repair is no longer needed (in practice, having LLMs generate JSON format still noticeably increases processing time and failure rates, so I now adopt a simpler approach with additional post-processing of results)
- pb info 表单的结构做了小调整,增加了 web_title 和 reference 两项。
Made minor adjustments to the pb info form structure, adding web_title and reference fields.
- @ourines 贡献了 install_pocketbase.sh 脚本
@ourines contributed the install_pocketbase.sh script
- @ibaoger 贡献了 windows 下的pocketbase 安装脚本
@ibaoger contributed the pocketbase installation script for Windows
- docker运行方案被暂时移除了感觉大家用起来也不是很方便……
Docker running solution has been temporarily removed as it wasn't very convenient for users...
# V0.3.5
- 引入 Crawlee(playwright模块),大幅提升通用爬取能力,适配实际项目场景;

README.md

@ -1,8 +1,8 @@
# 首席情报官Wiseflow
# AI首席情报官Wiseflow
**[English](README_EN.md) | [日本語](README_JP.md) | [한국어](README_KR.md)**
🚀 **AI情报官**(Wiseflow)是一个敏捷的信息挖掘工具,可以从各种给定信源中,依靠大模型的思考与分析能力,精准抓取特定信息,全程无需人工参与。
🚀 **AI首席情报官**(Wiseflow)是一个敏捷的信息挖掘工具,可以从各种给定信源中,依靠大模型的思考与分析能力,精准抓取特定信息,全程无需人工参与。
**我们缺的不是信息,而是从海量信息中过滤噪音,从而让有价值的信息显露出来**
@ -10,25 +10,30 @@
https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b
## 🔥 V0.3.7 来了
## 🔥 V0.3.8 正式发布
本次升级带来了 wxbot 的整合方案,方便大家添加微信公众号作为信源,具体见 [weixin_mp/README.md](./weixin_mp/README.md)
- V0.3.8版本引入对 RSS、搜索引擎的支持现在 wiseflow 支持 _网站_、_rss_、_搜索引擎_ 和 _微信公众号_ 四类信源啦!
我们也提供了专门针对微信公众号文章的提取器,同时也设计了自定义提取器接口,方便用户根据实际需求进行定制
- 产品策略上改为按关注点指定信源,也就是可以为不同信源指定不同关注点了,实测同等模型下可以进一步提高信息提取的准确率
本次升级也进一步强化了信息提取能力不仅极大优化了页面中链接的分析还使得7b、14b 这种规模的模型也能比较好的完成基于复杂关注点explanation中包含时间、指标限制这种的提取
- 优化入口程序,为 MacOS&Linux 和 Windows 用户分别提供单一的启动脚本,方便大家使用
另外本次升级还适配了 Crawl4ai 0.4.247 版本,以及做了诸多程序改进,具体见 [CHANGELOG.md](./CHANGELOG.md)
有关本次升级更多内容请见 [CHANGELOG.md](./CHANGELOG.md)
感谢如下社区贡献者在这一阶段的 PR
**V0.3.8版本搜索引擎使用智谱bigmodel开放平台提供的服务需要在 .env 中增加ZHIPU_API_KEY**
- @ourines 贡献了 install_pocketbase.sh 脚本 (docker运行方案被暂时移除了感觉大家用起来也不是很方便……)
- @ibaoger 贡献了 windows 下的pocketbase 安装脚本
- @tusik 贡献了异步 llm wrapper
**V0.3.8版本对pocketbase的表单结构做了调整老用户请先在 pb 文件夹下执行一次 ./pocketbase migrate**
**V0.3.7版本再次引入SECONDARY_MODEL这主要是为了降低使用成本**
V0.3.8是一个稳定版本,原计划的 V0.3.9 需要积累更多社区的反馈以决定升级方向,因此需要等待较长时间。
### V0.3.7 测试报告
感谢如下社区成员在 V0.3.5~V0.3.8 版本中的 PR
- @ourines 贡献了 install_pocketbase.sh自动化安装脚本
- @ibaoger 贡献了 windows下的pocketbase自动化安装脚本
- @tusik 贡献了异步 llm wrapper 同时发现了AsyncWebCrawler生命周期的问题
- @c469591 贡献了 windows版本启动脚本
### 🌟测试报告
在最新的提取策略下我们发现7b 这种规模的模型也能很好的执行链接分析与提取任务,测试结果请参考 [report](./test/reports/wiseflow_report_v037_bigbrother666/README.md)
@ -39,12 +44,6 @@ https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b
现阶段,**提交测试结果等同于提交项目代码**同样会被接纳为contributor甚至受邀参加商业化项目具体请参考 [test/README.md](./test/README.md)
🌟**V0.3.x 计划**
- ~~尝试支持微信公众号免wxbot订阅V0.3.7);【已完成】~~
- 引入对 RSS 信息源和搜索引擎的支持V0.3.8;
- 尝试部分支持社交平台V0.3.9)。
## ✋ wiseflow 与传统的爬虫工具、AI搜索、知识库RAG项目有何不同
wiseflow自2024年6月底发布 V0.3.0版本来受到了开源社区的广泛关注,甚至吸引了不少自媒体的主动报道,在此首先表示感谢!
@ -104,12 +103,12 @@ wiseflow 是 LLM 原生应用,请务必保证为程序提供稳定的 LLM 服
siliconflow硅基流动提供大部分主流开源模型的在线 MaaS 服务,凭借着自身的加速推理技术积累,其服务速度和价格方面都有很大优势。使用 siliconflow 的服务时,.env的配置可以参考如下
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://api.siliconflow.cn/v1"
export PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
export SECONDARY_MODEL="Qwen/Qwen2.5-7B-Instruct"
export VL_MODEL="OpenGVLab/InternVL2-26B"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://api.siliconflow.cn/v1"
PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
SECONDARY_MODEL="Qwen/Qwen2.5-14B-Instruct"
VL_MODEL="OpenGVLab/InternVL2-26B"
```
😄 如果您愿意,可以使用我的[siliconflow邀请链接](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92)这样我也可以获得更多token奖励 🌹
@ -119,12 +118,12 @@ export VL_MODEL="OpenGVLab/InternVL2-26B"
如果您的信源多为非中文页面,且也不要求提取出的 info 为中文,那么更推荐您使用 openai、claude、gemini 等海外闭源商业模型。您可以尝试第三方代理 **AiHubMix**,支持国内网络环境直连、支付宝便捷支付,免去封号风险。
使用 AiHubMix 的模型时,.env的配置可以参考如下
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://aihubmix.com/v1" # 具体参考 https://doc.aihubmix.com/
export PRIMARY_MODEL="gpt-4o"
export SECONDARY_MODEL="gpt-4o-mini"
export VL_MODEL="gpt-4o"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://aihubmix.com/v1" # 具体参考 https://doc.aihubmix.com/
PRIMARY_MODEL="gpt-4o"
SECONDARY_MODEL="gpt-4o-mini"
VL_MODEL="gpt-4o"
```
😄 欢迎使用 [AiHubMix邀请链接](https://aihubmix.com?aff=Gp54) 注册 🌹
@ -133,22 +132,31 @@ export VL_MODEL="gpt-4o"
以 Xinference 为例,.env 配置可以参考如下:
```bash
```
# LLM_API_KEY='' 本地服务无需这一项,请注释掉或删除
export LLM_API_BASE='http://127.0.0.1:9997'
export PRIMARY_MODEL=启动的模型 ID
export VL_MODEL=启动的模型 ID
LLM_API_BASE='http://127.0.0.1:9997'
PRIMARY_MODEL=启动的模型 ID
VL_MODEL=启动的模型 ID
```
#### 3.2 pocketbase 账号密码配置
```bash
export PB_API_AUTH="test@example.com|1234567890"
```
PB_API_AUTH="test@example.com|1234567890"
```
这里pocketbase 数据库的 superuser 用户名和密码,记得用 | 分隔 (如果 install_pocketbase.sh 脚本执行成功,这一项应该已经存在了)
#### 3.3 其他可选配置
#### 3.3 智谱bigmodel平台key设置用于搜索引擎服务
```
ZHIPU_API_KEY=Your_API_KEY
```
申请地址:https://bigmodel.cn/ (目前免费)
#### 3.4 其他可选配置
下面的都是可选配置:
- #VERBOSE="true"
@ -167,12 +175,9 @@ export PB_API_AUTH="test@example.com|1234567890"
用于控制 llm 的并发请求数量不设定默认是1开启前请确保 llm provider 支持设定的并发,本地大模型慎用,除非你对自己的硬件基础有信心)
感谢 @tusik 贡献的异步 llm wrapper
### 4. 运行程序
✋ V0.3.5版本架构和依赖与之前版本有较大不同请务必重新拉取代码删除或重建pb_data
推荐使用 conda 构建虚拟环境(当然你也可以忽略这一步,或者使用其他 python 虚拟环境方案)
```bash
@ -186,40 +191,50 @@ conda activate wiseflow
cd wiseflow
cd core
pip install -r requirements.txt
chmod +x run.sh
./run_task.sh # if you just want to scan sites one-time (no loop), use ./run.sh
```
🌟 这个脚本会自动判断 pocketbase 是否已经在运行,如果未运行,会自动拉起。但是请注意,当你 ctrl+c 或者 ctrl+z 终止进程时pocketbase 进程不会被终止直到你关闭terminal。
之后 MacOS&Linux 用户执行
run_task.sh 会周期性执行爬取-提取任务(启动时会立即先执行一次,之后每隔一小时启动一次), 如果仅需执行一次,可以使用 run.sh 脚本。
```bash
chmod +x run.sh
./run.sh
```
### 5. **关注点和定时扫描信源添加**
Windows 用户执行
```bash
python windows_run.py
```
以上脚本会自动判断 pocketbase 是否已经在运行,如果未运行,会自动拉起。但是请注意,当你 ctrl+c 或者 ctrl+z 终止进程时pocketbase 进程不会被终止直到你关闭terminal。
run.sh 会先对所有已经激活activated 设定为 true的信源执行一次爬取任务之后以小时为单位按设定的频率周期执行。
### 5. 关注点和信源添加
启动程序后打开pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/)
#### 5.1 打开 focus_point 表单
#### 5.1 打开 sites表单
通过这个表单可以配置信源,注意:信源需要在下一步的 focus_point 表单中被选择。
sites 字段说明:
- url, 信源的url,信源无需给定具体文章页面,给文章列表页面即可。
- type, 类型web 或者 rss。
#### 5.2 打开 focus_point 表单
通过这个表单可以指定你的关注点LLM会按此提炼、过滤并分类信息。
字段说明:
- focuspoint, 关注点描述(必填),如”上海小升初信息“、”加密货币价格“
- explanation关注点的详细解释或具体约定如 “仅限上海市官方发布的初中升学信息”、“BTC、ETH 的现价、涨跌幅数据“等
- activated, 是否激活。如果关闭则会忽略该关注点,关闭后可再次开启。
注意focus_point 更新设定(包括 activated 调整)后,**需要重启程序才会生效。**
#### 5.2 打开 sites表单
通过这个表单可以指定自定义信源,系统会启动后台定时任务,在本地执行信源扫描、解析和分析。
sites 字段说明:
- url, 信源的url信源无需给定具体文章页面给文章列表页面即可。
- per_hours, 扫描频率单位为小时类型为整数1~24范围我们建议扫描频次不要超过一天一次即设定为24
- activated, 是否激活。如果关闭则会忽略该信源,关闭后可再次开启。
**sites 的设定调整,无需重启程序。**
- focuspoint, 关注点描述(必填),如“上海小升初信息”、“招标通知”
- explanation, 关注点的详细解释或具体约定,如“仅限上海市官方发布的初中升学信息”、“发布日期在2025年1月1日之后且金额100万以上的”等
- activated, 是否激活。如果关闭则会忽略该关注点,关闭后可再次开启
- per_hour, 爬取频率,单位为小时,类型为整数(1~24范围,我们建议扫描频次不要超过一天一次,即设定为24)
- search_engine, 每次爬取是否开启搜索引擎
- sites, 选择对应的信源
**注意V0.3.8版本后,配置的调整无需重启程序,会在下一次执行时自动生效。**
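Besides the Admin dashboard, both forms can also be filled from a script through the python-pocketbase SDK referenced at the end of this README. The sketch below is illustrative rather than wiseflow's own code: it assumes a local PocketBase on the default port and the PB_API_AUTH superuser from section 3.2, the example URL and focus point text are made up, and on newer PocketBase versions the superuser login may go through the `_superusers` collection instead.

```python
from pocketbase import PocketBase  # python-pocketbase SDK

client = PocketBase("http://127.0.0.1:8090")
# superuser login with the PB_API_AUTH credentials ("email|password");
# depending on the PocketBase/SDK version this may instead be
# client.collection("_superusers").auth_with_password(...)
client.admins.auth_with_password("test@example.com", "1234567890")

# a source: a list page or an RSS feed, no individual article URLs needed
site = client.collection("sites").create({
    "url": "https://example.com/feed.xml",  # illustrative URL
    "type": "rss",                          # "web" or "rss"
})

# a focus point bound to that source, with the search engine enabled
client.collection("focus_points").create({
    "focuspoint": "招标通知",
    "explanation": "发布日期在2025年1月1日之后",
    "activated": True,
    "per_hour": 24,          # crawl at most once a day
    "search_engine": True,   # also query the Zhipu search service
    "sites": [site.id],      # relation to the sites record above
})
```

Since V0.3.8 such changes are picked up automatically on the next run, as noted above.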
## 📚 如何在您自己的程序中使用 wiseflow 抓取出的数据
@ -255,6 +270,7 @@ PocketBase作为流行的轻量级数据库目前已有 Go/Javascript/Python
- crawl4aiOpen-source LLM Friendly Web Crawler & Scraper https://github.com/unclecode/crawl4ai
- pocketbase (Open Source realtime backend in 1 file) https://github.com/pocketbase/pocketbase
- python-pocketbase (pocketBase client SDK for python) https://github.com/vaphes/pocketbase
- feedparser (Parse feeds in Python) https://github.com/kurtmckee/feedparser
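For the data-usage section above, here is a minimal read-side sketch with the python-pocketbase SDK just listed. The credentials, port and page size are illustrative; the field names (url, url_title, tag, content, references) follow the info records written by the code in this commit, and records expose collection fields as attributes in this SDK.

```python
from pocketbase import PocketBase

client = PocketBase("http://127.0.0.1:8090")
client.admins.auth_with_password("test@example.com", "1234567890")  # PB_API_AUTH credentials

# newest extracted infos; each record carries url, url_title, tag (focus_point id),
# content (prefixed with "//author publish_date//") and references ({"[n]": url})
page = client.collection("infos").get_list(1, 20, {"sort": "-created"})
for info in page.items:
    print(info.url_title, info.url)
    print(info.content)
```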
本项目开发受 [GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor)、[AutoCrawler](https://github.com/kingname/AutoCrawler) 、[SeeAct](https://github.com/OSU-NLP-Group/SeeAct) 启发。

README_EN.md

@ -1,8 +1,8 @@
# Chief Intelligence Officer (Wiseflow)
# AI Chief Intelligence Officer (Wiseflow)
**[简体中文](README.md) | [日本語](README_JP.md) | [한국어](README_KR.md)**
🚀 **AI Intelligence Officer** (Wiseflow) is an agile information mining tool that can precisely extract specific information from various given sources by leveraging the thinking and analytical capabilities of large models, requiring no human intervention throughout the process.
🚀 **AI Chief Intelligence Officer** (Wiseflow) is an agile information mining tool that can precisely extract specific information from various given sources by leveraging the thinking and analytical capabilities of large models, requiring no human intervention throughout the process.
**What we lack is not information, but the ability to filter out noise from massive information, thereby revealing valuable information.**
@ -10,25 +10,30 @@
https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b
## 🔥 V0.3.7 is Here
## 🔥 V0.3.8 Officially Released
This upgrade brings wxbot integration solution, making it convenient for everyone to add WeChat Official Accounts as information sources. For details, see [weixin_mp/README.md](./weixin_mp/README.md)
- Version V0.3.8 introduces support for RSS and search engines. Now wiseflow supports four types of information sources: _websites_, _rss_, _search engines_, and _WeChat Official Accounts_!
We have also provided extractors specifically designed for WeChat Official Account articles, while also designing custom extractor interfaces to allow users to customize according to their actual needs.
- The product strategy has been changed to specify information sources based on focus points, meaning different focus points can be assigned to different information sources. Tests show this can further improve information extraction accuracy with the same model.
This upgrade further strengthens information extraction capabilities, not only greatly optimizing the analysis of links within pages but also enabling models of 7b and 14b scale to better complete extractions based on complex focus points (such as those containing time and metric restrictions in explanations).
- The entry program has been optimized, providing single startup scripts for both MacOS/Linux and Windows users for easier usage.
Additionally, this upgrade adapts to Crawl4ai version 0.4.247 and makes many program improvements. For details, see [CHANGELOG.md](./CHANGELOG.md)
For more details about this upgrade, please see [CHANGELOG.md](./CHANGELOG.md)
Thanks to the following community contributors for their PRs during this phase:
**V0.3.8 version uses the service provided by Zhipu bigmodel open platform for search engine functionality. You need to add ZHIPU_API_KEY in .env file**
- @ourines contributed the install_pocketbase.sh script (docker running solution has been temporarily removed as it wasn't very convenient for users...)
- @ibaoger contributed the pocketbase installation script for Windows
- @tusik contributed the asynchronous llm wrapper
**V0.3.8 version has made adjustments to the PocketBase form structure. Existing users should execute ./pocketbase migrate once in the pb folder**
**V0.3.7 version reintroduces SECONDARY_MODEL, mainly to reduce usage costs**
V0.3.8 is a stable version. The originally planned V0.3.9 needs to accumulate more community feedback to determine the upgrade direction, so it will take longer to release.
### V0.3.7 Test Report
Thanks to the following community members for their PRs in versions V0.3.5~V0.3.8:
- @ourines contributed the install_pocketbase.sh automated installation script
- @ibaoger contributed the PocketBase automated installation script for Windows
- @tusik contributed the asynchronous llm wrapper and discovered the AsyncWebCrawler lifecycle issue
- @c469591 contributed the Windows version startup script
### 🌟 Test Report
Under the latest extraction strategy, we found that models of 7b scale can also perform link analysis and extraction tasks well. For test results, please refer to [report](./test/reports/wiseflow_report_v037_bigbrother666/README.md)
@ -38,14 +43,6 @@ We continue to welcome more test results to jointly explore the best usage solut
At this stage, **submitting test results is equivalent to submitting project code**, and will similarly be accepted as a contributor, and may even be invited to participate in commercialization projects! For details, please refer to [test/README.md](./test/README.md)
🌟**V0.3.x Roadmap**
- ~~Attempt to support WeChat Official Account subscription without wxbot (V0.3.7);Done~~
- Introduce support for RSS feeds and search engines (V0.3.8);
- Attempt partial support for social platforms (V0.3.9).
## ✋ How is wiseflow Different from Traditional Crawler Tools, AI Search, and Knowledge Base (RAG) Projects?
Since the release of version V0.3.0 in late June 2024, wiseflow has received widespread attention from the open-source community, attracting even some self-media reports. First of all, we would like to express our gratitude!
@ -105,12 +102,12 @@ Wiseflow is a LLM native application, so please ensure you provide stable LLM se
Siliconflow provides online MaaS services for most mainstream open-source models. With its accumulated acceleration inference technology, its service has great advantages in both speed and price. When using siliconflow's service, the .env configuration can refer to the following:
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://api.siliconflow.cn/v1"
export PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
export SECONDARY_MODEL="Qwen/Qwen2.5-7B-Instruct"
export VL_MODEL="OpenGVLab/InternVL2-26B"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://api.siliconflow.cn/v1"
PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
SECONDARY_MODEL="Qwen/Qwen2.5-14B-Instruct"
VL_MODEL="OpenGVLab/InternVL2-26B"
```
😄 If you'd like, you can use my [siliconflow referral link](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92), which will help me earn more token rewards 🌹
@ -121,12 +118,12 @@ If your information sources are mostly non-Chinese pages and you don't require t
When using AiHubMix models, the .env configuration can refer to the following:
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://aihubmix.com/v1" # refer to https://doc.aihubmix.com/
export PRIMARY_MODEL="gpt-4o"
export SECONDARY_MODEL="gpt-4o-mini"
export VL_MODEL="gpt-4o"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://aihubmix.com/v1" # refer to https://doc.aihubmix.com/
PRIMARY_MODEL="gpt-4o"
SECONDARY_MODEL="gpt-4o-mini"
VL_MODEL="gpt-4o"
```
😄 Welcome to register using the [AiHubMix referral link](https://aihubmix.com?aff=Gp54) 🌹
@ -135,22 +132,31 @@ export VL_MODEL="gpt-4o"
Taking Xinference as an example, the .env configuration can refer to the following:
```bash
```
# LLM_API_KEY='' no need for local service, please comment out or delete
export LLM_API_BASE='http://127.0.0.1:9997'
export PRIMARY_MODEL=launched_model_id
export VL_MODEL=launched_model_id
LLM_API_BASE='http://127.0.0.1:9997'
PRIMARY_MODEL=launched_model_id
VL_MODEL=launched_model_id
```
#### 3.2 Pocketbase Account and Password Configuration
```bash
export PB_API_AUTH="test@example.com|1234567890"
```
PB_API_AUTH="test@example.com|1234567890"
```
This is where you set the superuser username and password for the pocketbase database; remember to separate them with | (if the install_pocketbase.sh script executed successfully, this should already exist)
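The value is a single string with the two parts joined by `|`. A minimal sketch of how a script of your own could read it; the variable names are illustrative and this is not wiseflow's internal code:

```python
import os

# PB_API_AUTH is "superuser_email|password"; split on the first "|" only,
# in case the password itself contains that character
email, password = os.environ["PB_API_AUTH"].split("|", 1)
```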
#### 3.3 Other Optional Configurations
#### 3.3 Zhipu (bigmodel) Platform Key Configuration (for Search Engine Services)
```
ZHIPU_API_KEY=Your_API_KEY
```
(Apply at https://bigmodel.cn/; currently free)
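In this release the search step goes through the `run_v4_async` helper imported from `utils.zhipu_search` (see the code diff later in this commit). A minimal usage sketch, run from the `core` directory with ZHIPU_API_KEY set; the query and logger name are made up, and the result keys mirror the fields the new code reads:

```python
import asyncio
from utils.general_utils import get_logger
from utils.zhipu_search import run_v4_async

async def demo():
    # hypothetical logger name; the empty project dir mirrors the default in the main process
    logger = get_logger('zhipu_demo', '')
    search_intent, search_content = await run_v4_async("bid announcements 2025", _logger=logger)

    intent = search_intent['search_intent'][0]
    print(intent['intent'], intent['keywords'])

    for result in search_content['search_result']:
        # each result exposes link / title / media / content
        print(result['link'], result['media'], result['title'])

asyncio.run(demo())
```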
#### 3.4 Other Optional Configurations
The following are all optional configurations:
- #VERBOSE="true"
@ -169,8 +175,6 @@ The following are all optional configurations:
Used to control the number of concurrent LLM requests. Default is 1 if not set (before enabling, please ensure your LLM provider supports the configured concurrency. Use local large models with caution unless you are confident in your hardware capabilities)
Thanks to @tusik for contributing the asynchronous LLM wrapper
### 4. Running the Program
✋ The V0.3.5 version architecture and dependencies are significantly different from previous versions. Please make sure to re-pull the code, delete (or rebuild) pb_data
@ -188,40 +192,50 @@ then
cd wiseflow
cd core
pip install -r requirements.txt
chmod +x run.sh
./run_task.sh # if you just want to scan sites one-time (no loop), use ./run.sh
```
🌟 This script will automatically determine if pocketbase is already running. If not, it will automatically start. However, please note that when you terminate the process with ctrl+c or ctrl+z, the pocketbase process will not be terminated until you close the terminal.
Afterwards, MacOS&Linux users execute
run_task.sh will periodically execute crawling-extraction tasks (it will execute immediately at startup, then every hour after that). If you only need to execute once, you can use the run.sh script.
```bash
chmod +x run.sh
./run.sh
```
### 5. **Adding Focus Points and Scheduled Scanning of Information Sources**
Windows users execute
```bash
python windows_run.py
```
- This script will automatically determine if pocketbase is already running, and if not, it will automatically start it. However, please note that when you terminate the process with ctrl+c or ctrl+z, the pocketbase process will not be terminated until you close the terminal.
- run.sh will first execute a crawling task for all sources that have been activated (set to true), and then execute periodically at the set frequency in hours.
### 5. Focus Points and Source Addition
After starting the program, open the pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/)
#### 5.1 Open the focus_point Form
#### 5.1 Opening the Sites Form
Through this form, you can specify your focus points, and LLM will refine, filter, and categorize information accordingly.
This form allows you to configure sources, noting that sources must be selected in the subsequent focus_point form.
Field description:
- focuspoint, focus point description (required), such as "Shanghai elementary to junior high school information," "cryptocurrency prices"
- explanation, detailed explanation or specific conventions of the focus point, such as "Only official junior high school admission information released by Shanghai" or "Current price, price change data of BTC, ETH"
- activated, whether to activate. If closed, this focus point will be ignored, and it can be re-enabled later.
Sites field explanations:
- url, the URL of the source, the source does not need to provide specific article pages, just the article list pages.
- type, the type, either web or rss.
Note: After updating the focus_point settings (including activated adjustments), **the program needs to be restarted for the changes to take effect.**
#### 5.2 Opening the Focus Point Form
#### 5.2 Open the sites Form
This form allows you to specify your focus points, and LLM will refine, filter, and categorize information based on this.
Through this form, you can specify custom information sources. The system will start background scheduled tasks to scan, parse, and analyze the information sources locally.
sites field description:
- url, the URL of the information source. The information source does not need to be given a specific article page, just the article list page.
- per_hours, scanning frequency, in hours, integer type (1~24 range, we recommend not exceeding once a day, i.e., set to 24)
- activated, whether to activate. If closed, this information source will be ignored, and it can be re-enabled later.
**Adjustments to sites settings do not require restarting the program.**
Field explanations:
- focuspoint, focus point description (required), such as "Christmas holiday discount information" or "bid announcement"
- explanation, detailed explanation or specific convention of the focus point, such as "xx series products" or "published after January 1, 2025, and with an amount over 10 million" etc.
- activated, whether to activate. If closed, this focus point will be ignored, and can be reopened after closing.
- per_hour, crawl frequency, in hours, integer type (1-24 range, we recommend not to exceed once a day, i.e., set to 24)
- search_engine, whether to enable the search engine for each crawl
- sites, select the corresponding source
**Note: After version V0.3.8, adjustments to configurations do not require restarting the program, and will automatically take effect at the next execution.**
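Because configuration changes are picked up on the next run, records can also be toggled from a script instead of the Admin dashboard. A hedged sketch with the python-pocketbase SDK linked at the end of this README: it assumes the local PocketBase and the PB_API_AUTH superuser, the focus point name is illustrative, and on newer PocketBase versions the superuser login may go through the `_superusers` collection instead.

```python
from pocketbase import PocketBase  # python-pocketbase SDK

client = PocketBase("http://127.0.0.1:8090")
client.admins.auth_with_password("test@example.com", "1234567890")

# pause a focus point; since V0.3.8 this takes effect on the next scheduled run, no restart needed
matches = client.collection("focus_points").get_list(
    1, 1, {"filter": 'focuspoint="bid announcement"'}
).items
if matches:
    client.collection("focus_points").update(matches[0].id, {"activated": False})
```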
## 📚 How to Use the Data Crawled by wiseflow in Your Own Program
@ -257,6 +271,7 @@ If you have any questions or suggestions, please feel free to leave a message vi
- crawl4aiOpen-source LLM Friendly Web Crawler & Scraper https://github.com/unclecode/crawl4ai
- pocketbase (Open Source realtime backend in 1 file) https://github.com/pocketbase/pocketbase
- python-pocketbase (pocketBase client SDK for python) https://github.com/vaphes/pocketbase
- feedparser (Parse feeds in Python) https://github.com/kurtmckee/feedparser
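To complement the python-pocketbase pointer above, a minimal read-side sketch that pulls the infos extracted for one focus point. The tag field holds the focus_point record id, and content/references follow the records written by the code in this commit; credentials and the focus point name are illustrative, and records expose collection fields as attributes in this SDK.

```python
from pocketbase import PocketBase

client = PocketBase("http://127.0.0.1:8090")
client.admins.auth_with_password("test@example.com", "1234567890")  # PB_API_AUTH credentials

focus = client.collection("focus_points").get_list(
    1, 1, {"filter": 'focuspoint="bid announcement"'}
).items[0]

# infos written for this focus point: url, url_title,
# content (prefixed with "//author publish_date//") and references ({"[n]": url})
infos = client.collection("infos").get_list(1, 50, {"filter": f'tag="{focus.id}"'}).items
for info in infos:
    print(info.url_title, info.url)
    print(info.content)
```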
Also inspired by [GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor) [AutoCrawler](https://github.com/kingname/AutoCrawler) [SeeAct](https://github.com/OSU-NLP-Group/SeeAct) .

README_JP.md

@ -1,8 +1,8 @@
# 首席情報官Wiseflow
# AIチーフインテリジェンスオフィサーWiseflow
**[English](README_EN.md) | [简体中文](README.md) | [한국어](README_KR.md)**
🚀 **AI情報官**Wiseflowは、大規模言語モデルの思考・分析能力を活用して、様々な情報源から特定の情報を正確に抽出できる俊敏な情報マイニングツールです。プロセス全体で人間の介入を必要としません。
🚀 **AIチーフインテリジェンスオフィサー**Wiseflowは、大規模言語モデルの思考・分析能力を活用して、様々な情報源から特定の情報を正確に抽出できる俊敏な情報マイニングツールです。プロセス全体で人間の介入を必要としません。
**私たちが欠けているのは情報ではなく、大量の情報からノイズをフィルタリングし、価値ある情報を明らかにすることです**
@ -10,25 +10,30 @@
https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b
## 🔥 V0.3.7 がリリースされました
## 🔥 V0.3.8 正式リリース
今回のアップグレードでは、WeChat公式アカウントを情報源として追加できるwxbotの統合ソリューションを提供しました。詳細は[weixin_mp/README.md](./weixin_mp/README.md)をご覧ください。
- V0.3.8バージョンではRSSと検索エンジンのサポートを導入し、現在wiseflowは_ウェブサイト_、_RSS_、_検索エンジン_、_WeChat公式アカウント_の4種類の情報源をサポートしています
また、WeChat公式アカウントの記事専用の抽出ツールを提供し、ユーザーが実際のニーズに応じてカスタマイズできるカスタム抽出インターフェースも設計しました
- 製品戦略を変更し、関心点ごとに情報源を指定できるようになりました。これにより、同じモデルでも情報抽出の精度をさらに向上させることが実証されています
今回のアップグレードでは、情報抽出機能もさらに強化されました。ページ内のリンク分析を大幅に最適化しただけでなく、7b、14bクラスのモデルでも複雑な注目点explanationに時間や指標の制限などを含むに基づく抽出を適切に実行できるようになりました。
- エントリープログラムを最適化し、MacOS/LinuxとWindowsユーザーそれぞれに単一の起動スクリプトを提供し、使いやすさを向上させました。
さらに、今回のアップグレードではCrawl4ai 0.4.247バージョンに対応し、多くプログラム改善を行いました。詳細は[CHANGELOG.md](./CHANGELOG.md)をご覧ください。
今回のアップグレードの詳細について[CHANGELOG.md](./CHANGELOG.md)をご覧ください。
この段階で以下のコミュニティ貢献者のPRに感謝いたします
**V0.3.8バージョンの検索エンジンは、智譜bigmodelオープンプラットフォームのサービスを使用しています。.envファイルにZHIPU_API_KEYを追加する必要があります**
- @ourines がinstall_pocketbase.shスクリプトを提供dockerでの実行方法は一時的に削除されました。使いにくいと感じられたため...
- @ibaoger がWindows用のpocketbaseインストールスクリプトを提供
- @tusik が非同期llmラッパーを提供
**V0.3.8バージョンではpocketbaseのフォーム構造を調整しました。既存のユーザーは一度./pocketbase migrateを実行してください pbフォルダ内**
**V0.3.7バージョンでSECONDARY_MODELを再導入しました。これは主に使用コストを削減するためです**
V0.3.8は安定版です。当初計画されていたV0.3.9は、コミュニティからのフィードバックをさらに蓄積してアップグレードの方向性を決定する必要があるため、リリースまでに時間がかかります。
### V0.3.7 テストレポート
以下のコミュニティメンバーに、V0.3.5V0.3.8バージョンでのPRに感謝します
- @ourines install_pocketbase.sh自動インストールスクリプトの提供
- @ibaoger Windows用pocketbase自動インストールスクリプトの提供
- @tusik 非同期llm wrapperの提供とAsyncWebCrawlerのライフサイクル問題の発見
- @c469591 Windows版起動スクリプトの提供
### 🌟 テストレポート
最新の抽出戦略では、7bクラスのモデルでもリンク分析と抽出タスクを適切に実行できることがわかりました。テスト結果は[report](./test/reports/wiseflow_report_v037_bigbrother666/README.md)をご参照ください。
@ -39,13 +44,6 @@ https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b
現段階では、**テスト結果の提出はプロジェクトコードの提出と同等**とみなされ、contributorとして受け入れられ、商業化プロジェクトへの参加招待も受けられます詳細は[test/README.md](./test/README.md)をご参照ください。
🌟**V0.3.x プラン**
- ~~WeChat公式アカウントのwxbotなしでの購読をサポートするV0.3.7done~~
- RSS情報源と検索エンジンのサポートを導入するV0.3.8;
- 部分的なソーシャルプラットフォームのサポートを試みるV0.3.9)。
## ✋ wiseflow と従来のクローラーツール、AI検索、知識ベースRAGプロジェクトの違いは何ですか
wiseflowは2024年6月末にV0.3.0バージョンをリリースして以来、オープンソースコミュニティから広く注目を集めており、さらに多くのメディアが自発的に報道してくれました。ここでまず感謝の意を表します!
@ -105,12 +103,12 @@ Wiseflowは LLMネイティブアプリケーションですので、プログ
Siliconflowは、主流のオープンソースモデルのほとんどにオンラインMaaSサービスを提供しています。蓄積された推論加速技術により、そのサービスは速度と価格の両面で大きな利点があります。siliconflowのサービスを使用する場合、.envの設定は以下を参考にしてください
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://api.siliconflow.cn/v1"
export PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
export SECONDARY_MODEL="Qwen/Qwen2.5-7B-Instruct"
export VL_MODEL="OpenGVLab/InternVL2-26B"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://api.siliconflow.cn/v1"
PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
SECONDARY_MODEL="Qwen/Qwen2.5-14B-Instruct"
VL_MODEL="OpenGVLab/InternVL2-26B"
```
😄 よろしければ、私の[siliconflow紹介リンク](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92)をご利用ください。これにより、私もより多くのトークン報酬を獲得できます 🌹
@ -121,12 +119,12 @@ export VL_MODEL="OpenGVLab/InternVL2-26B"
AiHubMixモデルを使用する場合、.envの設定は以下を参考にしてください
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://aihubmix.com/v1" # referhttps://doc.aihubmix.com/
export PRIMARY_MODEL="gpt-4o"
export SECONDARY_MODEL="gpt-4o-mini"
export VL_MODEL="gpt-4o"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://aihubmix.com/v1" # referhttps://doc.aihubmix.com/
PRIMARY_MODEL="gpt-4o"
SECONDARY_MODEL="gpt-4o-mini"
VL_MODEL="gpt-4o"
```
😄 [AiHubMixの紹介リンク](https://aihubmix.com?aff=Gp54)からご登録いただけますと幸いです 🌹
@ -134,22 +132,30 @@ export VL_MODEL="gpt-4o"
Xinferenceを例にすると、.envの設定は以下を参考にできます
```bash
```
# LLM_API_KEY='' no need for local service, please comment out or delete
export LLM_API_BASE='http://127.0.0.1:9997'
export PRIMARY_MODEL=launched_model_id
export VL_MODEL=launched_model_id
LLM_API_BASE='http://127.0.0.1:9997'
PRIMARY_MODEL=launched_model_id
VL_MODEL=launched_model_id
```
#### 3.2 Pocketbaseのアカウントとパスワードの設定
```bash
export PB_API_AUTH="test@example.com|1234567890"
```
PB_API_AUTH="test@example.com|1234567890"
```
これはpocketbaseデータベースのスーパーユーザー名とパスワードを設定する場所です。|で区切ることを忘れないでくださいinstall_pocketbase.shスクリプトが正常に実行された場合、これは既に存在しているはずです
#### 3.3 その他のオプション設定
#### 3.3 智谱bigmodelプラットフォームキーの設定検索エンジンサービスに使用
```
ZHIPU_API_KEY=Your_API_KEY
```
申請先https://bigmodel.cn/ 現在無料)
#### 3.4 その他のオプション設定
以下はすべてオプションの設定です:
- #VERBOSE="true"
@ -168,8 +174,6 @@ export PB_API_AUTH="test@example.com|1234567890"
llm の同時リクエスト数を制御するために使用されます。デフォルトは1ですllm provider が設定された同時性をサポートしていることを確認してください。ローカル大規模モデルはハードウェアベースに自分がない限り慎重に使用してください)
@tusik に感謝します
### 4. プログラムの実行
✋ V0.3.5バージョンのアーキテクチャと依存関係は以前のバージョンと大きく異なります。必ずコードを再取得し、pb_dataを削除または再構築してください。
@ -187,40 +191,50 @@ conda activate wiseflow
cd wiseflow
cd core
pip install -r requirements.txt
chmod +x run.sh
./run_task.sh # if you just want to scan sites one-time (no loop), use ./run.sh
```
🌟 このスクリプトは、pocketbaseが既に実行されているかどうかを自動的に判断します。実行されていない場合は自動的に起動します。ただし、ctrl+cまたはctrl+zでプロセスを終了した場合、ターミナルを閉じるまでpocketbaseプロセスは終了しないことに注意してください。
その後、MacOS&Linuxユーザーは実行します
run_task.shは定期的にクローリング・抽出タスクを実行します起動時に即座に実行され、その後1時間ごとに実行されます。1回だけ実行する必要がある場合は、run.shスクリプトを使用できます。
```bash
chmod +x run.sh
./run.sh
```
Windowsユーザーは実行します
### 5. **関心事と定期的なスキャン情報源の追加**
```bash
python windows_run.py
```
プログラムを起動した後、pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/) を開きます
- このスクリプトはpocketbaseが既に実行されているかどうかを自動的に判断し、実行されていない場合は自動で起動します。ただし、注意してください、ctrl+cまたはctrl+zでプロセスを終了させた場合、pocketbaseプロセスは終了しないため、terminalを閉じるまでです。
#### 5.1 focus_point フォームを開く
- run.shはまず、activatedがtrueに設定されたすべての信源に対して一度のクロールタスクを実行し、その後、設定された頻率で時間単位で周期的に実行します。
このフォームを使用して、あなたの関心事を指定できます。LLMはこれに基づいて情報を抽出、フィルタリング、分類します。
### 5. フォーカスポイントと情報源の追加
フィールド説明:
- focuspoint, 関心事の説明(必須)、例えば「上海の小学校から中学校への情報」、「暗号通貨価格」
- explanation関心事の詳細な説明または具体的な約束、例えば「上海市公式発表の中学校進学情報のみ」、「BTC、ETHの現在価格、変動率データ」など
- activated, 有効化するかどうか。無効にするとこの関心事は無視され、無効にした後再び有効にできます。
プログラムを起動した後、pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/) を開いてください。
注意focus_pointの更新設定activatedの調整を含む後、**プログラムを再起動する必要があります。**
#### 5.1 サイトフォームを開く
#### 5.2 sitesフォームを開く
このフォームを使用して情報源を構成できます。注意:情報源は次のステップの focus_point フォームで選択する必要があります。
このフォームを使用して、カスタム情報源を指定できます。システムはバックグラウンドで定期的なタスクを開始し、ローカルで情報源のスキャン、解析、分析を実行します。
サイトフィールドの説明:
- url, 情報源のurl。情報源には具体的な記事ページを指定する必要はありません。記事リストページを指定してください。
- type, タイプ。web または rss。
sitesフィールド説明
- url, 情報源のurlで、情報源は具体的な記事ページを指定する必要はありません。記事リストページを指定するだけで十分です。
- per_hours, スキャン頻度で、単位は時間、整数型1~24の範囲、スキャン頻度は1日1回を超えないように、つまり24に設定することをお勧めします
- activated, 有効化するかどうか。無効にするとこの情報源は無視され、無効にした後再び有効にできます。
#### 5.2 フォーカスポイントフォームを開く
**sitesの設定調整は、プログラムを再起動する必要はありません。**
このフォームを使用してあなたのフォーカスポイントを指定できます。LLMはこれに基づいて情報を抽出、フィルタリング、分類します。
フィールドの説明:
- focuspoint, フォーカスポイントの説明(必須)。例:”新年セールの割引“、”入札通知“
- explanationフォーカスポイントの詳細な説明または具体的な約束。例”xxシリーズの製品”、”2025年1月1日以降に発行され、100万円以上の“
- activated, アクティブ化するかどうか。オフにすると、そのフォーカスポイントは無視されます。オフにした後、再度オンにできます。
- per_hour, クロール頻度。単位は時間で、整数型1~24の範囲。クロール頻度を1日1回を超えないように設定することをお勧めします。つまり、24に設定します。
- search_engine, クロール時に検索エンジンをオンにするかどうか
- sites対応する情報源を選択
**注意V0.3.8以降、設定の調整はプログラムを再起動する必要はありません。次回の実行時に自動的に適用されます。**
## 📚 あなた自身のプログラムでwiseflowがクロールしたデータをどのように使用するか
@ -257,6 +271,7 @@ PocketBaseは人気のある軽量データベースで、現在Go/Javascript/Py
- crawl4aiOpen-source LLM Friendly Web Crawler & Scraper https://github.com/unclecode/crawl4ai
- pocketbase (Open Source realtime backend in 1 file) https://github.com/pocketbase/pocketbase
- python-pocketbase (pocketBase client SDK for python) https://github.com/vaphes/pocketbase
- feedparser (Parse feeds in Python) https://github.com/kurtmckee/feedparser
また、[GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor)、[AutoCrawler](https://github.com/kingname/AutoCrawler)、[SeeAct](https://github.com/OSU-NLP-Group/SeeAct) からもインスピレーションを受けています。

README_KR.md

@ -1,8 +1,8 @@
# 수석 정보 책임자 (Wiseflow)
# AI 수석 정보 책임자 (Wiseflow)
**[English](README_EN.md) | [日本語](README_JP.md) | [简体中文](README.md)**
🚀 **수석 정보 책임자** (Wiseflow)는 대규모 언어 모델의 사고 및 분석 능력을 활용하여 다양한 정보원에서 특정 정보를 정확하게 추출할 수 있는 민첩한 정보 마이닝 도구입니다. 전체 과정에서 인간의 개입이 필요하지 않습니다.
🚀 **AI 수석 정보 책임자** (Wiseflow)는 대규모 언어 모델의 사고 및 분석 능력을 활용하여 다양한 정보원에서 특정 정보를 정확하게 추출할 수 있는 민첩한 정보 마이닝 도구입니다. 전체 과정에서 인간의 개입이 필요하지 않습니다.
**우리가 부족한 것은 정보가 아니라, 방대한 정보 속에서 노이즈를 필터링하여 가치 있는 정보를 드러내는 것입니다.**
@ -10,25 +10,30 @@
https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b
## 🔥 V0.3.7이 출시되었습니다
## 🔥 V0.3.8 공식 출시
이번 업그레이드는 WeChat 공식 계정을 정보 소스로 추가할 수 있도록 wxbot 통합 솔루션을 제공합니다. 자세한 내용은 [weixin_mp/README.md](./weixin_mp/README.md)를 참조하세요.
- V0.3.8 버전에서는 RSS 및 검색 엔진에 대한 지원이 도입되어, 이제 wiseflow는 _웹사이트_, _rss_, _검색 엔진_, _WeChat 공식 계정_의 네 가지 정보원을 지원합니다!
또한 WeChat 공식 계정 게시물을 위한 전용 추출기를 제공하고, 사용자가 실제 요구 사항에 따라 커스터마이즈할 수 있도록 사용자 정의 추출기 인터페이스를 설계했습니다.
- 제품 전략이 관심 지점에 따라 정보원을 지정하는 방식으로 변경되어, 서로 다른 정보원에 대해 서로 다른 관심 지점을 지정할 수 있게 되었습니다. 동일한 모델에서 정보 추출의 정확성을 더욱 향상시킬 수 있음을 실험을 통해 확인했습니다.
이번 업그레이드는 정보 추출 기능을 더욱 강화했습니다. 페이지 내 링크 분석을 크게 최적화했을 뿐만 아니라, 7b, 14b와 같은 규모의 모델도 복잡한 관심 포인트(시간, 지표 제한 등을 포함하는 설명)를 기반으로 한 추출을 잘 수행할 수 있게 되었습니다.
- MacOS&Linux와 Windows 사용자 각각에게 단일 시작 스크립트를 제공하여, 시작 프로그램을 최적화하고 사용 편의성을 높였습니다.
또한 이번 업그레이드는 Crawl4ai 0.4.247 버전을 지원하고 많은 프로그램 개선을 했습니다. 자세한 내용은 [CHANGELOG.md](./CHANGELOG.md)를 참조하세요.
이번 업그레이드에 대한 자세한 내용은 [CHANGELOG.md](./CHANGELOG.md)를 참조하세요.
이 단계에서 다음 커뮤니티 기여자들의 PR에 감사드립니다:
**V0.3.8 버전의 검색 엔진은 Zhipu bigmodel 오픈 플랫폼에서 제공하는 서비스를 사용하며, .env 파일에 ZHIPU_API_KEY를 추가해야 합니다**
- @ourines가 install_pocketbase.sh 스크립트를 기여했습니다 (docker 실행 방식은 사용이 불편하다고 판단되어 임시로 제거되었습니다...)
- @ibaoger가 Windows용 pocketbase 설치 스크립트를 기여했습니다
- @tusik가 비동기 llm wrapper를 기여했습니다
**V0.3.8 버전에서는 pocketbase의 폼 구조가 조정되었으므로, 기존 사용자는 ./pocketbase migrate를 한 번 실행해야 합니다 (pb 폴더 내)**
**V0.3.7 버전은 사용 비용을 낮추기 위해 SECONDARY_MODEL을 다시 도입했습니다**
V0.3.8은 안정적인 버전입니다. 원래 계획된 V0.3.9는 커뮤니티의 피드백을 더 많이 수집하여 업그레이드 방향을 결정해야 하므로 출시까지 시간이 더 걸릴 것입니다.
### V0.3.7 테스트 보고서
다음 커뮤니티 멤버들이 V0.3.5~V0.3.8 버전에서 PR을 기여해 주셨습니다:
- @ourines는 install_pocketbase.sh 자동 설치 스크립트를 기여했습니다
- @ibaoger는 Windows용 pocketbase 자동 설치 스크립트를 기여했습니다
- @tusik는 비동기 llm wrapper를 기여하고 AsyncWebCrawler의 수명 주기 문제를 발견했습니다
- @c469591는 Windows 버전 시작 스크립트를 기여했습니다
### 🌟 테스트 보고서
최신 추출 전략에서, 7b와 같은 규모의 모델도 링크 분석 및 추출 작업을 잘 수행할 수 있다는 것을 발견했습니다. 테스트 결과는 [report](./test/reports/wiseflow_report_v037_bigbrother666/README.md)를 참조하세요.
@ -38,14 +43,6 @@ https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b
현재 단계에서는 **테스트 결과 제출이 프로젝트 코드 제출과 동등하게 취급**되며, contributor로 인정받을 수 있고, 심지어 상업화 프로젝트에 초대될 수도 있습니다! 자세한 내용은 [test/README.md](./test/README.md)를 참조하세요.
🌟**V0.3.x 계획**
- ~~WeChat 공개 계정 wxbot 없이 구독 지원 (V0.3.7); Done ~~
- RSS 정보 소스 및 검색 엔진 지원 도입 (V0.3.8);
- 일부 사회적 플랫폼 지원 시도 (V0.3.9).
## ✋ wiseflow는 전통적인 크롤러 도구, AI 검색, 지식 베이스(RAG) 프로젝트와 어떻게 다를까요?
wiseflow는 2024년 6월 말 V0.3.0 버전 출시 이후 오픈소스 커뮤니티의 광범위한 관심을 받았으며, 심지어 많은 자체 미디어의 자발적인 보도까지 이끌어냈습니다. 이에 먼저 감사의 말씀을 전합니다!
@ -81,7 +78,7 @@ chmod +x install_pocketbase
./install_pocketbase
```
**windows users please execute [install_pocketbase.ps1](./install_pocketbase.ps1) script**
**윈도우 사용자는 [install_pocketbase.ps1](./install_pocketbase.ps1) 스크립트를 실행하세요**
Wiseflow 0.3.x는 데이터베이스로 pocketbase를 사용합니다. pocketbase 클라이언트를 수동으로 다운로드할 수도 있습니다(버전 0.23.4를 다운로드하여 [pb](./pb) 디렉토리에 배치하는 것을 잊지 마세요). 그리고 수퍼유저를 수동으로 생성할 수 있습니다(.env 파일에 저장하는 것을 잊지 마세요).
@ -105,12 +102,12 @@ Wiseflow는 LLM 네이티브 애플리케이션이므로 프로그램에 안정
Siliconflow는 대부분의 주류 오픈소스 모델에 대한 온라인 MaaS 서비스를 제공합니다. 축적된 추론 가속화 기술로 속도와 가격 모두에서 큰 장점이 있습니다. siliconflow의 서비스를 사용할 때 .env 구성은 다음을 참조할 수 있습니다:
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://api.siliconflow.cn/v1"
export PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
export SECONDARY_MODEL="Qwen/Qwen2.5-7B-Instruct"
export VL_MODEL="OpenGVLab/InternVL2-26B"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://api.siliconflow.cn/v1"
PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
SECONDARY_MODEL="Qwen/Qwen2.5-14B-Instruct"
VL_MODEL="OpenGVLab/InternVL2-26B"
```
😄 원하신다면 제 [siliconflow 추천 링크](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92)를 사용하실 수 있습니다. 이를 통해 제가 더 많은 토큰 보상을 받을 수 있습니다 🌹
@ -121,12 +118,12 @@ export VL_MODEL="OpenGVLab/InternVL2-26B"
AiHubMix 모델을 사용할 때 .env 구성은 다음을 참조할 수 있습니다:
```bash
export LLM_API_KEY=Your_API_KEY
export LLM_API_BASE="https://aihubmix.com/v1" # refer https://doc.aihubmix.com/
export PRIMARY_MODEL="gpt-4o"
export SECONDARY_MODEL="gpt-4o-mini"
export VL_MODEL="gpt-4o"
```
LLM_API_KEY=Your_API_KEY
LLM_API_BASE="https://aihubmix.com/v1" # refer https://doc.aihubmix.com/
PRIMARY_MODEL="gpt-4o"
SECONDARY_MODEL="gpt-4o-mini"
VL_MODEL="gpt-4o"
```
😄 Welcome to register using the [AiHubMix referral link](https://aihubmix.com?aff=Gp54) 🌹
@ -135,22 +132,31 @@ export VL_MODEL="gpt-4o"
Xinference를 예로 들면, .env 구성은 다음을 참조할 수 있습니다:
```bash
```
# LLM_API_KEY='' no need for local service, please comment out or delete
export LLM_API_BASE='http://127.0.0.1:9997'
export PRIMARY_MODEL=launched_model_id
export VL_MODEL=launched_model_id
LLM_API_BASE='http://127.0.0.1:9997'
PRIMARY_MODEL=launched_model_id
VL_MODEL=launched_model_id
```
#### 3.2 Pocketbase Account and Password Configuration
```bash
export PB_API_AUTH="test@example.com|1234567890"
```
PB_API_AUTH="test@example.com|1234567890"
```
여기서 pocketbase 데이터베이스의 슈퍼유저 사용자 이름과 비밀번호를 설정합니다. |로 구분하는 것을 잊지 마세요 (install_pocketbase.sh 스크립트가 성공적으로 실행되었다면 이미 존재할 것입니다)
#### 3.3 기타 선택적 구성
#### 3.3 智谱bigmodel플랫폼 키 설정(검색 엔진 서비스에 사용)
```
ZHIPU_API_KEY=Your_API_KEY
```
(신청 주소https://bigmodel.cn/ 현재 무료로 제공 중입니다)
#### 3.4 기타 선택적 구성
다음은 모두 선택적 구성입니다:
- #VERBOSE="true"
@ -169,8 +175,6 @@ export PB_API_AUTH="test@example.com|1234567890"
llm 동시 요청 수를 제어하는 데 사용됩니다. 설정하지 않으면 기본값은 1입니다(활성화하기 전에 llm 제공자가 설정된 동시성을 지원하는지 확인하세요. 로컬 대규모 모델은 하드웨어 기반에 자신이 있지 않는 한 신중하게 사용하세요)
@tusik이 기여한 비동기 llm wrapper에 감사드립니다
### 4. 프로그램 실행
✋ V0.3.5 버전의 아키텍처와 종속성은 이전 버전과 크게 다릅니다. 코드를 다시 가져오고, pb_data를 삭제(또는 재구축)하도록 하세요
@ -188,39 +192,51 @@ conda activate wiseflow
cd wiseflow
cd core
pip install -r requirements.txt
chmod +x run.sh
./run_task.sh # if you just want to scan sites one-time (no loop), use ./run.sh
```
🌟 이 스크립트는 pocketbase가 이미 실행 중인지 자동으로 확인합니다. 실행 중이 아닌 경우 자동으로 시작됩니다. 단, ctrl+c 또는 ctrl+z로 프로세스를 종료할 때 터미널을 닫을 때까지 pocketbase 프로세스는 종료되지 않는다는 점에 유의하세요.
이후 MacOS&Linux 사용자는 실행합니다
run_task.sh는 주기적으로 크롤링-추출 작업을 실행합니다(시작 시 즉시 실행되고 그 후 매시간마다 실행됨). 한 번만 실행하면 되는 경우 run.sh 스크립트를 사용할 수 있습니다.
```bash
chmod +x run.sh
./run.sh
```
### 5. **관심사 및 정기 스캔 정보 소스 추가**
Windows 사용자는 실행합니다
프로그램을 시작한 후, pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/)를 여세요.
```bash
python windows_run.py
```
#### 5.1 focus_point 폼 열기
- 이 스크립트는 pocketbase가 이미 실행 중인지 자동으로 판단하고, 실행 중이 아니면 자동으로 시작합니다. 그러나 주의해주세요, ctrl+c 또는 ctrl+z로 프로세스를 종료하더라도, pocketbase 프로세스는 종료되지 않습니다. 터미널을 닫을 때까지입니다.
이 폼을 통해 귀하의 관심사를 지정할 수 있으며, LLM은 이를 기반으로 정보를 추출, 필터링 및 분류합니다.
- run.sh는 먼저 활성화된 (activated가 true로 설정된) 모든 정보원에 대해 한 번의 크롤링 작업을 실행한 후, 설정된 빈도로 시간 단위로 주기적으로 실행합니다.
필드 설명:
- focuspoint, 관심사 설명 (필수), 예: "상하이 초등학교 졸업 정보", "암호화폐 가격"
- explanation, 관심사의 상세 설명 또는 구체적인 약속, 예: "상하이 공식 발표 중학교 입학 정보만 포함", "BTC, ETH의 현재 가격, 등락률 데이터" 등
- activated, 활성화 여부. 비활성화되면 해당 관심사는 무시되며, 비활성화 후 다시 활성화할 수 있습니다.
### 5. 관심사 및 정보원 추가
주의: focus_point 업데이트 설정 (activated 조정 포함) 후, **프로그램을 다시 시작해야 적용됩니다.**
프로그램을 시작한 후, pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/)를 엽니다.
#### 5.2 sites 폼 열기
#### 5.1 sites 폼 열기
이 폼을 통해 사용자 정의 정보 소스를 지정할 수 있으며, 시스템은 백그라운드 정기 작업을 시작하여 로컬에서 정보 소스를 스캔, 구문 분석 및 분석합니다.
이 폼을 통해 정보원을 구성할 수 있습니다. 注意:정보원은 다음 단계의 focus_point 폼에서 선택해야 합니다.
sites 필드 설명:
- url, 정보 소스의 URL, 정보 소스는 구체적인 기사 페이지를 제공할 필요가 없으며, 기사 목록 페이지만 제공하면 됩니다.
- per_hours, 스캔 빈도, 단위는 시간, 정수 형식 (1~24 범위, 스캔 빈도를 하루에 한 번 이상으로 설정하지 않는 것을 권장합니다. 즉, 24로 설정).
- activated, 활성화 여부. 비활성화되면 해당 정보 소스는 무시되며, 비활성화 후 다시 활성화할 수 있습니다.
sites 필드 설명:
- url, 정보원의 url, 정보원은 특정 기사 페이지를 지정할 필요가 없습니다. 기사 목록 페이지를 지정하면 됩니다.
- type, 유형, web 또는 rss입니다.
#### 5.2 focus_point 폼 열기
이 폼을 통해 당신의 관심사를 지정할 수 있습니다. LLM은 이를 기반으로 정보를 추출, 필터링, 분류합니다.
필드 설명:
- focuspoint, 관심사 설명(필수),예를 들어 "새해 할인" 또는 "입찰 공고"
- explanation, 관심사의 상세 설명 또는 구체적 약정, 예를 들어 "어떤 브랜드" 또는 "2025년 1월 1일 이후에 발행된 날짜, 100만원 이상의 금액" 등
- activated, 활성화 여부. 활성화하지 않으면 해당 관심사는 무시됩니다. 활성화하지 않으면 나중에 다시 활성화할 수 있습니다.
- per_hour, 크롤링 빈도, 시간 단위, 정수 형식1~24 범위, 우리는 하루에 한 번씩 스캔하는 것을 추천합니다, 즉 24로 설정합니다)
- search_engine, 각 크롤링 시 검색 엔진을 활성화할지 여부
- sites, 해당 정보원을 선택
**참고V0.3.8 버전 이후, 설정의 변경은 프로그램을 재시작하지 않아도 다음 실행 시 자동으로 적용됩니다.**
**sites 설정 조정은 프로그램을 다시 시작할 필요가 없습니다.**
## 📚 wiseflow가 크롤링한 데이터를 귀하의 프로그램에서 사용하는 방법
@ -254,6 +270,7 @@ PocketBase는 인기 있는 경량 데이터베이스로, 현재 Go/Javascript/P
- crawl4aiOpen-source LLM Friendly Web Crawler & Scraper https://github.com/unclecode/crawl4ai
- pocketbase (Open Source realtime backend in 1 file) https://github.com/pocketbase/pocketbase
- python-pocketbase (pocketBase 클라이언트 SDK for python) https://github.com/vaphes/pocketbase
- feedparser (Parse feeds in Python) https://github.com/kurtmckee/feedparser
또한 [GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor), [AutoCrawler](https://github.com/kingname/AutoCrawler), [SeeAct](https://github.com/OSU-NLP-Group/SeeAct) 에서 영감을 받았습니다.

View File

@ -162,8 +162,8 @@ async def pre_process(raw_markdown: str, base_url: str, used_img: list[str],
ratio = total_links / section_remain_len if section_remain_len != 0 else 1
if ratio > 0.05:
if test_mode:
print('this is a navigation section, will be removed')
print(ratio)
print('\033[31mthis is a navigation section, will be removed\033[0m')
print(ratio, '\n')
print(section_remain)
print('-' * 50)
sections = sections[1:]
@ -172,7 +172,7 @@ async def pre_process(raw_markdown: str, base_url: str, used_img: list[str],
section_remain_len = len(section_remain)
if section_remain_len < 198:
if test_mode:
print('this is a footer section, will be removed')
print('\033[31mthis is a footer section, will be removed\n\033[0m')
print(section_remain_len)
print(section_remain)
print('-' * 50)
@ -184,15 +184,15 @@ async def pre_process(raw_markdown: str, base_url: str, used_img: list[str],
ratio, text = await check_url_text(section)
if ratio < 70:
if test_mode:
print('this is a links part')
print(ratio)
print('\033[32mthis is a links part\033[0m')
print(ratio, '\n')
print(text)
print('-' * 50)
links_parts.append(text)
else:
if test_mode:
print('this is a content part')
print(ratio)
print('\033[34mthis is a content part\033[0m')
print(ratio, '\n')
print(text)
print('-' * 50)
contents.append(text)
@ -257,9 +257,10 @@ async def get_more_related_urls(texts: list[str], link_dict: dict, prompts: list
[{'role': 'system', 'content': sys_prompt}, {'role': 'user', 'content': content}],
model=model, temperature=0.1)
result = re.findall(r'\"\"\"(.*?)\"\"\"', result, re.DOTALL)
if test_mode:
print(f"llm output:\n {result}")
result = re.findall(r'\"\"\"(.*?)\"\"\"', result, re.DOTALL)
if result:
links = re.findall(r'\[\d+\]', result[-1])
for link in links:
@ -284,7 +285,7 @@ async def get_more_related_urls(texts: list[str], link_dict: dict, prompts: list
return more_urls
async def get_info(texts: list[str], link_dict: dict, prompts: list[str], focus_dict: dict, author: str, publish_date: str,
async def get_info(texts: list[str], link_dict: dict, prompts: list[str], author: str, publish_date: str,
test_mode: bool = False, _logger: logger = None) -> list[dict]:
sys_prompt, suffix, model = prompts
@ -294,7 +295,6 @@ async def get_info(texts: list[str], link_dict: dict, prompts: list[str], focus_
else:
info_pre_fix = f"//{author} {publish_date}//"
cache = set()
batches = []
text_batch = ''
while texts:
@ -310,38 +310,27 @@ async def get_info(texts: list[str], link_dict: dict, prompts: list[str], focus_
for content in batches]
results = await asyncio.gather(*tasks)
final = []
for res in results:
if test_mode:
print(f"llm output:\n {res}")
extracted_result = re.findall(r'\"\"\"(.*?)\"\"\"', res, re.DOTALL)
if extracted_result:
cache.add(extracted_result[-1])
final = []
for item in cache:
segs = item.split('//')
i = 0
while i < len(segs) - 1:
focus = segs[i].strip()
if not focus:
i += 1
continue
if focus not in focus_dict:
if _logger:
_logger.info(f"llm hallucination: {item}")
if test_mode:
print(f"llm hallucination: {item}")
i += 1
continue
content = segs[i+1].strip().strip('摘要').strip(':').strip('')
i += 2
if not content or content == 'NA':
res = res.strip().lstrip('摘要').lstrip(':').lstrip('')
if not res or res == 'NA':
continue
"""
maybe can use embedding retrieval to judge
"""
url_tags = re.findall(r'\[\d+\]', content)
refences = {url_tag: link_dict[url_tag] for url_tag in url_tags if url_tag in link_dict}
final.append({'tag': focus_dict[focus], 'content': f"{info_pre_fix}{content}", 'references': refences})
url_tags = re.findall(r'\[\d+]', res)
refences = {}
for _tag in url_tags:
if _tag in link_dict:
refences[_tag] = link_dict[_tag]
else:
if _logger:
_logger.warning(f"model hallucination: {res} \ncontains {_tag} which is not in link_dict")
if test_mode:
print(f"model hallucination: {res} \ncontains {_tag} which is not in link_dict")
res = res.replace(_tag, '')
final.append({'content': f"{info_pre_fix}{res}", 'references': refences})
return final

View File

@ -1,5 +1,5 @@
get_link_system = '''你将被给到一段使用<text></text>标签包裹的网页文本,你的任务是从前到后仔细阅读文本,提取出与如下任一关注点相关的原文片段。关注点及其解释如下:
get_link_system = '''你将被给到一段使用<text></text>标签包裹的网页文本,你的任务是从前到后仔细阅读文本,提取出与如下关注点相关的原文片段。关注点及其解释如下:
{focus_statement}\n
在进行提取时请遵循以下原则
@ -15,15 +15,15 @@ get_link_suffix = '''请逐条输出提取的原文片段,并整体用三引
...
"""'''
get_link_system_en = '''You will be given a webpage text wrapped in <text></text> tags. Your task is to carefully read the text from beginning to end, extracting fragments related to any of the following focus points. The focus points and their explanations are as follows:
get_link_system_en = '''You will be given a webpage text wrapped in <text></text> tags. Your task is to carefully read the text from beginning to end, extracting fragments related to the following focus point. The focus point and its explanation are as follows:
{focus_statement}\n
When extracting fragments, please follow these principles:
- Understand the meaning of each focus point and its explanation (if any), ensure the extracted content strongly relates to the focus point and aligns with the explanation (if any)
- Understand the meaning of the focus point and its explanation (if any), ensure the extracted content strongly relates to the focus point and aligns with the explanation (if any)
- Extract all possible related fragments
- Ensure the extracted fragments retain the reference markers like "[3]", as these will be used in subsequent processing'''
get_link_suffix_en = '''Please output each extracted fragment one by one, and wrap the entire output in triple quotes. The triple quotes should contain only the extracted fragments, with no other content. If the text does not contain any content related to the focus points, keep the triple quotes empty.
get_link_suffix_en = '''Please output each extracted fragment one by one, and wrap the entire output in triple quotes. The triple quotes should contain only the extracted fragments, with no other content. If the text does not contain any content related to the focus point, keep the triple quotes empty.
Here is an example of the output format:
"""
Fragment 1
@ -31,43 +31,24 @@ Fragment 2
...
"""'''
get_info_system = '''你将被给到一段使用<text></text>标签包裹的网页文本,请分别按如下关注点对网页文本提炼摘要。关注点列表及其解释如下:
get_info_system = '''你将被给到一段使用<text></text>标签包裹的网页文本,请按如下关注点对网页文本提炼摘要。关注点及其解释如下
{focus_statement}\n
在提炼摘要时请遵循以下原则
- 理解每个关注点的含义以及进一步的解释如有确保摘要与关注点强相关并符合解释如有的范围
- 摘要应当详实充分且绝对忠于原文
- 理解关注点的含义以及进一步的解释如有确保摘要与关注点强相关并符合解释如有的范围
- 摘要中应该包括与关注点最相关的那些原文片段
- 如果摘要涉及的原文片段中包含类似"[3]"这样的引用标记务必在摘要中保留相关标记'''
get_info_suffix = '''请对关注点逐一生成摘要,不要遗漏任何关注点,如果网页文本与关注点无关,可以对应输出"NA"。输出结果整体用三引号包裹,三引号内不要有其他内容。如下是输出格式示例:
"""
//关注点1//
摘要1
//关注点2//
摘要2
//关注点3//
NA
...
"""'''
get_info_suffix = '''请直接输出摘要不要输出任何其他内容如果网页文本与关注点无关则输出NA。'''
get_info_system_en = '''You will be given a webpage text wrapped in <text></text> tags. Please extract summaries from the text according to the following focus points. The list of focus points and their explanations are as follows:
get_info_system_en = '''You will be given a webpage text wrapped in <text></text> tags. Please extract summaries from the text according to the following focus point. The focus point and its explanation are as follows:
{focus_statement}\n
When extracting summaries, please follow these principles:
- Understand the meaning of each focus point and its explanation (if any), ensure the summary strongly relates to the focus point and aligns with the explanation (if any)
- The summary should be detailed and comprehensive and absolutely faithful to the original text
- Understand the meaning of the focus point and its explanation (if any), ensure the summary strongly relates to the focus point and aligns with the explanation (if any)
- The summary should include the most relevant text fragments related to the focus point
- If the summary involves a reference marker like "[3]", it must be retained in the summary'''
get_info_suffix_en = '''Please generate summaries for each focus point, don't miss any focus points. If the webpage text is not related to a focus point, output "NA" for that point. The entire output should be wrapped in triple quotes with no other content inside. Here is an example of the output format:
"""
//Focus Point 1//
Summary 1
//Focus Point 2//
Summary 2
//Focus Point 3//
NA
...
"""'''
get_info_suffix_en = '''Please output the summary directly, without any other content. If the webpage text is not related to the focus point, output "NA".'''
get_ap_system = "As an information extraction assistant, your task is to accurately extract the source (or author) and publication date from the given webpage text. It is important to adhere to extracting the information directly from the original text. If the original text does not contain a particular piece of information, please replace it with NA"
get_ap_suffix = '''Please output the extracted information in the following format(output only the result, no other content):

View File

@ -9,7 +9,7 @@ set +o allexport
pocketbase_pid=$!
# start the Python task
python tasks.py &
python run_task.py &
python_pid=$!
# start Uvicorn

View File

@ -1,38 +1,47 @@
# -*- coding: utf-8 -*-
import os
from utils.pb_api import PbTalker
from utils.general_utils import get_logger, extract_and_convert_dates, is_chinese
from agents.get_info import *
import json
import asyncio
from scrapers import *
from utils.zhipu_search import run_v4_async
from urllib.parse import urlparse
from crawl4ai import AsyncWebCrawler, CacheMode
from datetime import datetime, timedelta
import logging
from datetime import datetime
import feedparser
logging.getLogger("httpx").setLevel(logging.WARNING)
project_dir = os.environ.get("PROJECT_DIR", "")
if project_dir:
os.makedirs(project_dir, exist_ok=True)
wiseflow_logger = get_logger('general_process', project_dir)
wiseflow_logger = get_logger('wiseflow', project_dir)
pb = PbTalker(wiseflow_logger)
one_month_ago = (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d')
existing_urls = {url['url'] for url in pb.read(collection_name='infos', fields=['url'], filter=f"created>='{one_month_ago}'")}
crawler = AsyncWebCrawler(verbose=False)
model = os.environ.get("PRIMARY_MODEL", "")
if not model:
raise ValueError("PRIMARY_MODEL not set, please set it in environment variables or edit core/.env")
secondary_model = os.environ.get("SECONDARY_MODEL", model)
async def save_to_pb(url: str, url_title: str, infos: list):
# saving to pb process
async def info_process(url: str,
url_title: str,
author: str,
publish_date: str,
contents: list[str],
link_dict: dict,
focus_id: str,
get_info_prompts: list[str]):
infos = await get_info(contents, link_dict, get_info_prompts, author, publish_date, _logger=wiseflow_logger)
if infos:
wiseflow_logger.debug(f'get {len(infos)} infos, will save to pb')
for info in infos:
info['url'] = url
info['url_title'] = url_title
info['tag'] = focus_id
_ = pb.add(collection_name='infos', body=info)
if not _:
wiseflow_logger.error('add info failed, writing to cache_file')
@ -41,29 +50,19 @@ async def save_to_pb(url: str, url_title: str, infos: list):
json.dump(info, f, ensure_ascii=False, indent=4)
async def main_process(_sites: set | list):
# collect tags user set in pb database and determin the system prompt language based on tags
focus_data = pb.read(collection_name='focus_points', filter=f'activated=True')
if not focus_data:
wiseflow_logger.info('no activated tag found, will ask user to create one')
focus = input('It seems you have not set any focus point, WiseFlow need the specific focus point to guide the following info extract job.'
'so please input one now. describe what info you care about shortly: ')
explanation = input('Please provide more explanation for the focus point (if not necessary, pls just press enter: ')
focus_data.append({"focuspoint": focus, "explanation": explanation,
"id": pb.add('focus_points', {"focuspoint": focus, "explanation": explanation})})
focus_dict = {item["focuspoint"]: item["id"] for item in focus_data}
focus_statement = ''
for item in focus_data:
tag = item["focuspoint"]
expl = item["explanation"]
focus_statement = f"{focus_statement}//{tag}//\n"
if expl:
if is_chinese(expl):
focus_statement = f"{focus_statement}解释:{expl}\n"
async def main_process(focus: dict, sites: list):
wiseflow_logger.debug('new task initializing...')
focus_id = focus["id"]
focus_point = focus["focuspoint"].strip()
explanation = focus["explanation"].strip()
wiseflow_logger.debug(f'focus_id: {focus_id}, focus_point: {focus_point}, explanation: {explanation}, search_engine: {focus["search_engine"]}')
existing_urls = {url['url'] for url in pb.read(collection_name='infos', fields=['url'], filter=f"tag='{focus_id}'")}
focus_statement = f"//{focus_point}//"
if explanation:
if is_chinese(explanation):
focus_statement = f"{focus_statement}\n解释:{explanation}"
else:
focus_statement = f"{focus_statement}Explanation: {expl}\n"
focus_statement = f"{focus_statement}\nExplanation: {explanation}"
date_stamp = datetime.now().strftime('%Y-%m-%d')
if is_chinese(focus_statement):
@ -81,9 +80,55 @@ async def main_process(_sites: set | list):
get_info_sys_prompt = f"today is {date_stamp}, {get_info_sys_prompt}"
get_info_suffix_prompt = get_info_suffix_en
recognized_img_cache = {}
get_link_prompts = [get_link_sys_prompt, get_link_suffix_prompt, secondary_model]
get_info_prompts = [get_info_sys_prompt, get_info_suffix_prompt, model]
working_list = set()
working_list.update(_sites)
if focus.get('search_engine', False):
query = focus_point if not explanation else f"{focus_point}({explanation})"
search_intent, search_content = await run_v4_async(query, _logger=wiseflow_logger)
_intent = search_intent['search_intent'][0]['intent']
_keywords = search_intent['search_intent'][0]['keywords']
wiseflow_logger.info(f'query: {query}\nsearch intent: {_intent}\nkeywords: {_keywords}')
search_results = search_content['search_result']
for result in search_results:
url = result['link']
if url in existing_urls:
continue
if '(发布时间' not in result['title']:
wiseflow_logger.debug(f'cannot find publish time in the search result {url}, adding to working list')
working_list.add(url)
continue
title, publish_date = result['title'].split('(发布时间')
title = title.strip() + '(from search engine)'
publish_date = publish_date.strip(')')
# strictly match the YYYY-MM-DD format
date_match = re.search(r'\d{4}-\d{2}-\d{2}', publish_date)
if date_match:
publish_date = date_match.group()
publish_date = extract_and_convert_dates(publish_date)
else:
wiseflow_logger.warning(f'cannot find publish time in the search result {url}, adding to working list')
working_list.add(url)
continue
author = result['media']
texts = [result['content']]
await info_process(url, title, author, publish_date, texts, {}, focus_id, get_info_prompts)
recognized_img_cache = {}
for site in sites:
if site.get('type', 'web') == 'rss':
try:
feed = feedparser.parse(site['url'])
except Exception as e:
wiseflow_logger.warning(f"{site['url']} RSS feed is not valid: {e}")
continue
rss_urls = {entry.link for entry in feed.entries if entry.link}
wiseflow_logger.debug(f'get {len(rss_urls)} urls from rss source {site["url"]}')
working_list.update(rss_urls - existing_urls)
else:
working_list.add(site['url'])
await crawler.start()
while working_list:
url = working_list.pop()
@ -104,7 +149,7 @@ async def main_process(_sites: set | list):
else:
run_config = crawler_config
run_config.cache_mode = CacheMode.WRITE_ONLY if url in _sites else CacheMode.ENABLED
run_config.cache_mode = CacheMode.WRITE_ONLY if url in {site['url'] for site in sites} else CacheMode.ENABLED
result = await crawler.arun(url=url, config=run_config)
if not result.success:
wiseflow_logger.warning(f'{url} failed to crawl, destination web cannot reach, skip')
@ -116,6 +161,8 @@ async def main_process(_sites: set | list):
raw_markdown = result.content
used_img = result.images
title = result.title
if title == 'maybe a new_type_article':
wiseflow_logger.warning(f'we found a new type here, {url}')
base_url = result.base
author = result.author
publish_date = result.publish_date
@ -146,11 +193,10 @@ async def main_process(_sites: set | list):
link_dict, links_parts, contents, recognized_img_cache = await pre_process(raw_markdown, base_url, used_img, recognized_img_cache, existing_urls)
if link_dict and links_parts:
prompts = [get_link_sys_prompt, get_link_suffix_prompt, secondary_model]
links_texts = []
for _parts in links_parts:
links_texts.extend(_parts.split('\n\n'))
more_url = await get_more_related_urls(links_texts, link_dict, prompts, _logger=wiseflow_logger)
more_url = await get_more_related_urls(links_texts, link_dict, get_link_prompts, _logger=wiseflow_logger)
if more_url:
wiseflow_logger.debug(f'get {len(more_url)} more related urls, will add to working list')
working_list.update(more_url - existing_urls)
@ -169,15 +215,6 @@ async def main_process(_sites: set | list):
else:
publish_date = date_stamp
prompts = [get_info_sys_prompt, get_info_suffix_prompt, model]
infos = await get_info(contents, link_dict, prompts, focus_dict, author, publish_date, _logger=wiseflow_logger)
if infos:
wiseflow_logger.debug(f'get {len(infos)} infos, will save to pb')
await save_to_pb(url, title, infos)
await info_process(url, title, author, publish_date, contents, link_dict, focus_id, get_info_prompts)
await crawler.close()
if __name__ == '__main__':
sites = pb.read('sites', filter='activated=True')
wiseflow_logger.info('execute all sites one time')
asyncio.run(main_process([site['url'] for site in sites]))

View File

@ -6,3 +6,4 @@ pydantic
beautifulsoup4
requests
crawl4ai==0.4.247
feedparser==6.0.11

View File

@ -1,9 +1,4 @@
#!/bin/bash
set -o allexport
source .env
set +o allexport
if ! pgrep -x "pocketbase" > /dev/null; then
if ! netstat -tuln | grep ":8090" > /dev/null && ! lsof -i :8090 > /dev/null; then
echo "Starting PocketBase..."
@ -15,4 +10,4 @@ else
echo "PocketBase is already running."
fi
python general_process.py
python run_task.py

36
core/run_task.py Normal file
View File

@ -0,0 +1,36 @@
# -*- coding: utf-8 -*-
from pathlib import Path
from dotenv import load_dotenv
env_path = Path(__file__).parent / '.env'
if env_path.exists():
load_dotenv(env_path)
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)
import asyncio
from general_process import main_process, wiseflow_logger, pb
counter = 0
async def schedule_task():
global counter
while True:
wiseflow_logger.info(f'task execution loop {counter + 1}')
tasks = pb.read('focus_points', filter='activated=True')
sites_record = pb.read('sites')
jobs = []
for task in tasks:
if not task['per_hour'] or not task['focuspoint']:
continue
if counter % task['per_hour'] != 0:
continue
sites = [_record for _record in sites_record if _record['id'] in task['sites']]
jobs.append(main_process(task, sites))
counter += 1
await asyncio.gather(*jobs)
wiseflow_logger.info('task execution loop finished, next run in 3600 seconds')
await asyncio.sleep(3600)
asyncio.run(schedule_task())

View File

@ -1,18 +0,0 @@
#!/bin/bash
set -o allexport
source .env
set +o allexport
if ! pgrep -x "pocketbase" > /dev/null; then
if ! netstat -tuln | grep ":8090" > /dev/null && ! lsof -i :8090 > /dev/null; then
echo "Starting PocketBase..."
../pb/pocketbase serve --http=127.0.0.1:8090 &
else
echo "Port 8090 is already in use."
fi
else
echo "PocketBase is already running."
fi
python tasks.py

View File

@ -243,6 +243,10 @@ def mp_scraper(fetch_result: CrawlResult | dict) -> ScraperResultData:
title = 'maybe a new_type_article'
# take the sibling div immediately following the div that contains the <h1> element as the content
content_div = h1_div.find_next_sibling('div')
if not content_div:
title = 'maybe a new_type_article'
content = raw_markdown
else:
content = title + '\n\n' + process_content(content_div)
else:
author = None

View File

@ -1,31 +0,0 @@
import asyncio
from general_process import main_process, pb, wiseflow_logger
counter = 1
async def schedule_pipeline(interval):
global counter
while True:
wiseflow_logger.info(f'task execute loop {counter}')
sites = pb.read('sites', filter='activated=True')
todo_urls = set()
for site in sites:
if not site['per_hours'] or not site['url']:
continue
if counter % site['per_hours'] == 0:
wiseflow_logger.info(f"applying {site['url']}")
todo_urls.add(site['url'])
counter += 1
await main_process(todo_urls)
wiseflow_logger.info(f'task execute loop finished, work after {interval} seconds')
await asyncio.sleep(interval)
async def main():
interval_hours = 1
interval_seconds = interval_hours * 60 * 60
await schedule_pipeline(interval_seconds)
asyncio.run(main())

31
core/utils/exa_search.py Normal file
View File

@ -0,0 +1,31 @@
import httpx
headers = {
"x-api-key": "",
"Content-Type": "application/json"
}
def search_with_exa(query: str) -> str:
url = "https://api.exa.ai/search"
payload = {
"query": query,
"useAutoprompt": True,
"type": "auto",
"category": "news",
"numResults": 5,
"startCrawlDate": "2024-12-01T00:00:00.000Z",
"endCrawlDate": "2025-01-21T00:00:00.000Z",
"startPublishedDate": "2024-12-01T00:00:00.000Z",
"endPublishedDate": "2025-01-21T00:00:00.000Z",
"contents": {
"text": {
"maxCharacters": 1000,
"includeHtmlTags": False
},
"livecrawl": "always",
}
}
response = httpx.post(url, json=payload, headers=headers, timeout=30)
return response.text
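For reference, a minimal usage sketch of the helper above (hedged: it assumes a valid Exa API key has been filled into `headers`, and the query string is only illustrative):

```python
# hypothetical smoke test; requires the x-api-key value in `headers` above to be filled in
print(search_with_exa("hydrogen energy project tenders since December 2024"))
```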

View File

@ -27,7 +27,8 @@ class PbTalker:
else:
raise Exception("pocketbase auth failed")
def read(self, collection_name: str, fields: Optional[List[str]] = None, filter: str = '', skiptotal: bool = True) -> list:
def read(self, collection_name: str, fields: Optional[List[str]] = None,
expand: Optional[List[str]] = None, filter: str = '', skiptotal: bool = True) -> list:
results = []
i = 1
while True:
@ -35,6 +36,7 @@ class PbTalker:
res = self.client.collection(collection_name).get_list(i, 500,
{"filter": filter,
"fields": ','.join(fields) if fields else '',
"expand": ','.join(expand) if expand else '',
"skiptotal": skiptotal})
except Exception as e:
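A hedged usage sketch of the new `expand` parameter (the `sites` relation name is an assumption taken from the focus_points migration later in this commit, not something this hunk guarantees):

```python
# hypothetical call: read activated focus points together with their related site records
focus_points = pb.read('focus_points', expand=['sites'], filter='activated=True')
```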

View File

@ -0,0 +1,67 @@
import httpx
import uuid
import os
zhipu_api_key = os.getenv('ZHIPU_API_KEY', '')
async def run_v4_async(query: str, _logger=None):
if not zhipu_api_key:
if _logger:
_logger.warning("ZHIPU_API_KEY is not set")
else:
print("ZHIPU_API_KEY is not set")
return None
msg = [
{
"role": "user",
"content": query
}
]
tool = "web-search-pro"
url = "https://open.bigmodel.cn/api/paas/v4/tools"
request_id = str(uuid.uuid4())
data = {
"request_id": request_id,
"tool": tool,
"stream": False,
"messages": msg
}
async with httpx.AsyncClient() as client:
resp = await client.post(
url,
json=data,
headers={'Authorization': zhipu_api_key},
timeout=300
)
result = resp.json()
result = result['choices'][0]['message']['tool_calls']
return result[0], result[1]
if __name__ == '__main__':
test_list = [#'广东全省的台风预警——仅限2024年的信息',
#'大模型技术突破与创新——包括新算法与模型,新的研究成果',
#'事件图谱方面的知识',
#'人工智能领军人物介绍',
#'社区治理',
#'新获批的氢能项目——60万吨级别以上',
'氢能项目招标信息——2024年12月以后',
#'各地住宅网签最新数据——2025年1月6日以后'
]
async def main():
from pprint import pprint
tasks = [run_v4_async(query) for query in test_list]
results = await asyncio.gather(*tasks)
for query, (intent, content) in zip(test_list, results):
print(query)
print('\n')
print('test bigmodel...')
pprint(intent)
print('\n')
pprint(content)
print('\n')
import asyncio
asyncio.run(main())

View File

@ -4,7 +4,7 @@ import subprocess
import socket
import psutil
from pathlib import Path
from dotenv import load_dotenv
# check whether the specified port is in use
def is_port_in_use(port):
@ -61,24 +61,18 @@ def start_pocketbase():
return False
def main():
# load environment variables
env_path = Path(__file__).parent / 'windows.env'
if env_path.exists():
load_dotenv(env_path)
else:
print("Warning: .env file not found")
# start PocketBase
if start_pocketbase():
# run the Python processing script
try:
process_script = Path(__file__).parent / 'windows_general_process.py'
process_script = Path(__file__).parent / 'run_task.py'
if process_script.exists():
subprocess.run([sys.executable, str(process_script)], check=True)
else:
print(f"Error: general_process.py not found at: {process_script}")
print(f"Error: run_task.py not found at: {process_script}")
except subprocess.CalledProcessError as e:
print(f"Error running general_process.py: {e}")
print(f"Error running run_task.py: {e}")
else:
print("Failed to start services")

View File

@ -1,14 +1,15 @@
export LLM_API_KEY=""
export LLM_API_BASE="https://api.siliconflow.cn/v1"
export PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
export SECONDARY_MODEL="Qwen/Qwen2.5-7B-Instruct"
LLM_API_KEY=""
LLM_API_BASE="https://api.siliconflow.cn/v1"
ZHIPU_API_KEY="" #for the search tool
PRIMARY_MODEL="Qwen/Qwen2.5-32B-Instruct"
SECONDARY_MODEL="Qwen/Qwen2.5-14B-Instruct"
#use a secondary model to execute the filtering task to save cost
#if not set, the primary model will be used for the filtering task
export VL_MODEL="OpenGVLab/InternVL2-26B"
export PB_API_AUTH="test@example.com|1234567890" ##your pb superuser account and password
VL_MODEL="OpenGVLab/InternVL2-26B"
PB_API_AUTH="test@example.com|1234567890" ##your pb superuser account and password
##the following is optional, set as needed
#export VERBOSE="true" ##for detail log info. If not need, remove this item.
export PROJECT_DIR="work_dir"
#export PB_API_BASE="" ##only use if your pb not run on 127.0.0.1:8090
#export LLM_CONCURRENT_NUMBER=8 ##for concurrent llm requests, make sure your llm provider supports it(leave default is 1)
#VERBOSE="true" ##for detailed log info. If not needed, remove this item.
PROJECT_DIR="work_dir"
#PB_API_BASE="" ##only use if your pb not run on 127.0.0.1:8090
#LLM_CONCURRENT_NUMBER=8 ##for concurrent llm requests; make sure your llm provider supports it (default is 1)

View File

@ -0,0 +1,41 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
const collection = app.findCollectionByNameOrId("pbc_2001081480")
// remove field
collection.fields.removeById("number1152796692")
// remove field
collection.fields.removeById("bool806155165")
return app.save(collection)
}, (app) => {
const collection = app.findCollectionByNameOrId("pbc_2001081480")
// add field
collection.fields.addAt(2, new Field({
"hidden": false,
"id": "number1152796692",
"max": null,
"min": null,
"name": "per_hours",
"onlyInt": false,
"presentable": false,
"required": false,
"system": false,
"type": "number"
}))
// add field
collection.fields.addAt(3, new Field({
"hidden": false,
"id": "bool806155165",
"name": "activated",
"presentable": false,
"required": false,
"system": false,
"type": "bool"
}))
return app.save(collection)
})

View File

@ -0,0 +1,29 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
const collection = app.findCollectionByNameOrId("pbc_2001081480")
// add field
collection.fields.addAt(2, new Field({
"hidden": false,
"id": "select2363381545",
"maxSelect": 1,
"name": "type",
"presentable": false,
"required": false,
"system": false,
"type": "select",
"values": [
"web",
"rss"
]
}))
return app.save(collection)
}, (app) => {
const collection = app.findCollectionByNameOrId("pbc_2001081480")
// remove field
collection.fields.removeById("select2363381545")
return app.save(collection)
})

View File

@ -0,0 +1,24 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
const collection = app.findCollectionByNameOrId("pbc_3385864241")
// remove field
collection.fields.removeById("bool806155165")
return app.save(collection)
}, (app) => {
const collection = app.findCollectionByNameOrId("pbc_3385864241")
// add field
collection.fields.addAt(3, new Field({
"hidden": false,
"id": "bool806155165",
"name": "activated",
"presentable": false,
"required": false,
"system": false,
"type": "bool"
}))
return app.save(collection)
})

View File

@ -0,0 +1,36 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
const collection = app.findCollectionByNameOrId("pbc_2001081480")
// update field
collection.fields.addAt(1, new Field({
"exceptDomains": [],
"hidden": false,
"id": "url4101391790",
"name": "url",
"onlyDomains": [],
"presentable": true,
"required": true,
"system": false,
"type": "url"
}))
return app.save(collection)
}, (app) => {
const collection = app.findCollectionByNameOrId("pbc_2001081480")
// update field
collection.fields.addAt(1, new Field({
"exceptDomains": [],
"hidden": false,
"id": "url4101391790",
"name": "url",
"onlyDomains": [],
"presentable": false,
"required": true,
"system": false,
"type": "url"
}))
return app.save(collection)
})

View File

@ -0,0 +1,59 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
const collection = app.findCollectionByNameOrId("pbc_3385864241")
// add field
collection.fields.addAt(3, new Field({
"hidden": false,
"id": "bool806155165",
"name": "activated",
"presentable": false,
"required": false,
"system": false,
"type": "bool"
}))
// add field
collection.fields.addAt(4, new Field({
"hidden": false,
"id": "number3171882809",
"max": null,
"min": null,
"name": "per_hour",
"onlyInt": false,
"presentable": false,
"required": true,
"system": false,
"type": "number"
}))
// add field
collection.fields.addAt(5, new Field({
"cascadeDelete": false,
"collectionId": "pbc_2001081480",
"hidden": false,
"id": "relation3154160227",
"maxSelect": 999,
"minSelect": 0,
"name": "sites",
"presentable": false,
"required": false,
"system": false,
"type": "relation"
}))
return app.save(collection)
}, (app) => {
const collection = app.findCollectionByNameOrId("pbc_3385864241")
// remove field
collection.fields.removeById("bool806155165")
// remove field
collection.fields.removeById("number3171882809")
// remove field
collection.fields.removeById("relation3154160227")
return app.save(collection)
})

View File

@ -0,0 +1,24 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
const collection = app.findCollectionByNameOrId("pbc_3385864241")
// add field
collection.fields.addAt(6, new Field({
"hidden": false,
"id": "bool826963346",
"name": "search_engine",
"presentable": false,
"required": false,
"system": false,
"type": "bool"
}))
return app.save(collection)
}, (app) => {
const collection = app.findCollectionByNameOrId("pbc_3385864241")
// remove field
collection.fields.removeById("bool826963346")
return app.save(collection)
})

View File

@ -11,13 +11,13 @@ sys.path.append(core_path)
# modules can now be imported directly because the core directory is already on the Python path
from scrapers import *
from agents.get_info import pre_process
from utils.general_utils import is_chinese
from agents.get_info import get_author_and_publish_date, get_info, get_more_related_urls
from agents.get_info_prompts import *
benchmark_model = 'Qwen/Qwen2.5-72B-Instruct'
# benchmark_model = 'deepseek-chat'
# models = ['deepseek-reasoner']
models = ['Qwen/Qwen2.5-7B-Instruct', 'Qwen/Qwen2.5-14B-Instruct', 'Qwen/Qwen2.5-32B-Instruct', 'deepseek-ai/DeepSeek-V2.5']
async def main(sample: dict, include_ap: bool, prompts: list, focus_dict: dict, record_file: str):
@ -46,7 +46,7 @@ async def main(sample: dict, include_ap: bool, prompts: list, focus_dict: dict,
print(f"get more related urls time: {get_more_url_time}")
start_time = time.time()
infos = await get_info(contents, link_dict, [get_info_sys_prompt, get_info_suffix_prompt, model], focus_dict, author, publish_date, test_mode=True)
infos = await get_info(contents, link_dict, [get_info_sys_prompt, get_info_suffix_prompt, model], author, publish_date, test_mode=True)
get_info_time = time.time() - start_time
print(f"get info time: {get_info_time}")
@ -60,7 +60,7 @@ async def main(sample: dict, include_ap: bool, prompts: list, focus_dict: dict,
diff = f'差异{total_diff}个(遗漏{missing_in_cache}个,多出{extra_in_cache}个)'
related_urls_to_record = '\n'.join(more_url)
infos_to_record = [f"{fi['tag']}: {fi['content']}" for fi in infos]
infos_to_record = [fi['content'] for fi in infos]
infos_to_record = '\n'.join(infos_to_record)
with open(record_file, 'a') as f:
f.write(f"model: {model}\n")

View File

@ -10,6 +10,9 @@ sys.path.append(core_path)
from scrapers import *
from agents.get_info import pre_process
save_dir = 'webpage_samples'
def check_url_text(text):
common_chars = ',.!;:,;:、一二三四五六七八九十#*@% \t\n\r|*-_…>#'
print(f"processing: {text}")
@ -67,6 +70,9 @@ async def main(html_sample, record_file):
raw_markdown = result.content
used_img = result.images
title = result.title
if title == 'maybe a new_type_article':
print(f'\033[31mwe found a new type here, {record_file}\033[0m')
return
base_url = result.base
author = result.author
publish_date = result.publish_date
@ -118,7 +124,7 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--test_file', '-F', type=str, default='')
parser.add_argument('--sample_dir', '-D', type=str, default='')
parser.add_argument('--record_folder', '-R', type=str, default='')
parser.add_argument('--record_folder', '-R', type=str, default=save_dir)
args = parser.parse_args()
test_file = args.test_file

View File

@ -1 +1 @@
v0.3.7
v0.3.8

View File

@ -101,3 +101,23 @@ _注意如果你使用OrbStack直接用软件界面打开container的
The author has written very detailed documentation, especially for the API part; please consider giving the author a tip if it helps you.
**<span style="color:red;">Statement: the wiseflow project currently does not involve, and will never involve, any cracking or reverse engineering of the WeChat client. We fully respect and strictly abide by WeChat's agreements and Tencent's intellectual property rights, and we ask all users to take note of this.</span>**
# Usage
Once the deployment above is complete and messages can be received through wxbot, create a config.json file in this directory ('wiseflow/weixin_mp') with content like the following (illustrative):
```json
{"01o1g6n53o14gu5": ["新华网", "__all__"]}
```
The key is a focus point id; get it from the "focus_points" collection in the pb admin UI at http://127.0.0.1:8090/_/. The value is the list of official-account names bound to that focus point: all information from these accounts will be associated with the focus point (i.e. extracted against it).
If you want articles from every official account to be associated with one particular focus point, add the entry `"__all__"` to that focus point's value list (a hypothetical two-point config is sketched below).
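For illustration, a hypothetical config with two focus points might look like this (the second id and the account names are placeholders, not real records):

```json
{
  "01o1g6n53o14gu5": ["新华网", "人民日报"],
  "0abcd1234efgh56": ["__all__"]
}
```

With this sketch, articles from the two named accounts are extracted against the first focus point, while every other official account falls back to the second one.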
After that, run the following in this directory:
```bash
python __init__.py
```
**Note: make sure both pocketbase and wxbot are already running before this step**

View File

@ -7,25 +7,33 @@ import os, sys
core_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'core')
sys.path.append(core_path)
env_path = os.path.join(core_path, '.env')
from general_process import main_process, wiseflow_logger
from dotenv import load_dotenv
if os.path.exists(env_path):
print(f"loading env from {env_path}")
load_dotenv(env_path)
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)
from general_process import main_process, wiseflow_logger, pb
from typing import Optional
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)
# IMPORTANT: when scanning the QR code to log in, do not select “同步历史消息” (sync history messages), otherwise the bot will start replying to historical messages one by one
# IMPORTANT: when scanning the QR code to log in, do not select "同步历史消息" (sync history messages), otherwise the bot will start replying to historical messages one by one
# first check the WeChat login status and get the wxid of the logged-in account
WX_BOT_ENDPOINT = os.environ.get('WX_BOT_ENDPOINT', '127.0.0.1:8066')
wx_url = f"http://{WX_BOT_ENDPOINT}/api/"
try:
# send a GET request
response = httpx.get(f"{wx_url}checklogin")
response.raise_for_status() # check that the HTTP status code is 200
# parse the JSON response
data = response.json()
# check the status field
if data['data']['status'] == 1:
# record the wxid
@ -51,16 +59,27 @@ wiseflow_logger.info(f"self_nickname: {self_nickname}")
# Note: use the official account's original id here (the one starting with gh_); it can be found in historical logger output
config_file = 'config.json'
if not os.path.exists(config_file):
config = None
wiseflow_logger.error("config.json not found, please create it in the same folder as this script")
raise ValueError(f"config.json not found, please create it in the same folder as this script")
else:
with open(config_file, 'r', encoding='utf-8') as f:
config = json.load(f)
focus_points = pb.read('focus_points', fields=['id', 'focuspoint', 'explanation'])
_dict = {point['id']: point for point in focus_points}
focus_dict = {}
defaut_focus = None
for key, value in config.items():
if "__all__" in value:
defaut_focus = _dict[key]
else:
for nickname in value:
focus_dict[nickname] = _dict[key]
# The following patterns only apply to parsing public msgs; official-account articles shared in group chats are not covered
# The XML parsing scheme is not used because there are abnormal characters in the XML code extracted from the weixin public_msg
item_pattern = re.compile(r'<item>(.*?)</item>', re.DOTALL)
url_pattern = re.compile(r'<url><!\[CDATA\[(.*?)]]></url>')
appname_pattern = re.compile(r'<appname><!\[CDATA\[(.*?)]]></appname>')
async def get_public_msg(websocket_uri):
reconnect_attempts = 0
@ -71,16 +90,23 @@ async def get_public_msg(websocket_uri):
while True:
response = await websocket.recv()
datas = json.loads(response)
todo_urls = set()
for data in datas["data"]:
if "StrTalker" not in data or "Content" not in data:
if "Content" not in data:
wiseflow_logger.warning(f"invalid data:\n{data}")
continue
user_id = data["StrTalker"]
# user_id = data["StrTalker"]
appname_match = appname_pattern.search(data["Content"])
appname = appname_match.group(1).strip() if appname_match else None
if not appname:
wiseflow_logger.warning(f"can not find appname in \n{data['Content']}")
continue
focus = focus_dict.get(appname, defaut_focus)
if not focus:
wiseflow_logger.debug(f"{appname} related to no focus and there is no default focus")
continue
sites = []
items = item_pattern.findall(data["Content"])
# Iterate through all <item> elements, extracting <url> and <summary>
for item in items:
url_match = url_pattern.search(item)
url = url_match.group(1) if url_match else None
@ -94,9 +120,10 @@ async def get_public_msg(websocket_uri):
url = url[:cut_off_point - 1]
# summary_match = summary_pattern.search(item)
# addition = summary_match.group(1) if summary_match else None
todo_urls.add(url)
if todo_urls:
await main_process(todo_urls)
sites.append({'url': url, 'type': 'web'})
if sites:
# create the task without waiting for it to complete
asyncio.create_task(main_process(focus, sites))
except websockets.exceptions.ConnectionClosedError as e:
wiseflow_logger.error(f"Connection closed with exception: {e}")
reconnect_attempts += 1

1
weixin_mp/config.json Normal file
View File

@ -0,0 +1 @@
{"01o1g6n53o14gu5": ["新华网", "__all__"]}

View File

@ -1,18 +0,0 @@
#!/bin/bash
set -o allexport
source ../core/.env
set +o allexport
if ! pgrep -x "pocketbase" > /dev/null; then
if ! netstat -tuln | grep ":8090" > /dev/null && ! lsof -i :8090 > /dev/null; then
echo "Starting PocketBase..."
../pb/pocketbase serve --http=127.0.0.1:8090 &
else
echo "Port 8090 is already in use."
fi
else
echo "PocketBase is already running."
fi
python __init__.py