mirror of https://github.com/TeamWiseFlow/wiseflow.git (synced 2025-01-23 02:20:20 +08:00)

V0.3.5

This commit is contained in:
parent cad383b0fe
commit 7752b4b3b4
@@ -1,4 +1,4 @@
-# V0.3.2
+# V0.3.5
+
+- Introduced Crawlee (playwright module), significantly enhancing general crawling capability and adapting it to real-world tasks;
@@ -14,7 +14,6 @@ ADD https://github.com/pocketbase/pocketbase/releases/download/v0.23.4/pocketbas
# for arm device
# ADD https://github.com/pocketbase/pocketbase/releases/download/v0.23.4/pocketbase_0.23.4_linux_arm64.zip /tmp/pb.zip
RUN unzip /tmp/pb.zip -d /pb/
COPY pb/pb_migrations /pb/pb_migrations
RUN apt-get clean && rm -rf /var/lib/apt/lists/*

EXPOSE 8090
README.md (190 changed lines)
@@ -8,42 +8,34 @@

🌱 See how Chief Intelligence Officer (Wiseflow) helps you save time, filter out irrelevant information, and organize the points you care about! 🌱

https://github.com/TeamWiseFlow/wiseflow/assets/96130569/bd4b2091-c02d-4457-9ec6-c072d8ddfb16

-## 🔥 Proudly introducing V0.3.2
+## 🔥 Proudly introducing V0.3.5

-Building on community feedback, we have re-distilled wiseflow's product positioning. The new positioning is more precise and more focused, and V0.3.2 is a brand-new architecture under that positioning, with the following improvements over earlier versions:
+Building on community feedback, we have re-distilled wiseflow's product positioning. The new positioning is more focused, and V0.3.5 is a brand-new architecture under that positioning:

-- Introduced the [Crawlee](https://github.com/apify/crawlee-python) base crawler architecture, greatly improving page-fetching capability. Pages that previously could not be fetched (including those that came back garbled) are now fetched well; if you run into pages that still fetch poorly, please report them in [issue #136](https://github.com/TeamWiseFlow/wiseflow/issues/136);
-- A brand-new extraction strategy under the new positioning, "crawl and extract in one": article-level extraction is dropped, and the llm extracts the information the user cares about (infos) directly from the page while automatically judging which links are worth following up;
+- Introduced [Crawlee](https://github.com/apify/crawlee-python) as the base crawling and task-management framework, greatly improving page-fetching capability. Pages that previously could not be fetched (including those that came back garbled) are now fetched well; if you run into pages that still fetch poorly, please report them in [issue #136](https://github.com/TeamWiseFlow/wiseflow/issues/136);
+- A brand-new extraction strategy under the new positioning, "crawl and extract in one": article-level extraction is dropped, and during crawling the llm directly extracts the information the user cares about (infos) while automatically judging which links are worth following up; **what you care about is what you need**;
+- Adapted to the latest Pocketbase (v0.23.4), with updated collection schemas. The new architecture no longer needs modules such as GNE, and the requirements shrink to 8 dependencies;
+- Deployment is simpler under the new architecture too: docker mode supports hot-updating from the code repo, so future upgrades no longer require another docker build.
+- For more details, see the [CHANGELOG](CHANGELOG.md)

-🌟 Note:
+🌟 **Follow-up plan for V0.3.x**

-V0.3.2 changes a lot in both architecture and dependencies, so be sure to re-pull the repo and redeploy following the latest instructions. V0.3.2 supports running from source in a python environment as well as docker container deployment, and a no-deployment service site is coming soon: register an account and use it directly. Stay tuned!
-
-Screenshot of V0.3.2 in action:
-
-<img alt="sample.png" src="asset/sample.png" width="1024"/>
-
-Further upgrades planned for V0.3.x:
-
-- Try a new mp_crawler so that monitoring WeChat official-account articles no longer needs wxbot;
-- Introduce support for RSS sources;
-- Introduce the [SeeAct](https://github.com/OSU-NLP-Group/SeeAct) approach, using visual large models to improve wiseflow's autonomous deep-mining ability.
+- Introduce the [SeeAct](https://github.com/OSU-NLP-Group/SeeAct) approach, using visual large models to drive actions on complex pages, e.g. information that only appears after scrolling or clicking (V0.3.6);
+- Try wxbot-free subscription to WeChat official accounts (V0.3.7);
+- Introduce support for RSS sources (V0.3.8);
+- Try an llm-driven lightweight knowledge graph to help users build insight from the infos (V0.3.9).

## ✋ How does wiseflow differ from traditional crawlers, AI search, and knowledge-base (RAG) projects?

-Thanks to everyone's warm support, wiseflow has drawn wide attention from the open-source community since V0.3.0 was released at the end of June 2024, even attracting unsolicited coverage from media outlets, for which we are grateful!
+wiseflow has drawn wide attention from the open-source community since V0.3.0 was released at the end of June 2024, even attracting unsolicited coverage from media outlets, for which we are grateful!

-We have also noticed that some followers misunderstand wiseflow's functional positioning. The table below compares wiseflow with traditional crawler tools, AI search, and knowledge-base (RAG) projects:
+We have also noticed that some followers misunderstand wiseflow's functional positioning. The table below, a comparison with traditional crawler tools, AI search, and knowledge-base (RAG) projects, reflects our latest thinking on wiseflow's positioning.

| | Compared with **Chief Intelligence Officer (Wiseflow)** |
|-------------|-----------------|
-| **Crawler tools** | First, wiseflow is a project built on crawler tools (in the current version it integrates the excellent open-source crawler project Crawlee, which itself builds on familiar libraries such as beautifulsoup, playwright, and httpx)... but traditional crawler tools are developer-facing: the developer has to explore the target site's structure by hand and work out the xpath of the elements to extract. This not only shuts out ordinary users, it also has zero generality: every site (including upgrades of an existing site) needs the analysis redone. Before LLMs this problem had no solution; the direction wiseflow pursues is to use LLMs to automate the analysis and exploration of target sites, achieving a "general crawler that ordinary users can use". In that sense, you can simply think of wiseflow as "an AI agent that can use crawler tools on its own" |
+| **Crawler tools** | First, wiseflow is a project built on crawler tools (in the current version we build on the crawler framework Crawlee). But traditional crawler tools need human intervention for extraction, supplying explicit Xpath and the like... This not only shuts out ordinary users, it also has zero generality: every site (including upgrades of an existing site) needs the analysis redone by hand and the extraction code updated. wiseflow aims to automate page analysis and extraction with LLMs; the user only has to tell the program what they care about. In that sense, you can simply think of wiseflow as "an AI agent that can use crawler tools on its own" |
| **AI search** | AI search is mainly for **instant Q&A on concrete questions**, e.g. "who founded company XX" or "where is product xx of brand xx sold"; the user wants **one answer**. wiseflow is for the **continuous collection of information along some dimension**, e.g. tracking information related to company XX, or continuously tracking brand XX's market behavior... In these scenarios the user can supply a focus point (a company, a brand) and even sources (site urls, etc.), but cannot pose a concrete search question; what the user wants is **a stream of related information** |
| **Knowledge-base (RAG) projects** | Knowledge-base (RAG) projects are generally downstream tasks over existing information and usually target private knowledge (say, in-company operating manuals, product manuals, or government documents). wiseflow currently integrates no downstream tasks and targets public information on the internet. Viewed as "agents", the two are built for different purposes: RAG projects are "(internal) knowledge-assistant agents", while wiseflow is an "(external) information-collection agent" |
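As context for the Crawlee bullets above, here is a minimal sketch of a crawlee-python PlaywrightCrawler. The import path and API are an assumption based on crawlee-python around the time of this release (later releases expose these classes under crawlee.crawlers):

```python
import asyncio

# Assumed import path for crawlee-python as of this release;
# newer versions moved these classes to crawlee.crawlers.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # The same rendered-page call wiseflow's own request_handler
        # uses further down in this diff.
        text = await context.page.inner_text('body')
        context.log.info(f'{context.request.url}: {len(text)} characters')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```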
@@ -57,31 +49,56 @@ Further upgrades planned for V0.3.x:

git clone https://github.com/TeamWiseFlow/wiseflow.git
```

### 2. Referring to env_sample, create the .env file and place it in the core directory

-🌟 **This differs from earlier versions**: starting with V0.3.2, the .env must be placed in the core folder.
+🌟 **This differs from earlier versions**: starting with V0.3.5, the .env must be placed in the core folder.

-Also, from V0.3.2 on, the env configuration is greatly simplified; only three items are required:
+Also, from V0.3.5 on, the env configuration is greatly simplified; only three items are required:

-- LLM_API_KEY="" # your LLM service key; required
-- LLM_API_BASE="https://api.siliconflow.cn/v1" # the service endpoint; any provider compatible with the openai sdk works (siliconflow recommended); if you use openai's own service, this can be left out
-- PB_API_AUTH="test@example.com|1234567890" # superuser username and password for the pocketbase database; remember to separate them with |
+- LLM_API_KEY=""
+
+  Your LLM service key; required.
+
+- LLM_API_BASE="https://api.siliconflow.cn/v1"
+
+  The service endpoint; any provider compatible with the openai sdk works. If you use openai's own service, this can be left out.
+
+- PB_API_AUTH="test@example.com|1234567890"
+
+  The superuser username and password for the pocketbase database; remember to separate them with |

Everything below is optional:

-- #VERBOSE="true" # whether to enable observation mode; when on, debug log entries are also written to the logger file (by default they only go to the console), and playwright opens a visible browser window so you can watch the crawl, at the cost of crawl speed
-- #PRIMARY_MODEL="Qwen/Qwen2.5-7B-Instruct" # primary model; with the siliconflow service, leaving this empty defaults to Qwen2.5-7B-Instruct, which is mostly sufficient in practice, but I **recommend Qwen2.5-14B-Instruct**
-- #SECONDARY_MODEL="THUDM/glm-4-9b-chat" # secondary model; with the siliconflow service, leaving this empty defaults to glm-4-9b-chat
-- #PROJECT_DIR="work_dir" # runtime data directory; if unset, defaults to `core/work_dir`; note that the whole core directory is mounted into the container, so you can access it directly
-- #PB_API_BASE"="" # only needed if your pocketbase does not run on the default ip or port; ignore it in the default setup
+- #VERBOSE="true"
+
+  Whether to enable observation mode. When on, debug log entries are also written to the logger file (by default they only go to the console), and playwright opens a visible browser window so you can watch the crawl;
+
+- #PRIMARY_MODEL="Qwen/Qwen2.5-7B-Instruct"
+
+  Primary model. With the siliconflow service, leaving this empty defaults to Qwen2.5-7B-Instruct, which is mostly sufficient in practice, but I **recommend Qwen2.5-14B-Instruct**
+
+- #SECONDARY_MODEL="THUDM/glm-4-9b-chat"
+
+  Secondary model. With the siliconflow service, leaving this empty defaults to glm-4-9b-chat.
+
+- #PROJECT_DIR="work_dir"
+
+  Runtime data directory; if unset, defaults to `core/work_dir`. Note that the whole core directory is mounted into the container, so you can access it directly.
+
+- #PB_API_BASE=""
+
+  Only needed if your pocketbase does not run on the default ip or port; in the default setup just ignore it.
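As a sanity check, a small Python sketch of how these variables could be read and validated; the names are exactly the ones listed above, and the split on `|` follows the PB_API_AUTH convention just described:

```python
import os

# Required: the LLM service key.
llm_api_key = os.environ.get('LLM_API_KEY', '')
if not llm_api_key:
    raise SystemExit('LLM_API_KEY is required')

# Optional: falls back to openai's own endpoint when unset.
llm_api_base = os.environ.get('LLM_API_BASE', 'https://api.openai.com/v1')

# Required: superuser credentials, '|'-separated as documented above.
try:
    email, password = os.environ['PB_API_AUTH'].split('|', 1)
except (KeyError, ValueError):
    raise SystemExit("PB_API_AUTH must look like 'email|password'")
```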
-### 3.1 Build an image with docker and run
+### 3.1 Run with docker

+✋ The V0.3.5 architecture and dependencies differ substantially from earlier versions. Be sure to re-pull the code, delete the old image (including the mounted pb_data folder), and rebuild!

Users in mainland China can configure a registry mirror first:

For currently usable docker registry mirrors, see [ref 1](https://github.com/dongyubin/DockerHub) and [ref 2](https://www.coderjia.cn/archives/dba3f94c-a021-468a-8ac6-e840f85867ea)

-🌟 **Third-party mirrors: use at your own risk.**
+**Third-party mirrors: use at your own risk.**

Then

```bash
cd wiseflow
@@ -90,14 +107,14 @@ docker compose up

**Note:**

-The first time you run the docker container you may hit an error; this is actually normal, because you have not yet created a super user account for the pb repository.
-Keep the container running, open `http://127.0.0.1:8090/_/` in a browser, create a super user account as prompted (be sure to use an email address), then put the created username and password into the .env file and restart the container.
+The first time you run the docker container the program may report an error; this is normal. Create a super user account as prompted on screen (be sure to use an email address), then put the created username and password into the .env file and restart the container.

-🌟 docker runs task by default
+🌟 The docker setup runs task.py by default, i.e. it periodically executes the crawl-and-extract job (it runs once immediately at startup, then kicks off once every hour)

### 3.2 Run in a python environment

+✋ The V0.3.5 architecture and dependencies differ substantially from earlier versions. Be sure to re-pull the code and delete (or rebuild) pb_data

conda is recommended for building the virtual environment

```bash
@@ -108,89 +125,65 @@ cd core

pip install -r requirements.txt
```

-Then refer to the scripts in core/scripts to start pb, task, and backend separately (move the script files into the core directory)
-
-**Note:**
-- Always start pb first; task and backend are independent processes, their order does not matter, and you can start only one of them as needed;
-- First download the pocketbase client matching your device from https://pocketbase.io/docs/ and place it in the /core/pb directory
-- For pb runtime issues (including errors on first run), see [core/pb/README.md](/pb/README.md)
-- Before use, create and edit the .env file and place it in the wiseflow repo root (one level above core); see env_sample for reference, with detailed configuration notes below
+Next, [download](https://pocketbase.io/docs/) the matching pocketbase client and place it in the [/pb](/pb) directory. Then

📚 for developers, see [/core/README.md](/core/README.md) for more

Access the collected data via pocketbase:
- http://127.0.0.1:8090/_/ - Admin dashboard UI
- http://127.0.0.1:8090/api/ - REST API
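To illustrate, reading the collected data from your own code could look like this with the python-pocketbase client credited at the end of this README. A sketch only: the auth call and the dynamic record-attribute access are assumptions about that client's API:

```python
from pocketbase import PocketBase  # vaphes/pocketbase client

client = PocketBase('http://127.0.0.1:8090')

# superuser credentials: the two halves of PB_API_AUTH
client.admins.auth_with_password('test@example.com', '1234567890')

# every info the crawler has extracted so far
for record in client.collection('infos').get_full_list():
    # field access via dynamic record attributes (an assumption about the client)
    print(record.url, record.content)
```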
```bash
chmod +x run.sh
./run_task.sh # if you just want to scan sites one-time (no loop), use ./run.sh
```

-### 3. Configuration
+This script detects whether pocketbase is already running and starts it automatically if not. Note, though, that when you terminate the process with ctrl+c or ctrl+z, the pocketbase process is not killed until you close the terminal.

-Copy env_sample from the directory, rename it to .env, and fill in your configuration (LLM service token, etc.) as follows
-
-**windows users who run the python program directly can instead set the following items under "Start - Settings - System - About - Advanced system settings - Environment variables" (restart the terminal afterwards for them to take effect)**
+As with the docker deployment, the first run may report an error; create a super user account as prompted on screen (be sure to use an email address), then put the created username and password into the .env file and run again.

-- LLM_API_KEY # API KEY for the LLM inference service
-- LLM_API_BASE # this project relies on the openai sdk; any model service supporting the openai interface works via this setting; when using openai itself, just delete this item
-- WS_LOG="verbose" # whether to enable debug observation; delete if not needed
-- GET_INFO_MODEL # model for info extraction and tag matching, default gpt-4o-mini-2024-07-18
-- REWRITE_MODEL # model for merging and rewriting near-duplicate info, default gpt-4o-mini-2024-07-18
-- HTML_PARSE_MODEL # web-page parsing model (enabled intelligently when the GNE algorithm underperforms), default gpt-4o-mini-2024-07-18
-- PROJECT_DIR # storage location for data, cache, and log files, relative to the repo; unset means the repo itself
-- PB_API_AUTH='email|password' # email and password of the pb database admin (it must be an email, a made-up one is fine)
-- PB_API_BASE # not needed for normal use; only when you do not use the default local pocketbase interface (port 8090)

-### 4. Model recommendation [2024-09-03]
+You can also run and set up pocketbase in a separate terminal beforehand (this avoids the first-run error); see [pb/README.md](/pb/README.md) for details.

-After repeated tests (on Chinese and English tasks), the minimum usable models for **GET_INFO_MODEL**, **REWRITE_MODEL**, and **HTML_PARSE_MODEL** are **"THUDM/glm-4-9b-chat"**, **"Qwen/Qwen2-7B-Instruct"**, and **"Qwen/Qwen2-7B-Instruct"** respectively
-
-SiliconFlow has officially announced free online inference for Qwen2-7B-Instruct and glm-4-9b-chat, which means you can use wiseflow at "zero cost"!
+### 4. Model recommendation [2024-12-09]

+Although larger models generally mean better performance, real-world tests show that **Qwen2.5-7b-Instruct and glm-4-9b-chat already deliver the basic results**. Weighing cost, speed, and quality, though, I recommend **Qwen2.5-14B-Instruct** as the primary model **(PRIMARY_MODEL)**.

+We still strongly recommend siliconflow's MaaS service. It serves several mainstream open-source models with generous quotas; Qwen2.5-7b-Instruct and glm-4-9b-chat are currently free. (With Qwen2.5-14B-Instruct as the primary model, crawling 374 pages and extracting 43 valid infos cost ¥3.07 in total.)

-😄 If you like, you can use my [siliconflow referral link](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92), which also earns me more token credits 😄
+😄 If you like, you can use my [siliconflow referral link](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92), which also earns me more token credits 🌹

-⚠️ **V0.3.1 update**
-
-If you use complex tags with explanations, a model at glm-4-9b-chat's scale cannot understand them reliably; the models that tested best for that kind of task are **Qwen/Qwen2-72B-Instruct** and **gpt-4o-mini-2024-07-18**.
+**If your sources are mostly non-Chinese pages and you do not require the extracted info to be in Chinese, models from overseas vendors such as openai or claude are the better fit.**

-Users who need `gpt-4o-mini-2024-07-18` can try the third-party proxy **AiHubMix**, reachable directly from mainland-China networks with Alipay top-up (effectively 86% of the list price)
+You can try the third-party proxy **AiHubMix**: direct access from mainland-China networks, convenient Alipay payment, and no account-ban risk;

-🌹 Feel free to register through this [AiHubMix referral link](https://aihubmix.com?aff=Gp54) 🌹
+😄 Feel free to register through this [AiHubMix referral link](https://aihubmix.com?aff=Gp54) 🌹

-🌍 The online inference services of both platforms are openai-SDK-compatible; set `LLM_API_BASE` and `LLM_API_KEY` in `.env` and you are ready to go.
+🌟 **Note that wiseflow itself does not tie you to any particular model service: any openAI-SDK-compatible service works, including locally deployed ollama, Xinference, and the like**
### 5. **Adding focus points and scheduled source scans**

After starting the program, open the pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/)

-#### 5.1 Open the tags form
+#### 5.1 Open the focus_point form

This form is where you specify your focus points; the LLM extracts, filters, and classifies information accordingly.

-tags field notes:
-- name, name of the focus point
-- explaination, a detailed explanation or concrete constraints for the focus point, e.g. "仅限上海市官方发布的初中升学信息" (only junior-high admission information officially published by Shanghai; the tag name being 上海初中升学信息)
-- activated, whether it is active. If off, the focus point is ignored; it can be switched on again later. Toggling does not require restarting the docker container; it is picked up at the next scheduled run.
+Field notes:
+- focuspoint, description of the focus point (required), e.g. "上海小升初信息" (Shanghai junior-high admissions) or "加密货币价格" (crypto prices)
+- explanation, a detailed explanation or concrete constraints for the focus point, e.g. "仅限上海市官方发布的初中升学信息" (only officially published Shanghai admission info) or "BTC、ETH 的现价、涨跌幅数据" (current price and movement of BTC and ETH)
+- activated, whether it is active. If off, the focus point is ignored; it can be switched on again later.

+Note: after focus_point settings change (including activated toggles), **the program must be restarted for the change to take effect.**

#### 5.2 Open the sites form

This form is where you specify custom sources; the system starts a background scheduled task that scans, parses, and analyses the sources locally.

sites field notes:
- url, the url of the source. The source does not need a specific article page; the article list page is enough.
- per_hours, scan frequency in hours, as an integer (range 1 to 24; we suggest scanning no more than once a day, i.e. set it to 24)
-- activated, whether it is active. If off, the source is ignored; it can be switched on again later. Toggling does not require restarting the docker container; it is picked up at the next scheduled run.
+- activated, whether it is active. If off, the source is ignored; it can be switched on again later.

+**Adjusting sites settings does not require a program restart.**
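Both forms can also be filled in from code. Below is a sketch that mirrors the pb.add(collection, body) call visible in the GeneralInfoExtractor hunk further down this diff; the import paths and logger name are assumptions about the repo layout:

```python
# Sketch: pb.add(collection, body) matches the call used in the
# get_info.py hunk below; import paths are assumptions about core/.
from utils.pb_api import PbTalker
from utils.general_utils import get_logger

logger = get_logger('seed_forms', 'work_dir')  # hypothetical logger name/dir
pb = PbTalker(logger)

# a focus point, with the fields documented above
pb.add('focus_points', {
    'focuspoint': '上海小升初信息',
    'explanation': '仅限上海市官方发布的初中升学信息',
    'activated': True,
})

# a source site: give the list page, scanned once a day
pb.add('sites', {
    'url': 'https://example.com/news',
    'per_hours': 24,
    'activated': True,
})
```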
### 6. Local deployment

As you can see, this project only needs LLMs at the 7b/9b scale and no vector model at all, which means a single RTX 3090 (24 GB VRAM) is enough to deploy the project fully locally.

Make sure your locally deployed LLM service is openai-SDK-compatible, configure LLM_API_BASE, and you are done.

-Note: for a 7b~9b LLM to understand tag explanations accurately, prompt optimization with dspy is recommended, but that requires about 50 manually labelled samples. See [DSPy](https://dspy-docs.vercel.app/)
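Concretely, "openai-SDK-compatible" means a call like the following must succeed against your local endpoint. The base_url, api_key, and model name below are placeholders for whatever your own ollama or Xinference deployment exposes:

```python
from openai import OpenAI

# Point the standard openai client at the local service; base_url and
# model are placeholders for your own deployment.
client = OpenAI(
    base_url='http://localhost:11434/v1',  # e.g. ollama's openai-compatible endpoint
    api_key='ollama',  # many local servers accept any non-empty key
)

resp = client.chat.completions.create(
    model='qwen2.5:14b-instruct',
    messages=[{'role': 'user', 'content': 'hello'}],
)
print(resp.choices[0].message.content)
```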
-## 🔄 How to use the data wiseflow collects in your own programs
+## 📚 How to use the data wiseflow collects in your own programs

1. Use the [dashboard](dashboard) source code as a starting point for secondary development.

@@ -211,19 +204,22 @@ PocketBase is a popular lightweight database, with Go/Javascript/Python SDKs already available

For commercial use and customization, contact us at **Email: 35252986@qq.com**

- Commercial customers, please contact us to register; the product promises to be free forever.

## 📬 Contact

-For any question or suggestion, feel free to contact us via an [issue](https://github.com/TeamWiseFlow/wiseflow/issues).
+For any question or suggestion, feel free to leave a message in an [issue](https://github.com/TeamWiseFlow/wiseflow/issues).

## 🤝 This project is built on the following excellent open-source projects:

-- GeneralNewsExtractor (General Extractor of News Web Page Body Based on Statistical Learning) https://github.com/GeneralNewsExtractor/GeneralNewsExtractor
+- crawlee-python (A web scraping and browser automation library for Python to build reliable crawlers. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.) https://github.com/apify/crawlee-python
- json_repair (Repair invalid JSON documents) https://github.com/josdejong/jsonrepair/tree/main
- python-pocketbase (pocketBase client SDK for python) https://github.com/vaphes/pocketbase
+- SeeAct (a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4Vision.) https://github.com/OSU-NLP-Group/SeeAct

+Also inspired by [GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor) and [AutoCrawler](https://github.com/kingname/AutoCrawler).

## Citation

@@ -233,4 +229,4 @@

Author: Wiseflow Team
https://github.com/TeamWiseFlow/wiseflow
Licensed under Apache2.0
```
asset/sample.png (binary file deleted, 533 KiB; not shown)
@@ -10,4 +10,5 @@ services:
      - 8090:8090
    volumes:
      - ./core:/app
      - ./pb/pb_data:/pb/pb_data
+     - ./pb/pb_migrations:/pb/pb_migrations
@@ -23,7 +23,7 @@ class GeneralInfoExtractor:
            focus = input('It seems you have not set any focus point, WiseFlow need the specific focus point to guide the following info extract job.'
                          'so please input one now. describe what info you care about shortly: ')
            explanation = input('Please provide more explanation for the focus point (if not necessary, pls just type enter: ')
-           focus_data.append({"name": focus, "explanation": explanation,
+           focus_data.append({"focuspoint": focus, "explanation": explanation,
                               "id": pb.add('focus_points', {"focuspoint": focus, "explanation": explanation})})

        # self.focus_list = [item["focuspoint"] for item in focus_data]
@@ -188,6 +188,9 @@ url2
            if item['focus'] not in self.focus_dict:
                self.logger.warning(f"{item['focus']} not in focus_list, it's model's Hallucination")
                continue
            if not item['content']:
                continue

+           if item['content'] in link_dict:
+               self.logger.debug(f"{item['content']} in link_dict, aborting")
+               continue
@@ -5,7 +5,7 @@ source .env
set +o allexport

# start PocketBase
-/pb/pocketbase serve --http=0.0.0.0:8090 &
+/pb/pocketbase serve --http=127.0.0.1:8090 &
pocketbase_pid=$!

# start the Python task
@@ -22,29 +22,13 @@ screenshot_dir = os.path.join(project_dir, 'crawlee_storage', 'screenshots')
wiseflow_logger = get_logger('general_process', project_dir)
pb = PbTalker(wiseflow_logger)
gie = GeneralInfoExtractor(pb, wiseflow_logger)
-existing_urls = {url['url'] for url in pb.read(collection_name='articles', fields=['url']) if url['url']}
+existing_urls = {url['url'] for url in pb.read(collection_name='infos', fields=['url'])}


-async def save_to_pb(article: dict, infos: list):
+async def save_to_pb(url: str, infos: list):
    # saving to pb process
-   screenshot = article.pop('screenshot') if 'screenshot' in article else None
-   article_id = pb.add(collection_name='articles', body=article)
-   if not article_id:
-       wiseflow_logger.error('add article failed, writing to cache_file')
-       timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
-       with open(os.path.join(project_dir, f'{timestamp}_cache_article.json'), 'w', encoding='utf-8') as f:
-           json.dump(article, f, ensure_ascii=False, indent=4)
-       return
-   if screenshot:
-       file = open(screenshot, 'rb')
-       file_name = os.path.basename(screenshot)
-       message = pb.upload('articles', article_id, 'screenshot', file_name, file)
-       file.close()
-       if not message:
-           wiseflow_logger.warning(f'{article_id} upload screenshot failed, file location: {screenshot}')
-
    for info in infos:
-       info['articles'] = [article_id]
+       info['url'] = url
        _ = pb.add(collection_name='infos', body=info)
        if not _:
            wiseflow_logger.error('add info failed, writing to cache_file')
@@ -84,23 +68,25 @@ async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'routed to customer scraper for {domain}')
        try:
            article, more_urls, infos = await custom_scraper_map[domain](html, context.request.url)
-           if not article and not infos and not more_urls:
-               wiseflow_logger.warning(f'{parsed_url} handled by customer scraper, bot got nothing')
        except Exception as e:
            context.log.error(f'error occurred: {e}')
-           wiseflow_logger.warning(f'handle {parsed_url} failed by customer scraper, this url will be skipped')
-           return
+           wiseflow_logger.warning(f'handle {parsed_url} failed by customer scraper, so no info can be found')
+           article, infos, more_urls = {}, [], set()

+       if not article and not infos and not more_urls:
+           wiseflow_logger.warning(f'{parsed_url} handled by customer scraper, bot got nothing')
+           return

        #title = article.get('title', "")
        link_dict = more_urls if isinstance(more_urls, dict) else {}
        related_urls = more_urls if isinstance(more_urls, set) else set()
        if not infos and not related_urls:
-           text = article.pop('content') if 'content' in article else None
+           try:
+               text = article.get('content', '')
+           except Exception as e:
+               wiseflow_logger.warning(f'customer scraper output article is not valid dict: {e}')
+               text = ''

            if not text:
-               wiseflow_logger.warning(f'no content found in {parsed_url} by customer scraper, cannot use llm GIE')
-               author, publish_date = '', ''
+               wiseflow_logger.warning(f'no content found in {parsed_url} by customer scraper, cannot use llm GIE, aborting')
+               infos, related_urls = [], set()
            else:
                author = article.get('author', '')
                publish_date = article.get('publish_date', '')
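The contract a custom scraper must satisfy can be read off the call above: an async callable taking (html, url) and returning (article, more_urls, infos). A hypothetical example, with every name in it purely illustrative:

```python
# Hypothetical custom scraper; contract read off the call above:
#   async callable(html, url) -> (article: dict, more_urls: dict | set, infos: list)
from urllib.parse import urljoin


async def example_com_scraper(html: str, url: str):
    # parse `html` however you like (regex, BeautifulSoup, ...)
    article = {'content': '...page text...', 'author': '', 'publish_date': ''}
    # a dict is treated as link_dict (text -> url); a set as plain related urls
    more_urls = {'next page': urljoin(url, '/page/2')}
    infos = []  # leave empty to let the llm extraction pass fill infos
    return article, more_urls, infos


# registered by domain, matching the custom_scraper_map lookup above
custom_scraper_map = {'example.com': example_com_scraper}
```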
@@ -109,28 +95,28 @@ async def request_handler(context: PlaywrightCrawlingContext) -> None:
                infos, related_urls, author, publish_date = await gie(text, link_dict, context.request.url, author, publish_date)
            except Exception as e:
                wiseflow_logger.error(f'gie error occurred in processing: {e}')
                infos = []
                author, publish_date = '', ''
        else:
            author = article.get('author', '')
            publish_date = article.get('publish_date', '')
+           infos, related_urls = [], set()
    else:
        # Extract data from the page.
        # future work: try to use a visual-llm do all the job...
        text = await context.page.inner_text('body')
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a', href=True)
        parsed_url = urlparse(context.request.url)
        domain = parsed_url.netloc
        base_url = f"{parsed_url.scheme}://{domain}"

        link_dict = {}
        for a in links:
            new_url = a.get('href')
+           if new_url.startswith('javascript:') or new_url.startswith('#') or new_url.startswith('mailto:'):
+               continue
+           if new_url in [context.request.url, base_url]:
+               continue
+           if new_url in existing_urls:
+               continue
            t = a.text.strip()
-           if new_url and t and new_url != base_url and new_url not in existing_urls:
+           if new_url and t:
                link_dict[t] = urljoin(base_url, new_url)
                existing_urls.add(new_url)

        publish_date = soup.find('div', class_='date').get_text(strip=True) if soup.find('div', class_='date') else None
        if publish_date:
            publish_date = extract_and_convert_dates(publish_date)
@@ -139,27 +125,19 @@ async def request_handler(context: PlaywrightCrawlingContext) -> None:
        author = soup.find('div', class_='source').get_text(strip=True) if soup.find('div', class_='source') else None
        # get infos by llm
        infos, related_urls, author, publish_date = await gie(text, link_dict, base_url, author, publish_date)
-   # title = await context.page.title()

-   screenshot_file_name = f"{hashlib.sha256(context.request.url.encode()).hexdigest()}.png"
-   await context.page.screenshot(path=os.path.join(screenshot_dir, screenshot_file_name), full_page=True)
-   wiseflow_logger.debug(f'screenshot saved to {screenshot_file_name}')

    if infos:
-       article = {
-           'url': context.request.url,
-           # 'title': title,
-           'author': author,
-           'publish_date': publish_date,
-           'screenshot': os.path.join(screenshot_dir, screenshot_file_name),
-           'tags': [info['tag'] for info in infos]
-       }
-       await save_to_pb(article, infos)
+       await save_to_pb(context.request.url, infos)

    if related_urls:
        await context.add_requests(list(related_urls))

    # todo: use llm to determine next action
+   """
+   screenshot_file_name = f"{hashlib.sha256(context.request.url.encode()).hexdigest()}.png"
+   await context.page.screenshot(path=os.path.join(screenshot_dir, screenshot_file_name), full_page=True)
+   wiseflow_logger.debug(f'screenshot saved to {screenshot_file_name}')
+   """


if __name__ == '__main__':
    sites = pb.read('sites', filter='activated=True')
@@ -7,7 +7,7 @@ set +o allexport
if ! pgrep -x "pocketbase" > /dev/null; then
    if ! netstat -tuln | grep ":8090" > /dev/null && ! lsof -i :8090 > /dev/null; then
        echo "Starting PocketBase..."
-       ../pb/pocketbase serve --http=0.0.0.0:8090 &
+       ../pb/pocketbase serve --http=127.0.0.1:8090 &
    else
        echo "Port 8090 is already in use."
    fi
core/run_task.sh (new executable file, 18 lines)

@@ -0,0 +1,18 @@
#!/bin/bash

set -o allexport
source .env
set +o allexport

if ! pgrep -x "pocketbase" > /dev/null; then
    if ! netstat -tuln | grep ":8090" > /dev/null && ! lsof -i :8090 > /dev/null; then
        echo "Starting PocketBase..."
        ../pb/pocketbase serve --http=127.0.0.1:8090 &
    else
        echo "Port 8090 is already in use."
    fi
else
    echo "PocketBase is already running."
fi

python tasks.py
@@ -6,20 +6,21 @@ counter = 1

async def schedule_pipeline(interval):
    global counter
-   wiseflow_logger.info(f'task execute loop {counter}')
-   sites = pb.read('sites', filter='activated=True')
-   todo_urls = set()
-   for site in sites:
-       if not site['per_hours'] or not site['url']:
-           continue
-       if counter % site['per_hours'] == 0:
-           wiseflow_logger.info(f"applying {site['url']}")
-           todo_urls.add(site['url'].rstrip('/'))
+   while True:
+       wiseflow_logger.info(f'task execute loop {counter}')
+       sites = pb.read('sites', filter='activated=True')
+       todo_urls = set()
+       for site in sites:
+           if not site['per_hours'] or not site['url']:
+               continue
+           if counter % site['per_hours'] == 0:
+               wiseflow_logger.info(f"applying {site['url']}")
+               todo_urls.add(site['url'].rstrip('/'))

-   counter += 1
-   await crawler.run(list[todo_urls])
-   wiseflow_logger.info(f'task execute loop finished, work after {interval} seconds')
-   await asyncio.sleep(interval)
+       counter += 1
+       await crawler.run(list(todo_urls))
+       wiseflow_logger.info(f'task execute loop finished, work after {interval} seconds')
+       await asyncio.sleep(interval)


async def main():
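One fix in this hunk is easy to miss: `list[todo_urls]` subscripts the built-in `list` type and yields a `types.GenericAlias`, not a list of URLs, whereas `list(todo_urls)` actually converts the set. A two-line check:

```python
todo_urls = {'https://example.com'}

print(list[todo_urls])  # list[{'https://example.com'}]: a types.GenericAlias, not data
print(list(todo_urls))  # ['https://example.com']
```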
@@ -4,6 +4,6 @@ download https://github.com/pocketbase/pocketbase/releases/download/v0.23.4/
cd pb
xattr -d com.apple.quarantine pocketbase # for Macos
./pocketbase migrate up # for first run
-./pocketbase --dev admin create test@example.com 123467890 # If you don't have an initial account, please use this command to create it
+./pocketbase --dev admin create test@example.com 1234567890 # If you don't have an initial account, please use this command to create it
./pocketbase serve
```
Deleted file (the migration that created the "articles" collection):

@@ -1,121 +0,0 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
  const collection = new Collection({
    "createRule": null,
    "deleteRule": null,
    "fields": [
      {
        "autogeneratePattern": "[a-z0-9]{15}",
        "hidden": false,
        "id": "text3208210256",
        "max": 15,
        "min": 15,
        "name": "id",
        "pattern": "^[a-z0-9]+$",
        "presentable": false,
        "primaryKey": true,
        "required": true,
        "system": true,
        "type": "text"
      },
      {
        "exceptDomains": null,
        "hidden": false,
        "id": "url4101391790",
        "name": "url",
        "onlyDomains": null,
        "presentable": false,
        "required": false,
        "system": false,
        "type": "url"
      },
      {
        "autogeneratePattern": "",
        "hidden": false,
        "id": "text724990059",
        "max": 0,
        "min": 0,
        "name": "title",
        "pattern": "",
        "presentable": false,
        "primaryKey": false,
        "required": false,
        "system": false,
        "type": "text"
      },
      {
        "autogeneratePattern": "",
        "hidden": false,
        "id": "text3182418120",
        "max": 0,
        "min": 0,
        "name": "author",
        "pattern": "",
        "presentable": false,
        "primaryKey": false,
        "required": false,
        "system": false,
        "type": "text"
      },
      {
        "hidden": false,
        "id": "date2025149370",
        "max": "",
        "min": "",
        "name": "publish_date",
        "presentable": false,
        "required": false,
        "system": false,
        "type": "date"
      },
      {
        "hidden": false,
        "id": "file1486429761",
        "maxSelect": 1,
        "maxSize": 0,
        "mimeTypes": [],
        "name": "screenshot",
        "presentable": false,
        "protected": false,
        "required": false,
        "system": false,
        "thumbs": [],
        "type": "file"
      },
      {
        "hidden": false,
        "id": "autodate2990389176",
        "name": "created",
        "onCreate": true,
        "onUpdate": false,
        "presentable": false,
        "system": false,
        "type": "autodate"
      },
      {
        "hidden": false,
        "id": "autodate3332085495",
        "name": "updated",
        "onCreate": true,
        "onUpdate": true,
        "presentable": false,
        "system": false,
        "type": "autodate"
      }
    ],
    "id": "pbc_4287850865",
    "indexes": [],
    "listRule": null,
    "name": "articles",
    "system": false,
    "type": "base",
    "updateRule": null,
    "viewRule": null
  });

  return app.save(collection);
}, (app) => {
  const collection = app.findCollectionByNameOrId("pbc_4287850865");

  return app.delete(collection);
})
Deleted file (the migration that added the "tags" relation to the articles collection):

@@ -1,28 +0,0 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
  const collection = app.findCollectionByNameOrId("pbc_4287850865")

  // add field
  collection.fields.addAt(6, new Field({
    "cascadeDelete": false,
    "collectionId": "pbc_3385864241",
    "hidden": false,
    "id": "relation1874629670",
    "maxSelect": 999,
    "minSelect": 0,
    "name": "tags",
    "presentable": false,
    "required": false,
    "system": false,
    "type": "relation"
  }))

  return app.save(collection)
}, (app) => {
  const collection = app.findCollectionByNameOrId("pbc_4287850865")

  // remove field
  collection.fields.removeById("relation1874629670")

  return app.save(collection)
})
@@ -45,19 +45,6 @@ migrate((app) => {
      "system": false,
      "type": "relation"
    },
-   {
-     "cascadeDelete": false,
-     "collectionId": "pbc_4287850865",
-     "hidden": false,
-     "id": "relation3218944360",
-     "maxSelect": 999,
-     "minSelect": 0,
-     "name": "articles",
-     "presentable": false,
-     "required": false,
-     "system": false,
-     "type": "relation"
-   },
    {
      "hidden": false,
      "id": "file3291445124",
pb/pb_migrations/1733753289_updated_infos.js (new file, 39 lines)

@@ -0,0 +1,39 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
  const collection = app.findCollectionByNameOrId("pbc_629947526")

  // add field
  collection.fields.addAt(4, new Field({
    "exceptDomains": [],
    "hidden": false,
    "id": "url4101391790",
    "name": "url",
    "onlyDomains": [],
    "presentable": false,
    "required": true,
    "system": false,
    "type": "url"
  }))

  // add field
  collection.fields.addAt(5, new Field({
    "hidden": false,
    "id": "file1486429761",
    "maxSelect": 1,
    "maxSize": 0,
    "mimeTypes": [],
    "name": "screenshot",
    "presentable": false,
    "protected": false,
    "required": false,
    "system": false,
    "thumbs": [],
    "type": "file"
  }))

  return app.save(collection)
}, (app) => {
  const collection = app.findCollectionByNameOrId("pbc_629947526")

  return app.save(collection)
})
pb/pb_migrations/1733753354_updated_focus_points.js (new file, 42 lines)

@@ -0,0 +1,42 @@
/// <reference path="../pb_data/types.d.ts" />
migrate((app) => {
  const collection = app.findCollectionByNameOrId("pbc_3385864241")

  // update field
  collection.fields.addAt(1, new Field({
    "autogeneratePattern": "",
    "hidden": false,
    "id": "text2695655862",
    "max": 0,
    "min": 0,
    "name": "focuspoint",
    "pattern": "",
    "presentable": true,
    "primaryKey": false,
    "required": true,
    "system": false,
    "type": "text"
  }))

  return app.save(collection)
}, (app) => {
  const collection = app.findCollectionByNameOrId("pbc_3385864241")

  // update field
  collection.fields.addAt(1, new Field({
    "autogeneratePattern": "",
    "hidden": false,
    "id": "text2695655862",
    "max": 0,
    "min": 0,
    "name": "focuspoint",
    "pattern": "",
    "presentable": false,
    "primaryKey": false,
    "required": true,
    "system": false,
    "type": "text"
  }))

  return app.save(collection)
})