栩栩若生
```mermaid
graph LR
  subgraph sample
    subgraph raw
      s1a_raw{{sample_1-a}}
      s1b_raw{{sample_1-b}}
      s2a_raw{{sample_2-a}}
      s2b_raw{{sample_2-b}}
      s3_raw{{sample_3}}
    end
    subgraph combine
      s1_combine[sample_1]
      s2_combine[sample_2]
      s3_combine[sample_3]
    end
    subgraph fixed
      s1_fixed(sample_1)
      s2_fixed(sample_2)
      s3_fixed(sample_3)
    end
    s1a_raw -- replenish --> s1_combine
    s1b_raw --> s1_combine
    s2a_raw -- replenish --> s2_combine
    s2b_raw -- replenish --> s2_combine
    s3_raw -- clean up --> s3_combine
    s1_combine -- fix --> s1_fixed
    s2_combine -- fix --> s2_fixed
    s3_combine -- fix --> s3_fixed
  end
  subgraph crawler
    source_1([108shu.com]) --> s1a_raw
    source_2([aidusk.com]) --> s1b_raw
    source_3([ixsw.la]) --> s1b_raw
    source_4([m.wxsy.net]) --> s2a_raw
    source_5([wxsy.net]) --> s2a_raw
    source_6([xswang.com]) --> s2b_raw
    source_7([zhihu.com]) --> s3_raw
  end
  subgraph release
    sa{{sample_a}}
    sb{{sample_b}}
    s1_fixed --> sa
    s2_fixed -- replenish --> sa
    s2_fixed -. restore .-> sb
    s3_fixed -- replenish --> sb
  end
```
Crawler data sources
Crawler sample analysis
The raw crawl produced five `raw` datasets, which fall into three groups:
- sample_1-a
- sample_1-b
- sample_2-a
- sample_2-b
- sample_3
A simple merge of these yields three initial `combine` samples:
- sample_1
- sample_2
- sample_3
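
The "replenish" step of this merge can be sketched as follows. This is only an illustration, not the repository's actual code: it assumes each raw dataset is held as a mapping from chapter index to chapter text, and the function name `replenish` is hypothetical.

```python
def replenish(primary: dict[int, str], secondary: dict[int, str]) -> dict[int, str]:
    """Fill chapters that are missing or blank in `primary` from `secondary`.

    Assumption: both datasets use the same chapter numbering.
    """
    merged = dict(primary)
    for idx, text in secondary.items():
        # Only take the secondary text when the primary has nothing usable.
        if not merged.get(idx, "").strip():
            merged[idx] = text
    # Return the chapters in reading order.
    return dict(sorted(merged.items()))
```

The primary source always wins where it has text, so the secondary source only closes gaps and never overwrites existing chapters.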
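The three samples are then cross-checked against one another and merged, fixing grammar and vocabulary errors, censored or blocked words, and similar defects. This yields the final three `fixed` samples, which are in turn merged into two `release` samples:
- sample_a
- sample_b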
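
One way to picture the cross-check is a per-line majority vote across the three samples. The sketch below is an assumption about how such a comparison could be automated, not a description of how the fixes were actually made. It assumes the samples are already aligned line by line, and `cross_check` is a hypothetical name.

```python
from collections import Counter


def cross_check(samples: list[list[str]]) -> tuple[list[str], list[int]]:
    """Majority-vote each aligned line across the samples.

    Returns the merged lines plus the indices where no variant has a
    strict majority, so those lines can be reviewed by hand.
    Assumption: samples are aligned; zip() stops at the shortest one.
    """
    merged: list[str] = []
    disputed: list[int] = []
    for i, variants in enumerate(zip(*samples)):
        text, votes = Counter(variants).most_common(1)[0]
        merged.append(text)
        if votes <= len(variants) // 2:
            disputed.append(i)
    return merged, disputed
```

Lines reported in `disputed` still carry one of the candidate variants in `merged`, so they should be treated as placeholders until checked manually.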
The two samples differ only slightly in how the text is separated. After fixing and merging them, the `RC` sample is obtained.
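
To locate such separation differences before merging, the two release texts can be compared line by line. The following is a minimal sketch using Python's standard `difflib`. The function name `differing_spans` is hypothetical, and the sketch assumes both samples are loaded as plain strings.

```python
import difflib


def differing_spans(a: str, b: str) -> list[tuple[str, list[str], list[str]]]:
    """List the line ranges where two texts differ.

    Each entry is (tag, lines_from_a, lines_from_b), where tag is one of
    'replace', 'delete' or 'insert' as reported by SequenceMatcher.
    """
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    return [
        (tag, a_lines[i1:i2], b_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An extra blank line in one sample shows up as a `delete` or `insert` entry containing an empty string, which makes pure separator differences easy to tell apart from wording changes.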