
Common NLP Datasets and Corpora

Date: 2024-01-22 16:21:11


Datasets

1. Yelp reviews

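Yelp is comparable to Dianping (大众点评) in China. For an introduction to the dataset, see reference [4].

Figure: a review on the Yelp website. The number of stars is the rating.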

2. Yahoo Answers

A topic classification task with 10 classes:

Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, and Politics & Government.

Each document includes the question title, the question context, and the best answer. There are 140,000 training samples and 5,000 testing samples.
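As a rough illustration of how one such sample might be assembled, the sketch below joins the three text fields into a single document and maps a class index to its topic name. The names `YAHOO_TOPICS`, `build_document`, and `topic_name`, the field order, and the 0-based label ordering are all assumptions made for this example, not the dataset's official format.

```python
# The ten topic names of the task; the 0-based ordering is an assumption.
YAHOO_TOPICS = [
    "Society & Culture",
    "Science & Mathematics",
    "Health",
    "Education & Reference",
    "Computers & Internet",
    "Sports",
    "Business & Finance",
    "Entertainment & Music",
    "Family & Relationships",
    "Politics & Government",
]


def build_document(title: str, context: str, best_answer: str) -> str:
    """Join the three fields into one text, skipping empty fields."""
    parts = [title, context, best_answer]
    return " ".join(p.strip() for p in parts if p and p.strip())


def topic_name(label: int) -> str:
    """Map a 0-based class index (hypothetical encoding) to its topic name."""
    return YAHOO_TOPICS[label]
```

A question with no context simply yields the title followed by the answer, so the classifier always receives one flat string per sample.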

Corpora

1. Sogou News Corpus

The Sogou news corpus (搜狗新闻语料库) contains a total of 2,909,551 news articles across various topic channels.

Reference [1] describes and uses it as follows:

There are a large number categories but most of them contain only few articles. We choose 5 categories – “sports”, “finance”, “entertainment”, “automobile” and “technology”. The number of training samples selected for each class is 90,000 and testing 12,000.
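The selection described in the quote can be sketched as a filter followed by a fixed-size draw per class. This is only a minimal sketch: the `(category, text)` input shape, the function name `select_per_class`, and the random shuffle with a seed are assumptions, since the quote does not say how the articles were actually picked.

```python
import random
from collections import defaultdict

# The five categories named in the quote.
CATEGORIES = ["sports", "finance", "entertainment", "automobile", "technology"]


def select_per_class(articles, categories, n_train, n_test, seed=0):
    """Keep only the chosen categories and draw a fixed number of
    training and testing articles from each one.

    `articles` is an iterable of (category, text) pairs (assumed shape).
    """
    by_category = defaultdict(list)
    for category, text in articles:
        if category in categories:
            by_category[category].append(text)

    rng = random.Random(seed)
    train, test = [], []
    for category in categories:
        texts = by_category[category]
        if len(texts) < n_train + n_test:
            raise ValueError(f"not enough articles in {category!r}")
        rng.shuffle(texts)
        # Disjoint slices, so no article lands in both splits.
        train += [(category, t) for t in texts[:n_train]]
        test += [(category, t) for t in texts[n_train:n_train + n_test]]
    return train, test
```

With the figures from the quote, the call would be `select_per_class(articles, CATEGORIES, 90_000, 12_000)`, giving 450,000 training and 60,000 testing articles in total.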

2. YFCC 100M

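A multimedia dataset from Yahoo Labs; its uses are not limited to NLP. The link is in reference [3].

It contains roughly 100 million images and 1 million videos, each with a title, a caption (description), and tags.

Its annotations are multi-label: a puppy, for example, may be tagged animal / puppy / pet / lion dog (狮子狗), and so on.

The FastText paper uses it for tag prediction.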
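For tag prediction, fastText's supervised mode reads one example per line, with every label prefixed by `__label__` and followed by the text. The sketch below formats one multi-tagged item that way. The helper name `to_fasttext_line` and the tag normalization (lowercasing, spaces to underscores) are assumptions for illustration, not the preprocessing used in the paper.

```python
def to_fasttext_line(tags, title, caption=""):
    """Format one multi-label example as a fastText supervised-mode line."""
    # One __label__ prefix per tag; a single item may carry several tags.
    labels = " ".join(
        "__label__" + t.strip().lower().replace(" ", "_") for t in tags
    )
    # Collapse whitespace so the example stays on a single line.
    text = " ".join(f"{title} {caption}".split())
    return f"{labels} {text}"


line = to_fasttext_line(["animal", "puppy", "pet"], "My dog", "playing outside")
print(line)
```

Writing one such line per image or video gives a text file that can be passed to fastText's supervised training.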

References

[1] Character-level Convolutional Networks for Text Classification
[2] Sogou Labs (搜狗实验室)
[3] YFCC 100M
[4] Yelp Dataset Challenge official site
