Institutional Repository of the Institute of Software, Chinese Academy of Sciences
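Subject: Computer Science and Technology ; Computer Science and Technology::Artificial Intelligence ; Computer Science and Technology::Artificial Intelligence::Natural Language Processing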
Title:
A Deep Learning-Based Information Extraction System for Listed Company Announcements (一种基于深度学习的上市公司公告信息抽取系统)
Author: 王文惠
Issued Date: 2019-05-17
Supervisor: 杨超
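Major: Computer Software and Theory
Degree Grantor: Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of Degree Grantor: Beijing
Degree Level: Master's
Keyword: information extraction system ; data back-labeling ; named entity recognition ; entity relation extraction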
Abstract:

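Announcements of listed companies mainly disclose a company's development status or major events that affect investors' interests, and mining the key information in these announcements is a daily necessity for professional institutional researchers. With the development of deep learning, automatic information extraction has gradually been applied in many scenarios and has greatly improved work efficiency. This thesis designs and develops a deep-learning-based information extraction system for the specific domain of listed company announcements that can extract document-level structured data. The system consists of a model training part and a prediction part, and its core deep learning techniques are sentence-level named entity recognition and entity relation extraction.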

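In the model training part, this thesis applies a data back-labeling method to the crawled announcement PDF files and the document-level structured data to generate reasonably accurate sentence-level training corpora. It also improves on the classic end-to-end relation extraction model by proposing an entity relation extraction model based on BLSTM_ATT and piecewise pooling, which captures long-distance dependencies and obtains finer-grained features. Experimental results show that the model predicts better than the classic model on the company announcement corpus.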
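
As a rough illustration of the back-labeling idea, the sketch below projects the fields of a document-level record onto individual sentences by exact string matching and emits character-level BIO tags. The field names (`SHAREHOLDER`, `AMOUNT`), the sentence splitter, and the exact-match rule are assumptions made for this example; the abstract does not describe the thesis's actual matching rules.

```python
import re

def split_sentences(text):
    """Split text after sentence-ending punctuation (a simplifying assumption)."""
    return [s.strip() for s in re.findall(r"[^。！？!?；;]+[。！？!?；;]?", text) if s.strip()]

def back_label(text, record):
    """Project document-level field values onto sentences as character-level BIO tags.

    `record` maps a hypothetical field name to its string value.
    Only sentences that contain at least one field value are returned.
    """
    labeled = []
    for sent in split_sentences(text):
        tags = ["O"] * len(sent)
        for field, value in record.items():
            if not value:
                continue  # nothing to match for an empty field
            for m in re.finditer(re.escape(value), sent):
                start, end = m.span()
                if any(t != "O" for t in tags[start:end]):
                    continue  # keep the earlier label when two matches overlap
                tags[start] = "B-" + field
                for i in range(start + 1, end):
                    tags[i] = "I-" + field
        if any(t != "O" for t in tags):
            labeled.append((sent, tags))
    return labeled

# Usage with a made-up announcement fragment and record.
doc = "股东张三于2019年1月减持100万股。特此公告。"
rec = {"SHAREHOLDER": "张三", "AMOUNT": "100万股"}
for sentence, tags in back_label(doc, rec):
    print(sentence, tags)
```

Exact matching is the simplest possible alignment; a real pipeline would likely also need to normalize numbers, dates and company name variants before matching.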

 

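In the prediction part, the structured output for each announcement comes from both unstructured text and tables. For unstructured text, named entity recognition, anaphora resolution and entity relation extraction are performed sentence by sentence; a document-level information fusion module then fills the entities found in the context of key sentences into structured templates to obtain document-level structured data. For tables, information is extracted with regular expressions.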

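Finally, the system is evaluated experimentally on two types of announcements: shareholding increases and decreases, and major contracts. The results show that the system achieves high accuracy and running speed, and that it is extensible and can be applied to many types of announcements.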

 
English Abstract:

Announcements of listed companies mainly disclose a company's development or major events that affect investors' interests, and professional researchers need to extract the important information from them every day. With the development of deep learning, automatic information extraction has gradually been applied to many scenarios, greatly improving work efficiency. We develop a deep-learning-based information extraction system for the specific domain of listed company announcements, which can extract document-level structured data. The system consists of a model training phase and a prediction phase. Its core deep learning techniques are sentence-level named entity recognition and entity relation extraction.

 

In the model training phase, we design a data back-labeling method that generates a sentence-level training corpus from the text of crawled announcement PDFs and document-level structured data. We propose an entity relation extraction model based on BLSTM_ATT and piecewise max-pooling, which can capture long-distance dependencies and obtain finer-grained features. Experimental results show that this model outperforms the classic model on the company announcement corpus.
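
To make the piecewise max-pooling step concrete, here is a minimal sketch in plain Python. It assumes the common convention in which the two entity positions split the token sequence into three segments, each segment is max-pooled per feature dimension, and the three pooled vectors are concatenated. The function name, the segment boundaries and the zero vector for empty segments are assumptions for illustration; the abstract does not specify how the thesis segments the BLSTM_ATT output.

```python
def piecewise_max_pool(features, e1_pos, e2_pos):
    """Max-pool three segments of a token sequence and concatenate the results.

    features: list of per-token feature vectors (lists of floats, equal length).
    e1_pos, e2_pos: token indices of the two entities (order does not matter).
    Segments (an assumed convention): tokens up to and including the first
    entity, tokens after it up to and including the second entity, and the rest.
    An empty segment contributes a zero vector.
    """
    first, second = sorted((e1_pos, e2_pos))
    dim = len(features[0])
    segments = [
        features[: first + 1],
        features[first + 1 : second + 1],
        features[second + 1 :],
    ]
    pooled = []
    for seg in segments:
        if seg:
            # zip(*seg) iterates over feature dimensions across the segment's tokens
            pooled.extend(max(column) for column in zip(*seg))
        else:
            pooled.extend([0.0] * dim)
    return pooled

# Usage: five tokens with 2-dimensional features, entities at tokens 1 and 3.
tokens = [[1.0, 0.0], [0.5, 2.0], [3.0, -1.0], [0.0, 0.0], [-2.0, 4.0]]
print(piecewise_max_pool(tokens, 1, 3))
```

Compared with a single max over the whole sentence, pooling per segment keeps some information about where a feature fired relative to the two entities, which is one plausible reading of the "finer-grained features" mentioned above.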

 

In the prediction phase, the structured data for each announcement comes from both unstructured text and tables. For unstructured text, named entity recognition, anaphora resolution and entity relation extraction are carried out at the sentence level. Anaphora resolution links an entity's full name with its abbreviation. Entities in the context of key sentences are then filled into predefined templates by a document-level information integration module. For tables, we use regular expressions to extract the key information.
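
The table branch can be illustrated with a small regular-expression sketch that turns the cells of one table row into a template-style record. The field names (`date`, `shares`, `price`), the patterns and the unit handling are hypothetical, chosen only for this example; real announcement tables vary and the thesis's own patterns are not given in this abstract.

```python
import re

# Hypothetical patterns for the cells of a shareholding-change table.
PATTERNS = {
    "date": re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?"),
    "shares": re.compile(r"([\d,]+(?:\.\d+)?)\s*(万)?股"),
    "price": re.compile(r"([\d,]+(?:\.\d+)?)\s*元"),
}

def parse_row(cells):
    """Fill a flat record from the text cells of one table row.

    Each cell is tried against the patterns in turn; the first cell that
    matches a field wins, and cells that match nothing are ignored.
    """
    record = {}
    for cell in cells:
        m = PATTERNS["date"].search(cell)
        if m and "date" not in record:
            year, month, day = (int(g) for g in m.groups())
            record["date"] = f"{year:04d}-{month:02d}-{day:02d}"
            continue
        m = PATTERNS["shares"].search(cell)
        if m and "shares" not in record:
            number = float(m.group(1).replace(",", ""))
            # "万" means ten thousand; convert to a plain share count
            record["shares"] = int(round(number * 10000)) if m.group(2) else int(round(number))
            continue
        m = PATTERNS["price"].search(cell)
        if m and "price" not in record:
            record["price"] = float(m.group(1).replace(",", ""))
    return record

# Usage with a made-up table row.
print(parse_row(["张三", "2019年1月15日", "120.5万股", "12.34元"]))
```

Regular expressions fit tables well because cell contents are short and follow a few fixed formats, whereas free text needs the learned models described above.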

 

Finally, we test the announcement information extraction system on two types of announcements: shareholding increases and decreases, and major contracts. The experimental results indicate that the system achieves high accuracy and running speed. The system is also extensible and can be applied to many types of announcements.

Language: Chinese
Content Type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/19171
Appears in Collections: Parallel Computing Laboratory (并行计算实验室) - Theses

Files in This Item:
File Name/ File Size Content Type Version Access License
一种基于深度学习的上市公司公告信息抽取系统.pdf (2013KB)  Thesis  --  Restricted access  Contact to obtain the full text

description.institution: Institute of Software, Chinese Academy of Sciences

Recommended Citation:
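王文惠. A Deep Learning-Based Information Extraction System for Listed Company Announcements (一种基于深度学习的上市公司公告信息抽取系统) [D]. Beijing. Graduate University of the Chinese Academy of Sciences. 2019-05-17.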
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 

Copyright © 2007-2020 Institute of Software, Chinese Academy of Sciences