Institutional Repository of the Institute of Software, Chinese Academy of Sciences
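Subject: Computer Software::Software Engineering
Title:
Detecting and Repairing Cell Array-Based Computational Semantic Errors in Spreadsheets (original title: 基于单元阵列的电子表格计算语义错误检测与修复)
Author: Wensheng Dou (窦文生)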
Issued Date: 2015-05-26
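Supervisor: Jun Wei (魏峻)
Major: Computer Software and Theory
Degree Grantor: University of Chinese Academy of Sciences
Place of Degree Grantor: Beijing
Degree Level: Doctoral
Keyword: spreadsheet ; cell array ; computational semantic error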
Abstract:
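    Spreadsheets are among the most widely used end-user development tools and are applied in many domains, including data recording, finance, and education. Both the data and the formulas in a spreadsheet carry computational semantics. When the data or formulas fail to reflect the end user's intended semantics, a computational semantic error arises. Such errors easily lead to inconsistent data and thus degrade spreadsheet quality.
    General computational semantic errors in spreadsheets are hard to detect and repair automatically, because deciding whether data and formulas are correct requires either human judgment by the end user or strict checking against a specification, and spreadsheets carry no explicit specification. We observe that certain consecutive cells in a row or column of a spreadsheet often share the same computational semantics. We call such consecutive cells a cell array. Cell arrays may contain computational semantic errors such as missing formulas, inconsistent formulas, and inconsistent data. Based on this new observation, we design a series of new methods to automatically detect and repair cell arrays and their related computational semantic errors.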
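    This dissertation carries out research in the following three areas:
    (1) We conduct a series of empirical studies on the EUSES and Enron spreadsheet corpora to analyze how cell arrays are actually used in spreadsheets. The results show that cell arrays are very common: 68.6% of the spreadsheets that contain formulas contain cell arrays. In 83.1% of cell arrays, each cell references other cells in its own row/column as inputs (we call this isomorphic data dependence). In terms of position, cell arrays very rarely intersect. Based on these results, we study mechanisms for detecting and repairing cell arrays with isomorphic and non-isomorphic data dependence, together with their related computational semantic errors.
    (2) For cell arrays with isomorphic data dependence, we propose SameCheck, a method for detecting and repairing such cell arrays. It comprises a set of heuristic rules, based on the characteristics of isomorphic data dependence, that identify cell arrays in a spreadsheet; and, for cell arrays that contain computational semantic errors, an improved program synthesis mechanism that generates the cell array's computational semantics, which is then used to repair the related errors.
    (3) For cell arrays with non-isomorphic data dependence, we propose ShareCheck, a detection and repair method based on data dependence similarity. It comprises a cell array identification algorithm based on the similarity of cells' data dependences, and a refinement algorithm that uses properties such as the non-intersection of cell arrays to filter out falsely reported cell arrays.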
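    We run a series of experiments with SameCheck and ShareCheck on the EUSES corpus and on spreadsheets in actual use at the Institute of Software, Chinese Academy of Sciences. The results show that (1) cell array-related computational semantic errors are very common and do degrade spreadsheet quality; (2) SameCheck and ShareCheck can effectively detect and repair such errors and give end users useful assistance; and (3) while detecting cell arrays with non-isomorphic data dependence, ShareCheck can also effectively eliminate cell arrays with isomorphic data dependence that SameCheck reports falsely.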
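
    To make the notions of a cell array and of missing or inconsistent formulas concrete, the following is a minimal Python sketch. It is not the SameCheck algorithm: it assumes the cell array is already known, that formulas use plain A1-style references (no sheet names, no `$` anchors, no function names that end in digits), and it swaps in a simple majority vote over relative reference patterns in place of the heuristic rules and program synthesis described above. All function names are hypothetical.

```python
import re
from collections import Counter

# Matches simple A1-style references such as B2 or AA10 (assumption: no sheet
# names, no $ anchors, and no function names that end in digits).
REF = re.compile(r"\b([A-Z]+)([0-9]+)\b")

def col_to_index(col):
    """Convert a column label (A, B, ..., Z, AA, ...) to a 1-based index."""
    index = 0
    for ch in col:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

def relative_form(formula, row, col):
    """Rewrite every reference as an offset from the cell holding the formula."""
    def repl(match):
        d_row = int(match.group(2)) - row
        d_col = col_to_index(match.group(1)) - col
        return f"R[{d_row}]C[{d_col}]"
    return REF.sub(repl, formula)

def check_cell_array(cells):
    """cells: list of (row, col, content) for consecutive cells of one array.
    Returns (row, col, kind) for every cell that deviates from the majority."""
    forms = {}
    for row, col, content in cells:
        if isinstance(content, str) and content.startswith("="):
            forms[(row, col)] = relative_form(content, row, col)
    if not forms:
        return []
    # Hypothetical stand-in for the intended semantics: the most common pattern.
    majority, _ = Counter(forms.values()).most_common(1)[0]
    issues = []
    for row, col, _content in cells:
        form = forms.get((row, col))
        if form is None:
            issues.append((row, col, "missing formula"))
        elif form != majority:
            issues.append((row, col, "inconsistent formula"))
    return issues

# Column D (index 4), rows 2 to 5: each cell should multiply B and C of its own row.
cells = [
    (2, 4, "=B2*C2"),
    (3, 4, "=B3*C3"),
    (4, 4, 120),        # a plain value where a formula is expected
    (5, 4, "=B5+C5"),   # a different operator
]
print(check_cell_array(cells))
# → [(4, 4, 'missing formula'), (5, 4, 'inconsistent formula')]
```

    In this example every formula references only cells in its own row, which is one reading of the isomorphic data dependence described above. A majority vote fails when most cells in the array are wrong, which is one reason a synthesis-based recovery of the intended semantics may be preferable.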

English Abstract:
    Spreadsheets are among the most widely used end-user development tools. They are used for data storage and tracking, financial reporting, education, and so on. Spreadsheet cells that contain data or formulas often have certain computational semantics. When the data or formulas in cells fail to express the end user's intended semantics, the cells suffer from computational semantic errors. Errors of this kind can cause data inconsistency, which degrades the quality of spreadsheets.
    It is hard to identify which cells contain computational semantic errors in spreadsheets, because doing so requires knowledge of the intended semantics, which in turn often requires human judgment or specifications. We observe that spreadsheet cells whose computation is subject to the same semantics are often clustered in a row or column (we call such a cluster a cell array). Cell arrays may suffer from several kinds of computational semantic errors, such as missing formula errors, inconsistent formula errors, and inconsistent data errors. Based on this observation, we propose two novel approaches to automatically detect and repair cell array-based computational semantic errors in spreadsheets.
    In this dissertation, we present three pieces of work related to cell array-based computational semantic errors.
    (1) We conduct the first empirical study of the key properties of cell arrays in real-life spreadsheets, using the EUSES and Enron corpora. The study yields several findings: a) Cell arrays are widely used in real-life spreadsheets, and about 68.6% of spreadsheets with formulas use cell arrays. b) About 83.1% of cell arrays obey isomorphic data dependence, in which each cell references cells in its own row/column as inputs. c) Cell arrays in spreadsheets rarely overlap. Based on this empirical study, we propose two novel approaches to detect and repair cell array-based computational semantic errors.
    (2) We propose SameCheck, which detects and repairs cell arrays with isomorphic data dependence. SameCheck assumes that each cell in a cell array references other cells in its own row/column as inputs, and uses a novel algorithm to detect cell arrays of this kind. For smelly cell arrays, SameCheck applies an improved component-based program synthesis to recover the cell arrays' intended computational semantics.
    (3) We propose ShareCheck, which detects and repairs cell arrays with non-isomorphic data dependence. ShareCheck identifies cell arrays through a novel measure of data dependence similarity between cells, and then leverages properties of cell arrays (such as non-overlap) to filter out wrongly identified cell arrays.
    The evaluation on the EUSES corpus and case studies on real-life spreadsheets from the Institute of Software, Chinese Academy of Sciences show that (1) cell array-based computational semantic errors are very common and have degraded the quality of existing spreadsheets, (2) SameCheck and ShareCheck can detect and repair cell array-based computational semantic errors precisely, and (3) ShareCheck can detect cell arrays with non-isomorphic data dependence and can also filter out cell arrays with isomorphic data dependence that SameCheck identifies wrongly.
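
    The idea of grouping cells by data dependence similarity and then filtering by non-overlap can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define ShareCheck's similarity measure or refinement procedure, so the sketch swaps in Jaccard similarity over sets of reference features, a hypothetical threshold of 0.5, and a greedy longest-first overlap filter. None of these choices is taken from the dissertation.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the real measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_by_similarity(cells, threshold=0.5):
    """cells: list of (name, frozenset of dependence features) in row/column order.
    Groups runs of consecutive cells whose neighbour similarity >= threshold."""
    groups, current = [], []
    for cell in cells:
        if current and jaccard(current[-1][1], cell[1]) >= threshold:
            current.append(cell)
        else:
            if len(current) > 1:          # a single cell is not an array
                groups.append([name for name, _ in current])
            current = [cell]
    if len(current) > 1:
        groups.append([name for name, _ in current])
    return groups

def drop_overlapping(candidates):
    """Greedy filter (assumption): keep longer candidates first and discard any
    candidate that shares a cell with one already kept."""
    kept, used = [], set()
    for group in sorted(candidates, key=len, reverse=True):
        if used.isdisjoint(group):
            kept.append(group)
            used.update(group)
    return kept

# Features mix relative offsets and a shared absolute reference (F1).
column = [
    ("D2", frozenset({"R[0]C[-2]", "F1"})),
    ("D3", frozenset({"R[0]C[-2]", "F1"})),
    ("D4", frozenset({"R[0]C[-2]", "F1", "R[-1]C[0]"})),  # one extra input
    ("D5", frozenset({"R[0]C[-1]"})),                     # unrelated inputs
]
candidates = group_by_similarity(column) + [["D4", "E4"]]
print(drop_overlapping(candidates))
# → [['D2', 'D3', 'D4']]
```

    D4 stays in the group even though it has one extra input, which is the kind of cell a similarity-based approach can keep and later flag, whereas a strict same-pattern rule would split the array at that point.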

Language: Chinese
Content Type: Dissertation
URI: http://ir.iscas.ac.cn/handle/311060/17164
Appears in Collections: Software Engineering Technology Research and Development Center _ Dissertations

Files in This Item:
File Name / File Size: 毕业论文-1.8.pdf (4943KB)
Access: Restricted (contact the repository to request the full text)

Recommended Citation:
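Wensheng Dou (窦文生). Detecting and Repairing Cell Array-Based Computational Semantic Errors in Spreadsheets (基于单元阵列的电子表格计算语义错误检测与修复)[D]. Beijing. University of Chinese Academy of Sciences. 2015-05-26.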

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.