EXTRACTING PRINCIPAL CONTENT FROM WEB PAGES

Source: 五一七教育网

Patent content provided by 知识产权出版社 (Intellectual Property Publishing House)

Patent title: EXTRACTING PRINCIPAL CONTENT FROM WEB PAGES

Inventors: BIGNERT, Jakob; COARNA, Gabriel, Alexandru

Application number: EP12847034.1

Filing date: 2012-11-07

Publication number: EP2776945A1

Publication date: 2014-09-17

Abstract: Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces, where a container is a Web page element associated with the tags 'body', 'div', 'td', 'li', 'article/section', and pieces are candidates that do not include other candidates. Candidate scores may be modified according to a number of ratios corresponding to text and link density.
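The scoring steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract gives no formulas, so the weights, the `Candidate` fields, the `score` function, and the punctuation set are all assumptions chosen for illustration. The sketch only shows the general shape of the idea: score container candidates by paragraphs, images, and text, discount them by link density, pick the top candidate, and use punctuation to split CJK text into word-like segments.

```python
from dataclasses import dataclass

# Container tags as listed in the abstract ('article/section' split in two).
CONTAINER_TAGS = {"body", "div", "td", "li", "article", "section"}

# Assumed set of CJK punctuation marks; the abstract does not enumerate them.
CJK_PUNCTUATION = set("，。、；：！？")


@dataclass
class Candidate:
    """Hypothetical summary of one page element considered as main content."""
    tag: str
    text_length: int       # characters of visible text
    link_text_length: int  # characters of text inside links
    paragraphs: int
    images: int


def link_density(c: Candidate) -> float:
    """Fraction of the text that sits inside links (1.0 if there is no text)."""
    if c.text_length == 0:
        return 1.0
    return c.link_text_length / c.text_length


def score(c: Candidate) -> float:
    """Illustrative score: the weights 10, 5 and 1/100 are arbitrary assumptions."""
    base = c.paragraphs * 10 + c.images * 5 + c.text_length / 100
    # Link-heavy blocks such as navigation bars are discounted.
    return base * (1.0 - link_density(c))


def top_candidate(candidates):
    """Return the highest-scoring container candidate, or None if there is none."""
    containers = [c for c in candidates if c.tag in CONTAINER_TAGS]
    if not containers:
        return None
    return max(containers, key=score)


def cjk_segment_count(text: str) -> int:
    """Count punctuation-delimited runs of text as a stand-in for CJK word units."""
    count = 0
    in_segment = False
    for ch in text:
        if ch in CJK_PUNCTUATION:
            in_segment = False
        elif not ch.isspace():
            if not in_segment:
                count += 1
                in_segment = True
    return count


# Usage: a link-heavy navigation block versus a paragraph-rich article block.
nav = Candidate(tag="div", text_length=200, link_text_length=180, paragraphs=0, images=0)
article = Candidate(tag="div", text_length=3000, link_text_length=150, paragraphs=12, images=2)
best = top_candidate([nav, article])
print(best.tag, round(score(best), 2))
```

With these assumed weights the article block outranks the navigation block, because its link density is low and it contains many paragraphs. A real extractor would also need the clean up and final page processing steps that the abstract mentions, which this sketch omits.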

Applicant: Evernote Corporation

Address: 305 Walnut Street, Redwood City, CA 94063, US

Country: US

Agent: Patentanwälte Freischem