Patent content provided by Intellectual Property Publishing House
Patent title: EXTRACTING PRINCIPAL CONTENT FROM WEB PAGES
Inventors: BIGNERT, Jakob; COARNA, Gabriel, Alexandru
Application number: EP12847034.1
Filing date: 20121107
Publication number: EP2776945A1
Publication date: 20140917
Abstract: Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces, where a container is a Web page element that is associated with the tags 'body', 'div', 'td', 'li', 'article/section', and pieces are candidates that do not include other candidates. Candidate scores may be modified according to a number of ratios corresponding to text and link density.
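The scoring and selection steps named in the abstract can be illustrated with a minimal sketch. The abstract does not disclose the actual formulas, so everything here is an assumption made for illustration: the `Candidate` fields, the `score` weighting (paragraph count scaled down by link density), and the reading of 'article/section' as the two tags `article` and `section`. It is not the method claimed in the patent.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Container tags as listed in the abstract; splitting 'article/section'
# into two tags is an assumption.
CONTAINER_TAGS = {"body", "div", "td", "li", "article", "section"}


@dataclass
class Candidate:
    """Hypothetical candidate record; field names are illustrative only."""
    tag: str
    text_chars: int   # characters of visible text inside the element
    link_chars: int   # characters of that text that sit inside links
    paragraphs: int   # number of paragraphs inside the element
    children: List["Candidate"] = field(default_factory=list)


def is_container(c: Candidate) -> bool:
    """A container is an element associated with one of the listed tags."""
    return c.tag in CONTAINER_TAGS


def is_piece(c: Candidate) -> bool:
    """A piece is a candidate that does not include other candidates."""
    return not c.children


def link_density(c: Candidate) -> float:
    """Share of text that is link text; empty elements count as all links."""
    return c.link_chars / c.text_chars if c.text_chars else 1.0


def score(c: Candidate) -> float:
    """Assumed scoring rule: more paragraphs raise the score,
    a higher link density lowers it."""
    return c.paragraphs * (1.0 - link_density(c))


def walk(c: Candidate) -> Iterator[Candidate]:
    """Yield the candidate and all nested candidates, depth first."""
    yield c
    for child in c.children:
        yield from walk(child)


def top_candidate(root: Candidate) -> Candidate:
    """Select the candidate with the highest score."""
    return max(walk(root), key=score)
```

With a link-heavy navigation item and a text-heavy article nested inside a page body, this rule picks the article, because the navigation text is almost entirely links and the body's score is diluted by them.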
Applicant: Evernote Corporation
Address: 305 Walnut Street Redwood City, CA 94063 US
Nationality: US
Agent: Patentanwälte Freischem