The signal-to-noise ratio (SNR) of a web page is the ratio of its text to its HTML code. More broadly, it is the ratio of useful information to useless, irrelevant information. It is an important concept, closely tied to keeping page code lean. Anyone with a basic grasp of how search engines work knows that the crawling system first downloads a page and then extracts the text from it: after some analysis it strips the HTML formatting, eliminates the noise, and then performs word segmentation. Noise removal is clearly one of the search engine's own steps, so what happens if we do as much of that work for it as we can? The search engine will favor the page: the leaner the page, the more efficiently the spider program can crawl it. In practice, how do we improve the signal-to-noise ratio?

I. Removing noise code

The first step of noise removal in a search engine is clearing the HTML formatting, so streamlining a site's HTML code is also our first step in denoising. We often say "streamline the code", "write pages to W3C standards", or "avoid tables where possible and use div+css". All of these amount to streamlining the code, but many people do not know what the benefit is to search engines. Full-time SEO staff should therefore learn some of the principles behind search engines so as to understand them as a whole, which is a great help for future SEO work. Removing noise code mainly involves:

1. Keep JS code to a minimum and load it from external files where possible.

2. Merge CSS where possible and move whatever can be moved into external files, so that the HTML code and the CSS are separated.

3. When using DIV+CSS, keep the nesting as shallow as possible.

4. Reduce the use of flash, iframes and images.

II. Removing noise content

After the search engine has stripped the noise from the code, the next step is noise removal within the text itself. This step really amounts to extracting the theme of the page and ignoring irrelevant content. When judging the subject of a page, for example, the search engine filters out copyright information, navigation, footers and other shared sections. To the search engine these are noise: they appear on every page, they are public modules, and they cannot represent the main content or theme of the page. They only interfere with it. Some B2C websites, for example, devote a large part of the bottom of every product page to notices about purchasing, security, payment and so on, all of which interferes with the search engine's judgment of the page.
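The idea that shared modules (navigation, footers, copyright notices) count as noise because they appear on every page can be illustrated with a small sketch. This is not how any particular search engine implements it. The function names, the one-block-per-line splitting and the `min_share` threshold are all assumptions made for the example.

```python
from collections import Counter


def split_blocks(page_text):
    """Split extracted page text into non-empty blocks, one per line,
    with runs of whitespace collapsed."""
    return [" ".join(line.split()) for line in page_text.splitlines() if line.strip()]


def find_shared_blocks(pages, min_share=0.8):
    """Return the blocks that occur on at least `min_share` of the pages.

    `min_share` is an assumed threshold chosen for illustration only.
    """
    counts = Counter()
    for text in pages:
        counts.update(set(split_blocks(text)))  # count each block once per page
    needed = min_share * len(pages)
    return {block for block, n in counts.items() if n >= needed}


def main_content(page_text, shared):
    """Keep only the blocks that are not shared across the site."""
    return [block for block in split_blocks(page_text) if block not in shared]


# Three hypothetical product pages that share navigation and footer text.
pages = [
    "Home | Products | Contact\nRed kettle, 1.5 litres\nPayment and security notice\nCopyright Example Shop",
    "Home | Products | Contact\nBlue teapot, ceramic\nPayment and security notice\nCopyright Example Shop",
    "Home | Products | Contact\nGreen mug, set of four\nPayment and security notice\nCopyright Example Shop",
]

shared = find_shared_blocks(pages)
for page in pages:
    print(main_content(page, shared))
```

On these sample pages only the product line survives the filter, which mirrors the point of the text: the less a page is padded with repeated public modules, the larger the share of it that can represent its theme.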

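The text-to-code ratio this article starts from can be estimated roughly with Python's standard `html.parser` module. The sketch below is a simplification under stated assumptions: it counts characters, ignores the contents of `script` and `style` elements, and the names `VisibleTextParser` and `text_to_code_ratio` are made up for the example. It is not a metric published by any search engine.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that lies outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_to_code_ratio(html):
    """Return len(visible text) / len(full HTML source), or 0.0 for empty input."""
    if not html:
        return 0.0
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # collapse whitespace
    return len(text) / len(html)


# The same sentence wrapped in heavy markup versus lean markup.
bloated = (
    '<table><tr><td><font color="red"><b>Hello world</b></font></td></tr></table>'
    "<script>var a = 1;</script>"
)
lean = '<p class="note">Hello world</p>'

print(round(text_to_code_ratio(bloated), 2))
print(round(text_to_code_ratio(lean), 2))
```

Both samples carry the same visible text, so the difference in the two ratios comes entirely from the surrounding markup and inline script, which is the kind of code noise the list of measures above is meant to reduce.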
