
AllLinksAreSpam


AllLinksAreSpam is based on the HTMLText project. The two are syntactically identical, but a simple semantic action has been added which classifies a mail as spam as soon as a link is found in it. This makes sense because almost every spam mail is a vehicle for links. If links are allowed only in e-mails whose sender addresses are in the friend list, AllLinksAreSpam makes an effective spam filter. In addition, this project demonstrates the advantage of the HTML option over the text option: the plain-text version of a mail often doesn't contain all of the links.


The action is executed in the Link production:


  NORMAL_LINK               {{m_iResult = -1; }}
| ""
| "mailto:"?
      EMAIL                 {{m_iResult = -1; }}
    | ""




NORMAL_LINK is a regular expression which describes the pattern of most links:



(http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+
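
To get a feeling for what NORMAL_LINK accepts, the expression can be tried outside of the project. The following is only a sketch: it assumes that std::regex with its default ECMAScript syntax treats the pattern the same way as the parser's own regular expression engine, and the names normalLinkPattern and findFirstLink are invented for this illustration.

```cpp
#include <regex>
#include <string>

// NORMAL_LINK, copied from the text above into a raw string literal.
// Assumption: std::regex (ECMAScript syntax) stands in for the parser's
// own regular expression engine.
const std::regex& normalLinkPattern()
{
    static const std::regex pattern(
        R"re((http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+)re");
    return pattern;
}

// Hypothetical helper: returns the first substring of the text that
// looks like a link, or an empty string if there is none.
std::string findFirstLink(const std::string& text)
{
    std::smatch match;
    if (std::regex_search(text, match, normalLinkPattern()))
        return match.str();
    return std::string();
}
```

With this sketch, findFirstLink("visit www.example.com now") yields "www.example.com". The pattern is deliberately loose: it only requires two runs of permitted characters joined by a dot, so a token like "3.14" is also reported as a link.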


EMAIL is a regular expression which describes the pattern of most e-mail-addresses:


EMAIL ::= 
[\w\.-]+ \ // local part
@ \
([\w-]+\.)+ \ // sub domains
[a-zA-Z]{2,4} // top level domain
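
The EMAIL expression can be tried in the same way. This sketch assumes that the parts of the definition, joined without the line continuations and comments, form the single pattern shown in the code, and that std::regex with ECMAScript syntax behaves like the parser's engine. The function name findFirstEmail is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of the text that
// matches the EMAIL pattern, or an empty string if there is none.
std::string findFirstEmail(const std::string& text)
{
    // local part, "@", one or more sub domains, top level domain
    // (assumption: this is the joined form of the multi-line definition)
    static const std::regex email(
        R"re([\w\.-]+@([\w-]+\.)+[a-zA-Z]{2,4})re");

    std::smatch match;
    if (std::regex_search(text, match, email))
        return match.str();
    return std::string();
}
```

For the input "write to john.doe@mail.example.com today" the sketch returns "john.doe@mail.example.com". An address without a dot after the "@", such as "user@localhost", is not matched, because at least one sub domain followed by a top level domain is required.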


One's own addresses are a special case. Spammers sometimes copy them into the mail. But they can also indicate that the mail is a reply from a sender whom you haven't included in your friend list yet.

It is a good idea to develop a regular expression which matches your own address only when it is quoted exactly in the way you write it in your mails. For example, the regular expression:


MY_EMAIL ::= -+\r?\


would match the following notation:





The Link production could then be extended to:


  NORMAL_LINK               {{if(m_iResult != 1) m_iResult = -1; }}
| ""
| "mailto:"?
      EMAIL                 {{if(m_iResult != 1) m_iResult = -1; }}
    | MY_EMAIL                  {{m_iResult = 1; }}
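
The effect of the two actions can be sketched in plain C++. This is not the generated parser, only a rough model under several assumptions: the mail is split at whitespace instead of being tokenized by the grammar, std::regex replaces the parser's regular expression engine, the exact quoted form of your own address is passed as a plain string instead of the MY_EMAIL expression, and the return value 0 for "nothing found" is invented here. The function name classifyMail is hypothetical as well.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Rough model of the extended Link production.
// Return value mirrors m_iResult: -1 = spam, 1 = own address was quoted.
// Assumption: 0 means that neither a link nor an address was found.
int classifyMail(const std::string& text, const std::string& myQuotedAddress)
{
    static const std::regex normalLink(
        R"re((http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+)re");
    static const std::regex email(
        R"re((mailto:)?[\w\.-]+@([\w-]+\.)+[a-zA-Z]{2,4})re");

    int result = 0;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        if (word == myQuotedAddress) {
            // corresponds to: MY_EMAIL {{m_iResult = 1; }}
            result = 1;
        } else if (std::regex_match(word, email) ||
                   std::regex_match(word, normalLink)) {
            // corresponds to: {{if(m_iResult != 1) m_iResult = -1; }}
            if (result != 1)
                result = -1;
        }
    }
    return result;
}
```

The model shows the intended precedence: a mail that contains "http://spam.example" alone is classified with -1, but as soon as your own quoted address appears anywhere in the mail the result is 1, no matter whether links occur before or after it.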
