AllLinksAreSpam


 

AllLinksAreSpam is based on the HTMLText project. The two are syntactically identical, but a simple semantic action has been inserted that classifies a mail as spam as soon as a link is found in it. This is reasonable because almost every spam mail is a vehicle for links. If links are allowed only in e-mails whose sender addresses are in the friend list, AllLinksAreSpam makes an effective spam filter. The project also demonstrates the advantage of the HTML option over the text option: the plain-text version of a mail often does not contain all of its links.

 

The action is executed in the Link production:

 

  NORMAL_LINK               {{m_iResult = -1; }}

| "http://www.mydomain.com"

| "mailto:"?

  (

      EMAIL                 {{m_iResult = -1; }} 

    | "myname@mydomain.com"

  )

 

 

NORMAL_LINK is a regular expression that describes the pattern of most links:

 

NORMAL_LINK ::= 

(http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+
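
To get a feeling for what NORMAL_LINK accepts, the pattern can be tried outside of TextTransformer. The following is only a minimal sketch: it assumes that std::regex with its default ECMAScript syntax treats this particular pattern the same way as the TextTransformer scanner does, and the helper name isNormalLink is hypothetical and not part of the project.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of AllLinksAreSpam: returns true if the
// whole string s matches the NORMAL_LINK pattern quoted above.
bool isNormalLink(const std::string& s)
{
    // Optional scheme, then at least two dot-separated runs of characters
    // that are not whitespace, '<', '>', '"' or '@'.
    static const std::regex normalLink(
        R"re((http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+)re");
    return std::regex_match(s, normalLink);
}
```

With this helper, isNormalLink("http://www.example.com") and isNormalLink("www.example.com/page.html") are true, while isNormalLink("example") is false because the pattern requires at least one dot. The pattern is deliberately broad: any dotted word without blanks, such as readme.txt, is accepted too.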

 

EMAIL is a regular expression that describes the pattern of most e-mail addresses:

 

EMAIL ::= 

[\w\.-]+ \ // local part

@ \

([\w-]+\.)+ \ // sub domains

[a-zA-Z]{2,4} // top level domain
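
The parts of EMAIL can be joined into a single line and tested in the same way. This is again only a sketch under the assumption that std::regex (ECMAScript syntax) is close enough to the TextTransformer regex engine for this pattern; the helper name isEmail is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of AllLinksAreSpam: returns true if the
// whole string s matches the EMAIL pattern (the continuation lines above
// joined into one expression, comments removed).
bool isEmail(const std::string& s)
{
    static const std::regex email(
        R"re([\w\.-]+@([\w-]+\.)+[a-zA-Z]{2,4})re");
    return std::regex_match(s, email);
}
```

For example, isEmail("myname@mydomain.com") is true and isEmail("nobody@localhost") is false, because at least one sub domain followed by a dot is required. Since the top level domain is limited to two to four letters, an address such as user@example.museum is not matched as a whole.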

 

One's own addresses are a special case. Spammers sometimes copy them into the mail. However, such an address can also indicate that the mail is a reply from a sender who is not yet included in your friend list.

It is a good idea to develop a regular expression that matches your own address only when it is quoted exactly the way you write it in your own mails. For example, the regular expression

 

MY_EMAIL ::= -+\r?\nmailto:myname@mydomain.com

 

would match the following notation:

 

----------------

mailto:myname@mydomain.com
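
The effect of tying the address to its notation can be checked with a small sketch. It assumes that std::regex (ECMAScript syntax) handles the MY_EMAIL pattern like TextTransformer does; the helper name containsMySignature is hypothetical. The pattern is copied verbatim, so the unescaped dot in it matches any character.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of AllLinksAreSpam: returns true if text
// contains the own address in the notation described by MY_EMAIL, i.e.
// a line ending in dashes directly followed by the mailto line.
bool containsMySignature(const std::string& text)
{
    static const std::regex myEmail(
        R"re(-+\r?\nmailto:myname@mydomain.com)re");
    return std::regex_search(text, myEmail);
}
```

A quoted signature such as "Regards\n----------------\nmailto:myname@mydomain.com" is found, with either \n or \r\n as the line break. The same address in running text, for example "Write to mailto:myname@mydomain.com now", is not found, because the preceding line of dashes is missing.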

 

 

The Link production could then be extended to:

 

  NORMAL_LINK               {{if(m_iResult != 1) m_iResult = -1; }}

| "http://www.mydomain.com"

| "mailto:"?

  (

      EMAIL                 {{if(m_iResult != 1) m_iResult = -1; }} 

    | MY_EMAIL              {{m_iResult = 1; }}

  )
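
The interplay of the two kinds of actions in the extended production can be sketched in plain C++. The struct LinkClassifier and its method names are hypothetical stand-ins for the generated parser class, and the initial value 0 for m_iResult is an assumption; only the bodies of the two methods are taken from the actions above.

```cpp
// Hypothetical stand-in for the parser class that owns m_iResult.
struct LinkClassifier
{
    int m_iResult = 0;  // assumption: 0 means "no link found yet"

    // Mirrors the action attached to NORMAL_LINK and EMAIL:
    // mark the mail as spam unless MY_EMAIL has already been seen.
    void onOtherLink() { if (m_iResult != 1) m_iResult = -1; }

    // Mirrors the action attached to MY_EMAIL.
    void onMyEmail() { m_iResult = 1; }
};
```

Once onMyEmail has been called, later calls to onOtherLink no longer change the result. A mail that contains the own address in the MY_EMAIL notation therefore ends with m_iResult == 1 regardless of the order in which the links occur, while a mail with other links only ends with m_iResult == -1.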