21 May 2008

DSpam: WebGUI modifications - Javascript goodies.

I've been using dspam for two years now. It is set up as a broad filter that checks all the email passing through our systems. It has worked very well over the past year, although going through the thousands of emails has been a chore.

The WebGUI for dspam is rudimentary and is really built for single users to review their own spam and ham. It isn't designed for large volumes, so I had to make some modifications.

Modification #1: Looking for missed SPAM.

The problem with the History tab in the WebUI is that it displays ALL the emails which pass through the system. That is normally what you want, but if you are only interested in reviewing the False Negatives, i.e. SPAM that got away, it takes ages to scroll through page after page. If your domain attracts a lot of spam, over 70% of the entries are spam anyway, and you aren't really interested in those.
So the solution is to skip the SPAM entries read from the /var/dspam/.../dspam.log file. To do this, you will need to modify the /var/www/html/dspam.cgi file.

This is the patch (dspam.skipSPAM.patch):

With this patch, you can toggle the skipping of spam entries by adding another URL argument, &skipSPAM=true, to the address. If you want this on by default, just set $skipSPAM = "true" in the Perl script.
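
The patch itself is Perl, but the filtering idea is simple enough to sketch in JavaScript. This is only an illustration, not the patch: it assumes each dspam.log line is tab-separated with the classification code in the second field, and that "S" marks spam and "W" marks whitelisted mail, so check that against your own log first. The function name filterHistory and the sample lines are made up.

```javascript
// Hypothetical helper, not part of dspam.cgi: drop history lines we do not
// want to review. Assumption: tab-separated fields, class code in field 2.
function filterHistory(lines, skipSPAM) {
  var skipped = { S: true, W: true }; // assumed codes: S = spam, W = whitelisted
  return lines.filter(function (line) {
    if (skipSPAM !== "true") { return true; } // toggle off: show everything
    var cls = line.split("\t")[1];
    return !skipped[cls];
  });
}

// Made-up sample lines in the assumed format.
var sample = [
  "1211328000\tS\tspammer@example.com\tsig1\tCheap pills",
  "1211328060\tI\tfriend@example.com\tsig2\tLunch?",
  "1211328120\tW\tboss@example.com\tsig3\tReport"
];
console.log(filterHistory(sample, "true").length);  // 1
console.log(filterHistory(sample, "false").length); // 3
```

The toggle is compared against the string "true" to mirror the way the URL argument and the $skipSPAM variable are described above.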

The result should look something like this:
Notice how the spam entries are left out, giving you a clear view of what to retrain or allow. I also skip whitelisted emails, which means fewer lines to review. I use Firefox's tabbed browsing and just middle-click the entries I want to flag as spam. The process is very fast, and I probably only need to click through about 3 pages of history to mark off any significant growth in spam.

Modification #2: Marking Dead meat the brute force way
One of the hassles of clearing the quarantine is going through each and every spam item and checking it off. Early on, I modified the nav_quarantine.html template file with a small piece of JavaScript that checks off the first 200 items. Here is the patch for the "Select 200" modification to the templates/nav_quarantine.html file (nav_quarantine.select200.patch).
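
The idea behind "Select 200" is just to tick the first 200 checkboxes on the page. A minimal sketch of that idea follows; it is not the patch itself, and the function name selectFirst and the selector in the comment are assumptions about the markup.

```javascript
// Check the first `limit` boxes in an array-like list of checkbox elements.
// Returns how many boxes were actually checked.
function selectFirst(boxes, limit) {
  var count = Math.min(boxes.length, limit);
  for (var i = 0; i < count; i++) {
    boxes[i].checked = true;
  }
  return count;
}

// In the browser (assumed markup) this could be called as:
//   selectFirst(document.querySelectorAll("input[type=checkbox]"), 200);
// Stand-in objects so the sketch also runs outside a browser:
var fakeBoxes = [];
for (var n = 0; n < 250; n++) { fakeBoxes.push({ checked: false }); }
console.log(selectFirst(fakeBoxes, 200)); // 200
```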

This worked well to a certain extent, but when you have over 10K entries, that still means 50 or more page refreshes at 200 entries each, which is certainly not an option. There MUST be a better way. And there is ...

Modification #3: Marking Dead meat the elegant way.
Blindly selecting the first 200 entries isn't really an efficient way of culling the confirmed spam. I needed an almost automated way to handle this, so I embedded more information from dspam into the WebUI and wrote some JavaScript to make the process a lot more bearable.

The first requirement is to remove all the marked spam above a given percentage of certainty. Throughout the entire production usage of DSpam, I have yet to see a False Positive with a certainty score of more than 70%. What would be great is to automatically check off all entries above a given score. This is now possible by entering a confidence number and simply clicking on the "Mark Rating" button.
The script uses XPath to select all rows with a rating higher than the number entered. The JavaScript code looks something like this:

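// Convert the percentage typed into the "rating_val" field to a 0-1 fraction.
var pRate = parseFloat( document.getElementById("rating_val").value ) / 100;
// Select every table row whose rating attribute exceeds that fraction.
var xpath = document.evaluate( "//tr[@rating > " + pRate + "]", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null );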
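
A snapshot result is read back with snapshotLength and snapshotItem(). A minimal sketch of the marking step might look like the following; the function name markRows and the assumption that each matching row holds exactly one checkbox are mine, not taken from the patch.

```javascript
// Check the checkbox inside every row of an XPath snapshot result.
// Assumption: the first <input> in each <tr> is the row's checkbox.
function markRows(snapshot) {
  var marked = 0;
  for (var i = 0; i < snapshot.snapshotLength; i++) {
    var row = snapshot.snapshotItem(i);
    var box = row.getElementsByTagName("input")[0];
    if (box && !box.checked) {
      box.checked = true;
      marked++; // count only the boxes this call actually changed
    }
  }
  return marked;
}
```

In the page this would be called as markRows(xpath), where xpath is the result of the document.evaluate call.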
I have also added an extra feature: it will also mark off similar items which may have a lower confidence than the one entered. This relies on the "hash" I generate for all the entries, which is described under Modification #5. Because of this 'recursive' behaviour, the script takes a while to complete, so you may need to increase the script timeout in Firefox (otherwise it may complain with "A script on this page may be busy, or it may have stopped responding. You can stop the script right now, or you can continue to see if it completes."). To do so, type about:config in the URL bar and raise dom.max_script_runtime from the default of 10 seconds to something larger, like 500.

Modification #4: Ajax waits for no refresh.
Another tedious part of using the WebUI is that purging the caught spam from the quarantine takes ages, because every purge causes an entire page refresh. It's OK if the list has fewer than a thousand entries, but when it reaches 20K or more, it's just too much.

Then there is another problem with deleting entries from the /var/dspam/.../dspam.mbox file. If a new email arrives at any point while entries are being removed, the deletion is cancelled and the file rolls back to its original state plus the new email. So realistically, on a busy system, you can't delete more than 50 spams in one go. This means enduring A LOT of page refreshes.

What I implemented then was an AJAX-type handler for dspam.cgi to execute. I added the JavaScript features to the WebUI, and it looks like this:
When you click the button, the JavaScript walks through the checked list, and once it has collected 25 entries, it sends a query back to dspam.cgi to execute in the background. It tells the user that it is currently "deleting 25". When the call succeeds, it reports "deleted 25". It then repeats the process for as long as there are checked items left.

The figure of 25 is something I found to be small enough to avoid rollbacks, and because the process is automated, it doesn't need to be large. Clearing 15K entries takes about 30 minutes to an hour.
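
The batching logic can be sketched roughly as follows. The send and report callbacks are placeholders: the actual request that the patched dspam.cgi accepts is not shown in this post, so nothing about it is assumed here, and the names toBatches and deleteInBatches are made up for illustration.

```javascript
// Split a list of ids into consecutive batches of at most `size` items.
function toBatches(ids, size) {
  var out = [];
  for (var i = 0; i < ids.length; i += size) {
    out.push(ids.slice(i, i + size));
  }
  return out;
}

// Send one batch at a time; start the next only after the previous succeeds.
// `send(batch, done)` and `report(text)` are hypothetical callbacks.
function deleteInBatches(ids, size, send, report) {
  var queue = toBatches(ids, size);
  function next() {
    if (queue.length === 0) { return; }
    var batch = queue.shift();
    report("deleting " + batch.length);
    send(batch, function () {
      report("deleted " + batch.length);
      next();
    });
  }
  next();
}

// Dry run with a fake sender that succeeds immediately.
var messages = [];
deleteInBatches(["a", "b", "c"], 2,
  function (batch, done) { done(); },
  function (text) { messages.push(text); });
console.log(messages.join(", ")); // deleting 2, deleted 2, deleting 1, deleted 1
```

At 25 entries per request, 15K entries works out to 600 requests, so 30 minutes to an hour means roughly 3 to 6 seconds per request.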

Modification #5: Hashing up spam
For the remaining spam which isn't obvious, I have included two little clickies at the end of the table. "del" deletes the entry immediately, while "hash" checks the entry's checkbox along with those of all entries with similar subjects. This means you can mark off multiple spams with just one click, as demonstrated below:

This makes marking off spam almost ... fun!
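
How "similar" is decided is not spelled out above, so the sketch below picks one plausible rule purely as an assumption: two subjects match if they are equal after lowercasing and stripping digits, punctuation and extra spaces. The names subjectKey and checkSimilar and the entry objects are hypothetical.

```javascript
// Reduce a subject to a comparison key (assumed rule, not dspam's own).
function subjectKey(subject) {
  return subject.toLowerCase()
    .replace(/[^a-z ]+/g, " ") // drop digits, punctuation, non-ASCII
    .replace(/\s+/g, " ")      // collapse runs of whitespace
    .trim();
}

// Check every entry whose subject key equals that of the clicked entry.
// Each entry is assumed to look like { subject: "...", checked: false }.
function checkSimilar(entries, clicked) {
  var key = subjectKey(clicked.subject);
  var count = 0;
  entries.forEach(function (entry) {
    if (subjectKey(entry.subject) === key) {
      entry.checked = true;
      count++;
    }
  });
  return count;
}

// Made-up quarantine entries.
var entries = [
  { subject: "Save 50% on watches!!", checked: false },
  { subject: "save 70% on WATCHES", checked: false },
  { subject: "Meeting notes", checked: false }
];
console.log(checkSimilar(entries, entries[0])); // 2
```

A looser or stricter key function changes how aggressive a single "hash" click is, so this is the part to tune.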

Patching the WebGUI
I include three patches with this post. In the directory containing dspam.cgi, run this:
# patch < dspam.skipSPAM.patch
# patch < dspam.ajax.patch

and in the template directory:
# patch < nav_quarantine.select200.patch

Otherwise, the modified dspam.cgi and template/nav_quarantine.html files are also available.

How I use these modifications
Whenever I have the time to review the spam collection:
  1. I load the quarantine page and wait until it's fully loaded.
  2. I then click on "Mark Spam" with the default rate of 85%.
  3. This takes a few seconds, depending on your PC.
  4. I then click on the "Ajax delete" button to start the deletion process in the background.
  5. In the meantime, I reduce the rate to 70%, and sometimes 60%, to clear off further spam.
  6. I also start from the top, i.e. the 47% confidence spam items, and slowly review the items up to about 53%, clicking on "hash" to remove the spam items.
  7. After I clear out the False Positives (if any), I click on "Select 200" and eyeball the remaining items until there are no entries left.
  8. It still takes some time, but at least it's a whole lot less time than before!

I hope this helps!

yk