
Fighting SPAM and Cleansing Data of Unwanted Characters


iwato


BACKGROUND: Many months ago I created three custom search engines and two word clouds. The latter are used to track visitor search interest and Grammar Captive's ability to match it. Unfortunately, someone is now abusing my search engines and filling those fields of my database set aside to track search input. The result of this malicious effort is word clouds that no longer perform their task. While investigating the mess I have realized two important shortcomings of my search engines that I must address:

The Sanitization/Filtering of Search Input -- In order to produce meaningful word clouds I must clean my search input of non-meaningful characters such as ', ", /, [, (, &quot, etc. Further, as my search engines are capable of handling multilingual input, I must be able to handle unwanted characters in a variety of languages.

Spam Prevention -- I must be able to prevent visitors from abusing my search engines and thus destroying Grammar Captive's ability to track and analyze true visitor interest.

I do not see myself as resolving these two problems easily and would be grateful for any advice that would push me in the direction of a successful resolution.

Roddy

PS.  If you would like to view the damage that has occurred, go to the Grammar Captive website and examine the word clouds found under Visitor Profile/Word Clouds in the navigation bar.


As "word cloud" is not a well-defined term, are you saying that you're displaying on your site the search terms that people search for?  If that's the case, if you're displaying user input on your site, and want to avoid spam, then probably the only way to reliably do that is to only consider search terms that you have validated.  So you would need to build a way for you to periodically review and validate the search terms that you want to be included.

Removing unwanted characters is as easy as doing a string replace to find and replace whatever you want to exclude.  You can use a whitelist or blacklist approach.
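As a rough illustration of the whitelist approach, here is a minimal Python sketch (the site's own code may well be in another language). The function name `clean_search_term` is made up for this example. It assumes that "wanted" means letters, digits, and combining marks in any script, and that everything else should collapse to a space. One known side effect: apostrophes are dropped too, so "don't" becomes "don t".

```python
import html
import unicodedata


def clean_search_term(raw: str) -> str:
    """Whitelist filter: keep letters, digits, and combining marks in any script."""
    # Turn HTML entities such as &quot; back into the characters they stand for.
    text = html.unescape(raw)
    # Normalize so that visually identical input is stored in one form.
    text = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in text:
        # Unicode general categories: L* = letters, N* = numbers, M* = marks.
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            kept.append(ch)
        else:
            # Punctuation, symbols, and control characters become separators.
            kept.append(" ")
    # Collapse runs of whitespace and fold case so duplicates are counted together.
    return " ".join("".join(kept).split()).casefold()


print(clean_search_term('"past [perfect]" &quot;tense&quot;'))
print(clean_search_term("「日本語」/文法"))
```

Because the test is on Unicode categories rather than a fixed list of characters, the same filter covers brackets and quotation marks from other writing systems without listing them one by one.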


The term word cloud is very well defined, if you open the suggested menu item, for there you can see the currently perverted result. More importantly, how does one go about validating terms that users make up? Indeed, the whole purpose of the search engines and accompanying word clouds is to discover what my visitors want and what they are able to find. For, in this way I can better serve them, and they can better serve themselves.

Specifically,

  1. A visitor enters a keyword or phrase in the search box.
  2. A search is made for the keyword or phrase, and all fields of a certain kind that contain that word or phrase are returned.
  3. A count is made of the number of matched rows within the relevant fields.
  4. The search keyword or phrase, the count, and a name for the indexed fields are sent to the local Matomo database.
  5. The values for the above three variables are then parsed and two word clouds are generated: one for the number of times that a particular keyword or phrase has been searched; and one for the number of times that a match has been found.
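Steps 4 and 5 can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `SearchEvent` and `build_cloud_weights` are hypothetical names, the sample index names are invented, and the second cloud is assumed to weight each term by its total number of matched rows, which is one possible reading of step 5.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class SearchEvent(NamedTuple):
    """One recorded search: the three values from step 4 (hypothetical layout)."""
    term: str
    match_count: int
    index_name: str


def build_cloud_weights(events: Iterable[SearchEvent]) -> Tuple[Counter, Counter]:
    """Return (times searched per term, total matched rows per term)."""
    searches: Counter = Counter()
    matches: Counter = Counter()
    for event in events:
        searches[event.term] += 1                # weight for the first word cloud
        matches[event.term] += event.match_count  # weight for the second word cloud
    return searches, matches


events = [
    SearchEvent("gerund", 3, "podcast"),
    SearchEvent("gerund", 0, "podcast"),
    SearchEvent("article", 5, "newsletter"),
]
searches, matches = build_cloud_weights(events)
print(searches.most_common())
print(matches.most_common())
```

Any validation step would have to sit either before a record like this is written (step 4) or before the counts are built (step 5).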

Where would the validation take place in the above scheme of things? And, what form are you suggesting that it would take? The only truly secure way, of which I can think, to ensure that spamming does not take place is to deny visitors who are not registered members the use of the search procedures. This, however, appears extreme.

Roddy

Edited by iwato
Information upgrade

Quote

The term word cloud is very well defined

Not the way I think you're using it, as in "displaying search terms."  It is a simple general data visualization tool to identify the most common terms in some set.  If you are applying that to user-supplied input, that's what I'm wondering and that's what I'm asking.

Quote

More importantly, how does one go about validating terms that users make up?

In general, you just need to figure out what business rules you want to use for that.  How specifically do you want validation to work, what do you want to validate?  Do you want to use a whitelist, a blacklist, or some other method?


Quote

If you are applying that to user-supplied input, that's what I'm wondering and that's what I'm asking.

I am using it in this way: to visualize recorded frequencies for identical search terms, and for recorded frequencies for discovered matches for those terms.

Quote

In general, you just need to figure out what business rules you want to use for that

I could whitelist registered members, but even these could be abusive. I have also been thinking of implementing a no-robot captcha clicker, for this would require only a simple click before initiating a search. I have read, however, that even these will only constrain the number of abusers -- not eliminate them.

Currently, Matomo collects everything. I then extract the data from specific fields and store them in another database, from which visitors call up the data on demand. I suppose I could write an algorithm that would examine the redundancy and variety of searches from a single user over time. My gosh. Must I reinvent Google all over again? This is why I installed Matomo in the first place: in order to avoid having to reinvent the wheel. My goal is not to outguess my visitors; rather, it is to learn from and teach them.
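Such a redundancy check need not be elaborate. A minimal Python sketch, assuming each extracted record can be reduced to a (visitor identifier, search term) pair; `flag_repetitive_visitors`, `min_searches`, and `max_variety` are hypothetical names, and the default thresholds are placeholders rather than recommendations:

```python
from collections import Counter, defaultdict
from typing import Iterable, Set, Tuple


def flag_repetitive_visitors(
    events: Iterable[Tuple[str, str]],
    min_searches: int = 20,    # ignore visitors with too few searches to judge
    max_variety: float = 0.2,  # distinct terms / total searches
) -> Set[str]:
    """Return visitor ids whose searches are numerous but show little variety."""
    terms_by_visitor = defaultdict(Counter)
    for visitor_id, term in events:
        terms_by_visitor[visitor_id][term] += 1

    flagged = set()
    for visitor_id, terms in terms_by_visitor.items():
        total = sum(terms.values())
        variety = len(terms) / total  # 1.0 means every search was different
        if total >= min_searches and variety <= max_variety:
            flagged.add(visitor_id)
    return flagged


events = [("visitor-a", "same phrase")] * 30 + [
    ("visitor-b", "gerund"),
    ("visitor-b", "article"),
    ("visitor-b", "past perfect"),
]
print(flag_repetitive_visitors(events))
```

This only catches someone repeating the same few terms; a visitor who floods the field with many different junk strings would pass, so it is a partial measure at best.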

How about dual fields in the same database: one temporary and one permanent? Create an algorithm that scans the temporary field and fills from the result the field that is used to create the respective word clouds? Still, I would need guidelines for creating the algorithm's parameters.
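To make the temporary/permanent idea concrete, here is a minimal Python sketch (Python 3.9 or later). Everything in it is an assumption for illustration: the staged data is modeled as a mapping from each term to the set of visitor ids that searched for it, and `PromotionRules`, `promote`, and the three thresholds are hypothetical starting points that would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionRules:
    """Hypothetical parameters for moving a term from staging to permanent."""
    max_chars: int = 40             # very long input is unlikely to be a real query
    max_words: int = 5              # likewise for long sentences
    min_distinct_visitors: int = 2  # one visitor alone cannot promote a term


def promote(
    staged: dict[str, set[str]],
    rules: PromotionRules = PromotionRules(),
) -> dict[str, int]:
    """Return the terms that pass every rule, weighted by distinct visitors."""
    promoted = {}
    for term, visitors in staged.items():
        if len(term) > rules.max_chars:
            continue
        if len(term.split()) > rules.max_words:
            continue
        if len(visitors) < rules.min_distinct_visitors:
            continue
        promoted[term] = len(visitors)
    return promoted


staged = {
    "gerund": {"v1", "v2", "v3"},
    "buy cheap pills now at example": {"v9"},
}
print(promote(staged))
```

Weighting by distinct visitors instead of raw search counts means a single abuser can add at most one to any term, whatever the other thresholds are set to.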

Roddy

Edited by iwato
Informational upgrade

If you want to eliminate spam you need to personally validate what's there, which is probably more work than you want to do. If you're already using Matomo or something else to help with search, then I would look at the capabilities of that. If you're not using anything, then yeah, you're on your own with writing the code you want to use.
