iwato

Overcoming Text Direction with Wordcloud2


BACKGROUND:  Not only has my website been spammed, but the spam appears to be of a special sort: someone appears to have discovered a way to confuse Wordcloud2's interpretation of the input data.  My reason for believing this is that I can make Wordcloud2 work with data drawn from different temporal ranges of the same data field.  You can see clearly from the www.grammarcaptive.com mainpage, under Visitor Profile/Word Clouds, that under normal conditions Wordcloud2 works with all languages.  When I widen the temporal range to include the spam, however, Wordcloud2 fails to render properly.

In order to make Wordcloud2 function properly it is necessary to create a list.  In the absence of the spam the items of this list appear in the following order.

targetList: الدواعي من فكرة قرامر كابتِف,1,構成,1,逆転分析,1,langue sécondaire,1,système éducatif,1,monde académique,1,Rundbrief,1,Sieben Tore,1,Socratic method,1,podcast,1,subject-verb,1,subject-verb pairs,1,subject-verb agreement,1,person and number,3,person,1,´person and number,1,person and happiness,2,´person and happiness,1

In the presence of the spam the list appears as follows:

targetList: 構成,157,逆転分析,157,langue sécondaire,157,système éducatif,157,monde académique,157,Rundbrief,157,Sieben Tore,157,Socratic method,157,podcast,157,subject-verb,157,subject-verb pairs,157,subject-verb agreement,157,&quotperson and number&quot,471,person,157,´person and number&#039,157,person and happiness,157,´person and happiness&#039,157,&#039person and happiness&#039,157,&#039&#039الدواعي من فكرة قرامر كابتِف,141

Notice the appearance of the Arabic entries in the two lists.

  1. In the first instance the Arabic appears first.  In the second instance it appears last.
  2. In the first instance the count appears with no HTML entity, in the second it appears with two.

QUESTION ONE:  How would you go about cleansing the data of the &quot, &#039, and &quotperson sequences before the list is created?  It appears to require some sort of regex.

QUESTION TWO:  Where would you cleanse the data?  Before entry, or after retrieval?  Caution:  Cleaning the data upon entry would likely destroy the ability to search phrases.

Roddy

 


OK.  I will try to make the problem easier to understand.  In order to do so, please examine carefully the following code snippet.

$.each(sourceObj, function(searchPhrase, searchCount) {
    listItem = [searchPhrase, searchCount];
    if (agex.test(searchPhrase)) {   // true when the phrase is in Arabic script
        listItem = listItem.reverse();
    }
    list.push(listItem);
});

The expression agex.test(searchPhrase) tests whether the value of searchPhrase is in Arabic script.  The result appears something like the following:

listItem:  Array [ "構成", 157 ]
listItem:  Array [ "逆転分析", 157 ]
listItem:  Array [ "langue sécondaire", 157 ]
listItem:  Array [ "système éducatif", 157 ]
listItem:  Array [ "monde académique", 157 ]
listItem:  Array [ "Rundbrief", 157 ]
listItem:  Array [ "Sieben Tore", 157 ]
listItem:  Array [ "Socratic method", 157 ]
listItem:  Array [ "podcast", 157 ]
listItem:  Array [ "subject-verb", 157 ]
listItem:  Array [ "subject-verb pairs", 157 ]
listItem:  Array [ "subject-verb agreement", 157 ]
listItem:  Array [ 141, "الدواعيمنفكرةقرامركابتِف" ]

If I do not reverse the order of the item, the corresponding listItem is returned as follows:

listItem:  Array [ "141, "الدواعيمنفكرةقرامركابتِف ]

Either way, the phrase and count are reversed in the final list, and Wordcloud2 fails.

When the value of searchPhrase is Arabic agex.test(searchPhrase) returns true as expected.
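Before assuming the array itself is reversed, it may help to check the logical order directly.  The following diagnostic sketch (my own, not from the original code) shows that the console's bidirectional rendering can be misleading:

```javascript
// Diagnostic sketch: the console's bidirectional rendering can make a
// correctly ordered array *look* reversed. Indexing the array reveals
// the true logical order regardless of how the console draws it.
var listItem = ["\u0627\u0644\u062f\u0648\u0627\u0639\u064a", 141]; // [Arabic phrase, count]

console.log(typeof listItem[0]);                       // "string" -- the phrase is still element 0
console.log(typeof listItem[1]);                       // "number" -- the count is still element 1
console.log(listItem[0].charCodeAt(0).toString(16));   // "627" -- first element starts with Arabic alef
```

If these checks pass, the array order is intact and only the visual presentation is at fault.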

Roddy


How would you go about cleansing the data of the &quot, &#039, and &quotperson before the list is created?  It appears to require some sort of REGEX expression.

You could use a regex, you could also use str_replace to just replace specific characters.

Where would you cleanse the data?

It depends if you want to save the original data or not.  If you don't care about the data before it gets cleaned, then clean it before you save it.
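If the cleanup happens client-side after retrieval, a minimal sketch (the function name and the exact entity list are my assumptions, based on the sequences shown in this thread) could target the specific malformed sequences:

```javascript
// Minimal sketch: strip the specific malformed entities seen in the data.
// The sequences lack their terminating semicolon, so they are matched
// literally, with the semicolon made optional.
function stripBadEntities(phrase) {
    return phrase
        .replace(/&quot;?/g, '')   // &quot with or without its semicolon
        .replace(/&#039;?/g, '')   // &#039 with or without its semicolon
        .replace(/\u00B4/g, '');   // stray acute accent (´) prefixes
}

console.log(stripBadEntities('&quotperson and number&quot')); // "person and number"
console.log(stripBadEntities('\u00B4person and happiness&#039')); // "person and happiness"
```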


Yes,  the agex variable to which the .test() function is applied contains a regular expression. 

// Matches any character in the main Arabic Unicode blocks and the
// Arabic presentation-form blocks.  (A regex literal cannot span lines,
// so the alternatives are written on a single line.)
var agex = /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD]/g;

This expression is supposed to identify all characters written in Arabic script, including numbers, punctuation, and various diacritical marks.  I can do the same for double-byte Japanese as well (not shown here).  This same procedure cannot be used for ASCII, however.  Indeed, I am finding it difficult to remove HTML entities like &#039 and &quot.
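As an aside on the pattern above: a regex constructed with the g flag behaves statefully when used with .test(), which can produce intermittent results.  A minimal demonstration (my own sketch, not from the post):

```javascript
// Pitfall: a regex created with the g flag is stateful when used with
// .test(), because lastIndex advances past each match. Repeated calls
// on the same string can therefore alternate between true and false.
var agex = /[\u0600-\u06ff]/g;
var mixed = 'a\u0633bc'; // one Latin letter, one Arabic letter, more Latin

var first = agex.test(mixed);  // true: match at index 1, lastIndex advances to 2
var second = agex.test(mixed); // false: the search resumes at index 2

// For a pure membership test the g flag is unnecessary; without it,
// .test() is stateless and always starts from the beginning.
var agexSafe = /[\u0600-\u06ff]/;
console.log(agexSafe.test(mixed)); // true on every call
```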

When a visitor enters a search term, Matomo appears to use something akin to urlencode() before anything is entered into its own database.  When I pull the data out I am left with encoded HTML entities that, generally speaking, I do not want (neither encoded nor decoded).  This said, an apostrophe between words or letters can be meaningful.  Consider, for example, the following four phrases: "my friend's idea", 'my friend's idea', "my friends' ideas", and 'my friends' ideas'.  Now, all of the single and double quotation marks are encoded identically, but only the following two are desired:

friend's

friends' ideas

My gosh, I cannot even write the regex to successfully eliminate &#039 and &quot.  Compare the following list items with the Arabic item at the bottom.

listItem:  Array [ "´person and number&#039", 157 ]
listItem:  Array [ "person and happiness", 157 ]
listItem:  Array [ "´person and happiness&#039", 157 ]
listItem:  Array [ "&#039person and happiness&#039", 157 ]
listItem:  Array [ 141, "الدواعيمنفكرةقرامركابتِف" ]

Yes, the Arabic list item does not fall in the proper order, but at least it comes out clean.  The following regex, which I wrote myself, simply does not work.

var pungex = new RegExp(/(&#\d{3};)|(&\s{4};)/g);

The CODE

if (name === 'searchKeyword') {
    var nakedArabic = [];
    var arabicText = '';
    if (agex.test(value)) {
        nakedArabic = value.match(agex);
        searchItem.target = nakedArabic.join('');
    } else {
        var strippedStr = value.replace(pungex, '');
        searchItem.target = strippedStr;
    }
}


Indeed, I am finding it difficult to remove HTML entities like &#039 and &quot.

How come?  What's the difficulty?  Is the problem that the escape sequences seem to be missing the terminating semicolon?  Can you figure out why that is?  If you're only seeing specific sequences, then you could just look for those specifically and remove them.


Yes and no.  Please match the following results against the included code.

RESULTS

listItem:  Array [ "subject-verb pairs", 157 ]
listItem:  Array [ "subject-verb agreement", 157 ]
listItem:  Array [ "&quotperson and number&quot", 471 ]
listItem:  Array [ "person", 157 ]
listItem:  Array [ "´person and number", 157 ]
listItem:  Array [ "person and happiness", 314 ]
listItem:  Array [ "´person and happiness", 157 ]
listItem:  Array [ 141, "الدواعيمنفكرةقرامركابتِف" ]

The CODE

$.each(jsonData, function(key, object) {
    var searchItem = {};
    var strippedStr = '';
    var agex = /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD]/g;
    var pungex = /(&#\d{3};)|(&\s{4};)/g;
    var pregex = /["\'\;]/g;
    $.each(object, function(name, value) {
        if (name === 'searchKeyword') {
            var nakedArabic = [];
            if (agex.test(value)) {
                nakedArabic = value.match(agex);
                searchItem.target = nakedArabic.join('');
            } else {
                var strippedStr = value.replace(pungex, '');
                strippedStr = strippedStr.replace(pregex, '');
                searchItem.target = strippedStr;
            }
        }
        if (name === 'searchCategory') {
            searchItem.category = value;
        }
        if (name === 'searchResultsCount') {
            searchItem.count = value;
        }
    });
    searchItems.push(searchItem);
});

Roddy



I doubt that someone is typing "&quot" into a search field somewhere, so that data is getting messed up somewhere along the way.  Ideally you would find where and fix it there so that the data can be normalized to start with.  Terms shouldn't be repeated just because they have extra punctuation.  Otherwise, before the data gets added to the database try to clean it up by looking for those specific terms and removing them.


Understand that my database has been spammed.  Examine carefully the count associated with each of the search terms.  You will see that the same numbers are associated with different terms and that the count for each is quite large.  Notice, too, the accent aigu (´) in front of the phrases "person and happiness" and "person and number"; this is not an accident.

targetList: 141,الدواعيمنفكرةقرامركابتِف,´person and happiness,157,person and happiness,314,´person and number,157,person,157,&quotperson and number&quot,471,subject-verb agreement,157,subject-verb pairs,157,subject-verb,157,podcast,157,Socratic method,157,Sieben Tore,157,Rundbrief,157,monde académique,157,système éducatif,157,langue sécondaire,157,逆転分析,157,構成,157

Yes, I can cleanse at the point of input, but how do I write the regex for my two examples:

friend's idea and friends' ideas

?

Though the cleansing issue remains important, the bigger issue is finding a way to stop the reversal of the target and count elements in the case of Arabic search words.  I have even forcibly reversed the direction of the Arabic list items, but as soon as they are pushed to the list, they revert to the opposite ordering.

Roddy



I'd remove all text-direction modifiers.  They are the Unicode characters U+202A, U+202B, U+202C, U+202D, and U+202E.
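That suggestion can be sketched as follows (the helper name is mine, and including the isolate controls U+2066 through U+2069 is an extra assumption beyond the characters named above):

```javascript
// Sketch: strip Unicode directional formatting characters from a phrase.
// U+202A-U+202E are the embedding/override controls; U+2066-U+2069 are
// the isolate controls, included here as an extra assumption since spam
// can carry those as well.
function stripBidiControls(phrase) {
    return phrase.replace(/[\u202A-\u202E\u2066-\u2069]/g, '');
}

console.log(stripBidiControls('\u202Bperson and number\u202C')); // "person and number"
```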


Yes, I can cleanse at the point of input, but how do I write the regex for my two examples:

I'm not sure why you're trying to filter out all apostrophes in the first place, because your problem is not apostrophes; but an obvious pattern would be an apostrophe that does not follow a letter.
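The suggested pattern can be sketched like this (the helper name is my own; it uses a lookbehind, which requires a modern JavaScript engine):

```javascript
// Sketch of the suggestion: remove an apostrophe only when it does not
// follow a letter, so possessives such as "friend's" and "friends'"
// survive while stray leading apostrophes are dropped.
function stripStrayApostrophes(phrase) {
    return phrase.replace(/(?<![A-Za-z])'/g, '');
}

console.log(stripStrayApostrophes("'person and happiness")); // "person and happiness"
console.log(stripStrayApostrophes("friends' ideas"));        // "friends' ideas"
console.log(stripStrayApostrophes("my friend's idea"));      // "my friend's idea"
```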

$.each(jsonData, function(key, object) {
    var searchItem = {};
    var nakedArabic = [];
    // Matches any character in the Arabic script blocks.
    var agex = /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD]/g;
    $.each(object, function(name, value) {
        if (name === 'searchKeyword') {
            if (agex.test(value)) {
                nakedArabic = value.match(agex);
                // Wrap the Arabic text in U+2067 (RIGHT-TO-LEFT ISOLATE) and
                // U+2069 (POP DIRECTIONAL ISOLATE) so the adjacent count is
                // not absorbed into the right-to-left text run.
                searchItem.target = '\u2067' + nakedArabic.join('') + '\u2069';
            } else {
                searchItem.target = value;
            }
        }
        if (name === 'searchCategory') {
            searchItem.category = value;
        }
        if (name === 'searchResultsCount') {
            searchItem.count = value;
        }
    });
    searchItems.push(searchItem);
});

At last, I have resolved this problem:

listItem:  Array [ "subject-verb pairs", 157 ]
listItem:  Array [ "subject-verb agreement", 157 ]
listItem:  Array [ "&quotperson and number&quot", 471 ]
listItem:  Array [ "person", 157 ]
listItem:  Array [ "´person and number", 157 ]
listItem:  Array [ "person and happiness", 314 ]
listItem:  Array [ "´person and happiness", 157 ]
listItem:  Array [ 141, "الدواعيمنفكرةقرامركابتِف" ]

with the following text wrapper:

searchItem.target = '\u2067' + nakedArabic.join('') + '\u2069';

After three weeks of study that led me down many new paths, I was finally able to identify the source of the problem: the numbers that followed the Arabic text were being read as part of the Arabic text run.  In order to prevent this from occurring it was necessary to isolate the Arabic text.  As I was dealing with JavaScript-generated strings, it was not possible to achieve this with normal HTML mark-up.  Whereupon, I discovered the Unicode directional formatting (bidi control) characters.

I wish the discovery had been as simple as the solution, but in the end I have emerged from the inquiry victorious and far better informed about Unicode, JavaScript, and bidirectional text.
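The working fix can be restated as a small helper (the function name is my own):

```javascript
// Wrap a right-to-left phrase in U+2067 (RIGHT-TO-LEFT ISOLATE) and
// U+2069 (POP DIRECTIONAL ISOLATE) so that neighbouring numbers are not
// pulled into the Arabic text run when the string is rendered.
function isolateRtl(phrase) {
    return '\u2067' + phrase + '\u2069';
}

var wrapped = isolateRtl('\u0627\u0644\u062f\u0648\u0627\u0639\u064a');
console.log(wrapped.charCodeAt(0).toString(16));                  // "2067"
console.log(wrapped.charCodeAt(wrapped.length - 1).toString(16)); // "2069"
```

The isolates affect only rendering; the stored string and the [phrase, count] array order are unchanged.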

Roddy

 

