
Troubleshooting MySQL's MeCab Parser


iwato

Recommended Posts

BACKGROUND: I have built a custom search engine that works fine in English but fails in Japanese, despite confirmation from my host server that the Japanese MeCab parser was installed correctly. My own checks reveal the following:

1) SHOW CREATE TABLE:

FULLTEXT KEY search_newsletter (letter_title, letter_abstract, letter_body) /*!50100 WITH PARSER mecab */ ) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1

2) SHOW PLUGINS:

ngram | ACTIVE | FTPARSER | NULL              | GPL
mecab | ACTIVE | FTPARSER | libpluginmecab.so | GPL

IMPLEMENTATION

1) MYSQL Statement:

$sql ="SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC";

2) CUSTOM SEARCH ENGINE:

See under Local Search / Newsletters at https://www.grammarcaptive.com/overview.html

3) DOCUMENT SEARCHED:

See under Regular Updates / Newsletter / Archives / Japanese at https://www.grammarcaptive.com/overview.html

COMMENT: Neither PHP nor MySQL complains. Any Japanese word search that requires parsing simply returns nothing. For example, the word 日本語 can be searched and found, but it does not require any parsing to be retrieved. The search for any other Japanese word in the newsletter fails.

REQUEST: Any troubleshooting tips would be greatly appreciated.

Roddy


Thank you for the suggestion.  This is what I discovered.

The ORIGINAL

SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC

TEST ONE:  Insert a Japanese word that requires use of the Mecab parser to parse.

SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST (本質 IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST (本質 IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC

RESULT ONE:  Error - The Japanese word is treated as an unknown column.

TEST TWO:  Insert the same Japanese word delimited with single quotation marks.

SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST ('本質' IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST ('本質' IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC

RESULT TWO:  No error is generated.  Neither, however, is a match found.

TEST THREE:  Insert the same Japanese word delimited with double quotation marks.

SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC

RESULT THREE:  No error is generated.  Neither, however, is a match found.

Please advise further.

Roddy

 


RESULT ONE:  Error - The Japanese word is treated as an unknown column.

You always need to quote string data in a query, regardless of the language, plugins, or the way you're running the query.  If it's not quoted then MySQL thinks it is an identifier.

Although, your tests did not address my question.  I'm wondering about the differences in queries you run in your own PHP code, queries you run in phpMyAdmin, and queries you run in the mysql console.  This has nothing to do with quotation marks, other than the fact that you need to correctly quote data in a query.

 


32 minutes ago, justsomeguy said:

Although, your tests did not address my question.  I'm wondering about the differences in queries you run in your own PHP code, queries you run in phpMyAdmin, and queries you run in the mysql console.

The query that I ran in phpMyAdmin and the query that PHP runs when I make AJAX calls are the same.  The only difference is that I had to substitute an actual term for the ? placeholder and enter the quotation marks manually.

Roddy


I have just run TEST THREE above in the MySQL console.  It produces the same result as in the SQL menu of phpMyAdmin.

mysql> SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC;
Empty set (0.68 sec)

This would explain, in part, why no complaint is received from either PHP or MySQL: everything appears to be running fine -- it simply is not.

Roddy

p.s.  As far as the character encoding is concerned, the table default is set to latin1, but the columns in the table that are being searched are set to UTF-8.
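For reference, the per-column character sets can be confirmed directly from the server (a sketch using the table and column names from the queries above):

```sql
-- The Collation column reveals each column's character set,
-- e.g. utf8_general_ci implies the utf8 character set:
SHOW FULL COLUMNS FROM sevengates_letter;

-- Or ask information_schema about just the indexed columns:
SELECT column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_name = 'sevengates_letter'
  AND column_name IN ('letter_title', 'letter_abstract', 'letter_body');
```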


I don't know what you mean in this sentence, though:

Any Japanese word search that requires parsing simply returns nothing. For example, the word 日本語 can be searched and found, but it does not require any parsing to be retrieved. The search for any other Japanese word in the newsletter fails.

I don't know why some words need to be parsed and others don't, or maybe what you mean by parsing in that context.


Quote

I don't know why some words need to be parsed and others don't

In most languages, individual words can be sought and found because all the words in a text are separated by blank spaces.  For example, I have no trouble performing search-and-find operations in Arabic, Korean, or English.

Chinese and Japanese words, however, are not separated by blank spaces and are, as a result, impossible to distinguish from their neighbors by whitespace alone.  The ngram and MeCab parsers do not depend on blank spaces: ngram splits the text into fixed-length sequences of characters, while MeCab segments it into dictionary words.  Japanese words typically run from one to five characters, and one- and two-character words are the most common.

The word 日本語 (Japanese language) is returned because it appears with a space before and after it.  The word 本質, however, is surrounded by text on either side; it is not delimited by spaces.

Now, I just performed the additional MySQL command and discovered the following:

SHOW STATUS LIKE 'mecab_charset';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| mecab_charset | utf8  |
+---------------+-------+
1 row in set (0.03 sec)

This is additional proof that there is nothing wrong with my installation.
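A further check along these lines would be to look at the tokens the full-text index actually contains.  InnoDB exposes them through INFORMATION_SCHEMA once the innodb_ft_aux_table variable points at the table (a sketch only; the database name sevengates is an assumption, and setting a global variable requires a privileged account):

```sql
-- Point the diagnostic tables at the indexed table
-- (the format is 'database_name/table_name'):
SET GLOBAL innodb_ft_aux_table = 'sevengates/sevengates_letter';

-- Tokens from rows indexed since the last OPTIMIZE TABLE:
SELECT word, doc_count
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE LIMIT 20;

-- Tokens in the main index (run OPTIMIZE TABLE first if this is empty):
SELECT word, doc_count
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 20;
```

If 本質 never appears among the tokens, the text was indexed without MeCab segmentation, which would point at the index or the data's encoding rather than the plugin itself.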

My intuition suggests that the problem may lie with the DEFAULT CHARSET of my data table, which is set to latin1.   This said, changing the character set of a table can be problematic, and the columns themselves are correctly set to UTF-8.  Thus, before proceeding I would like to ascertain the character-set requirements for an InnoDB table when using the MeCab parser.  These, however, I have been unable to find.
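For what it is worth, the MySQL documentation for the MeCab plugin indicates that the indexed columns should use a character set matching mecab_charset (utf8 or utf8mb4 when mecab_charset reports utf8).  If the table's charset did need to change, the conversion could be sketched like this (untested against this data; take a backup first, since CONVERT TO rewrites every text column):

```sql
-- Change only the table default (existing columns keep their charsets):
ALTER TABLE sevengates_letter DEFAULT CHARACTER SET utf8mb4;

-- Or convert the table and every text column in one pass:
ALTER TABLE sevengates_letter CONVERT TO CHARACTER SET utf8mb4;

-- Rebuild the full-text index afterwards so the text is re-tokenized:
ALTER TABLE sevengates_letter DROP INDEX search_newsletter;
ALTER TABLE sevengates_letter
  ADD FULLTEXT INDEX search_newsletter (letter_title, letter_abstract, letter_body)
  WITH PARSER mecab;
```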

Roddy

 

 


This is additional proof that there is nothing wrong with my installation.

If you want to start removing possibilities you can do that, but if I saw something that wasn't making sense I wouldn't rush to say that certain things are definitely not the cause.  You have indications that it is installed correctly, but I wouldn't switch from "indication" to "proof" too quickly, especially considering the lack of success of some of your past assumptions.  That being said, maybe everything is fine and the plugin just doesn't work.  I haven't been watching you do everything so I can't say what is or is not likely.

My intuition suggests that the problem may lie with the DEFAULT CHARSET of my data table, which is set to latin1.

So, fix it.  Create a second table that is set to the correct character set.  Export the data from one table, and then import into the other table.  Or, even better, start with clean data and only add your test data.  No reason to add a possibility of incorrect data import.  The fewer variables the better.  This is why people say to create a minimal example showing the problem.  The more variables you can eliminate, the easier it will be to figure out.  If you have something you suspect might be the problem that you want to test, then test it.  Create a test case and test it.  Eliminate other variables.  In fact, you should start with the absolute minimal setup required and make sure that works.  If it doesn't work at all then it's either not installed correctly or it doesn't work.
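A minimal setup might look something like this (a sketch only; the table name and sample sentence are made up for the test):

```sql
-- Throwaway table with nothing but a mecab full-text index:
CREATE TABLE mecab_test (
  id INT AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

ALTER TABLE mecab_test ADD FULLTEXT INDEX ft_body (body) WITH PARSER mecab;

-- One row of running Japanese text with no spaces around 本質:
INSERT INTO mecab_test (body) VALUES ('言語の本質を理解することが大切です。');

-- If the parser segments the text correctly, this returns the row:
SELECT id, MATCH (body) AGAINST ('本質' IN NATURAL LANGUAGE MODE) AS score
FROM mecab_test
WHERE MATCH (body) AGAINST ('本質' IN NATURAL LANGUAGE MODE);
```

If that returns nothing on a clean utf8mb4 table, the plugin itself is the problem; if it works, the difference lies in your real table.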

Thus, before proceeding I would like to ascertain the character-set requirements for an InnoDB table when using the MeCab parser.  These, however, I have been unable to find.

This is another reason to contact the developers or their community, although I would be surprised if it was anything other than UTF-8.


5 months later...

So, in the end, the table was corrupted, and I had no backup. Alas, I reconstructed the entire table with UTF-8 encoding.  The result was nothing short of spectacular.  I now have an intact MeCab parser that parses Japanese.  Not only this, but the MeCab parser does not interfere with the parsing of French, English, and German, or likely any other language that uses blank spaces to separate words.  Arabic will be my next challenge, but I am fairly confident that it will work just like English, with no additional encoding required.

Roddy

