iwato Posted December 2, 2018 Share Posted December 2, 2018 0 down vote favorite BACKGROUND: I have built a custom search engine that works fine in English, but fails in Japanese, this despite confirmation from my host server that I have performed the installation of the Japanese mecab parser correctly. My own checks reveal the following: 1) SHOW CREATE TABLE: FULLTEXT KEY search_newsletter (letter_title, letter_abstract, letter_body) /*!50100 WITH PARSER mecab */ ) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1 2) SHOW PLUGINS: ngram | ACTIVE | FTPARSER | NULL | GPL | mecab | ACTIVE | FTPARSER | libpluginmecab.so | GPL IMPLEMENTATION 1) MYSQL Statement: $sql ="SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC"; 2) CUSTOM SEARCH ENGINE: See under Local Search / Newsletters at https://www.grammarcaptive.com/overview.html 3) DOCUMENT SEARCHED: See under Regular Updates / Newsletter / Archives / Japanese at https://www.grammarcaptive.com/overview.html COMMENT: Neither PHP, nor MySQL complains. Simply any Japanese word search that needs to be parsed is not returned. For example, the word 日本語 can be search and found, but does not require any parsing to be retrieved. The search for any other Japanese word in the newsletter fails. REQUEST: Any troubleshooting tips would be greatly appreciated. Roddy Link to comment Share on other sites More sharing options...
justsomeguy Posted December 3, 2018 Share Posted December 3, 2018 Are the results the same in phpMyAdmin or if you run the queries in the mysql console? Link to comment Share on other sites More sharing options...
iwato Posted December 3, 2018 Author Share Posted December 3, 2018 (edited) Thank you for the suggestion. This is what I discovered. The ORIGINAL SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC TEST ONE: Insert a Japanese word that requires use of the Mecab parser to parse. SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST (本質 IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST (本質 IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC RESULT ONE: Error - The Japanese word is recognized as an unidentified field or column. TEST TWO: Insert the same Japanese word delimited with single quotation marks. SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST ('本質' IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST ('本質' IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC RESULT TWO: No error is generated. Neither, however, is a match found. TEST THREE: Insert the same Japanese word delimited with double quotation marks. SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC RESULT THREE: No error is generated. Neither, however, is a match found. Please advise further. Roddy Edited December 3, 2018 by iwato Link to comment Share on other sites More sharing options...
justsomeguy Posted December 3, 2018 Share Posted December 3, 2018 RESULT ONE: Error - The Japanese word is recognized as an unidentified field or column. You always need to quote string data in a query, regardless of the language, plugins, or the way you're running the query. If it's not quoted then MySQL thinks it is an identifier. Although, your tests did not address my question. I'm wondering about the differences in queries you run in your own PHP code, queries you run in phpMyAdmin, and queries you run in the mysql console. This has nothing to do with quotation marks, other than the fact that you need to correctly quote data in a query. Link to comment Share on other sites More sharing options...
iwato Posted December 3, 2018 Author Share Posted December 3, 2018 (edited) 32 minutes ago, justsomeguy said: Although, your tests did not address my question. I'm wondering about the differences in queries you run in your own PHP code, queries you run in phpMyAdmin, and queries you run in the mysql console. The code that I ran with phpMyAdmin and the code that I run with PHP when I make AJAX calls are the same. Simply, I had to substitute the ? mark with an actual term and was compelled to enter the quotation marks manually. Roddy Edited December 3, 2018 by iwato Link to comment Share on other sites More sharing options...
justsomeguy Posted December 3, 2018 Share Posted December 3, 2018 So you're not seeing any difference in the queries that your own code run versus the queries you run in other programs? The point is to test character set issues in your own code. Link to comment Share on other sites More sharing options...
iwato Posted December 3, 2018 Author Share Posted December 3, 2018 (edited) I have just run TEST THREE above in the MySQL console. It produces the same result as in the SQL menu of phpMyAdmin. mysql> SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date, MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) AS letter_score FROM sevengates_letter WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST ("本質" IN NATURAL LANGUAGE MODE) ORDER BY letter_score DESC; Empty set (0.68 sec) This would explain why, in part, that no complaint is received from either PHP or MySQL, everything appears to be running fine -- simply it is not. Roddy p.s. As far as the character code is concerned the table is set to Latin, but the columns/fields in the table that are being search are set to UTF-8. Edited December 3, 2018 by iwato Link to comment Share on other sites More sharing options...
justsomeguy Posted December 3, 2018 Share Posted December 3, 2018 If you're absolutely sure that query should return something, I would follow up with the plugin developers or community. Link to comment Share on other sites More sharing options...
justsomeguy Posted December 3, 2018 Share Posted December 3, 2018 I don't know what you mean in this sentence, though: Simply any Japanese word search that needs to be parsed is not returned. For example, the word 日本語 can be search and found, but does not require any parsing to be retrieved. The search for any other Japanese word in the newsletter fails. I don't know why some words need to be parsed and others don't, or maybe what you mean by parsing in that context. Link to comment Share on other sites More sharing options...
iwato Posted December 3, 2018 Author Share Posted December 3, 2018 Quote I don't know why some words need to be parsed and others don't In most languages individual words can be sought and found because all words in a text are separated by blank spaces. For example, I have no trouble performing search and find operations for Arabic, Korean, or English. Chinese and Japanese words, however, are not separated by blank spaces and are impossible, as a result, to distinguish from their neighbors. The ngram and mecab parsers do not depend on blank spaces to perform their search, rather they search according to the number of bytes that typically make up Japanese and Chinese words: one, two, three, four, and at most five characters, generally speaking. One and two character words are the more common. The word 日本語 (Japanese language) is returned because it appears with a space before and after. The word 本質, however, is surrounded by text on either side. It is not delimited by spaces. Now, I just performed the additional MySQL command and discovered the following: SHOW STATUS LIKE 'mecab_charset'; +---------------+-------+ | Variable_name | Value | +---------------+-------+ | mecab_charset | utf8 | +---------------+-------+ 1 row in set (0.03 sec) This is additional proof that there is nothing wrong with my installation. My intuition suggests that the problem may lie with the DEFAULT CHARSET of my data table that is set to Latin. This said, changing the character set of a table can be problematic, and the columns themselves are correctly set to UTF-8. Thus, before proceeding I would like to ascertain the character set requirements for the InnoDB MySQL table when using the mecab parser. This, however, I have been unable to find. Roddy Link to comment Share on other sites More sharing options...
justsomeguy Posted December 4, 2018 Share Posted December 4, 2018 This is additional proof that there is nothing wrong with my installation. If you want to start removing possibilities you can do that, but if I saw something that wasn't making sense I wouldn't rush to say that certain things are definitely not the cause. You have indications that it is installed correctly, but I wouldn't switch from "indication" to "proof" too quickly, especially considering the lack of success of some of your past assumptions. That being said, maybe everything is fine and the plugin just doesn't work. I haven't been watching you do everything so I can't say what is or is not likely. My intuition suggests that the problem may lie with the DEFAULT CHARSET of my data table that is set to Latin. So, fix it. Create a second table that is set to the correct character set. Export the data from one table, and then import into the other table. Or, even better, start with clean data and only add your test data. No reason to add a possibility of incorrect data import. The fewer variables the better. This is why people say to create a minimal example showing the problem. The more variables you can eliminate, the easier it will be to figure out. If you have something you suspect might be the problem that you want to test, then test it. Create a test case and test it. Eliminate other variables. In fact, you should start with the absolute minimal setup required and make sure that works. If it doesn't work at all then it's either not installed correctly or it doesn't work. Thus, before proceeding I would like to ascertain the character set requirements for the InnoDB MySQL table when using the mecab parser. This, however, I have been unable to find. This is another reason to contact the developers or their community, although I would be surprised if it was anything other than UTF-8. 1 Link to comment Share on other sites More sharing options...
iwato Posted May 12, 2019 Author Share Posted May 12, 2019 So, in the end, the table was corrupted, and I had no back up. Alas, I reconstructed the entire table with UTF-8 encoding. The result was nothing but spectacular. I now have an in-tact Mecab parser that parses Japanese. Not only this, but the Mecab parser does not interfere with the parsing of French, English, and German., or likely any other language that using blank spaces to separate words. Arabic will be my next challenge, but I am fairly confident that it will work just like English with no additional encoding required. Roccy 1 Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now