Additional Drupal Search Resources

I am really surprised at the dearth of documentation on Search options/modules on Drupal.org. So I am compiling a running list of sites that have actual details on what you can do with Search and how best to do it.

https://drupal.org/node/343467 – Probably the best place for docs on Solr and Drupal, and all the mods that go with them.

http://envisioninteractive.com/drupal/drupal-7-views-with-faceted-filters-without-apachesolr/ – Great one for those who want to use the out-of-the-box search_db backend.

http://www.acquia.com/blog/simple-guide-install-apache-solr-3x-drupal-7 – One for Solr – which I believe I am going to end up using.

http://www.lullabot.com/blog/article/installing-solr-use-drupal – Installing Solr. GREAT TUTORIAL from LULLABOT

http://xmodulo.com/2013/02/how-to-install-apache-tomcat-on-centos.html – Installing Apache Tomcat for Solr on CentOS

http://www.mkyong.com/tomcat/how-to-check-tomcat-version-installed/ – Tomcat version

http://quark.humbug.org.au/publications/notes/bofh/msg00027.html – Tomcat authentication

http://wiki.apache.org/solr/SolrTomcat – troubleshooting Solr

http://zugec.com/73-how-setup-search-api-apache-solr – Installing Solr, with Drupal info

Parse File Attachments with Apache Tika

I have been up to my neck in various Drupal search modules/configs/nightmare scenarios for almost a month now. But since Google has set the bar as high as they have, search must be easy, fast and accurate.

If you have the resources, an Apache Solr server is probably the way to go. But if you don’t have the infrastructure for that, there are still a lot of options out there. After working with the Search Files search mods and the native Drupal search, I have decided to go with the Search API module and some of its submods, specifically the Search API Attachments mod. This will allow me to parse attached documents in several different formats, including PDF, the one that I am mainly concerned with.

To parse attachments, you have to have some sort of helper app installed. In the case of the Search API Attachments module, that would be Apache Tika. Tika is also needed for Solr Server, so this doc serves that end as well.

Here are some of the prerequisites:

  1. Java 1.6 – http://xmodulo.com/2012/05/how-to-install-java-16-in-linux.html
  2. Apache Maven – http://xmodulo.com/2012/05/how-to-install-maven-on-centos.html
  3. Tika – the source from which the .jar will be compiled – http://tika.apache.org/0.7/gettingstarted.html
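Before building anything, it is worth confirming that the first two prerequisites are actually on the PATH. Here is a minimal sketch of such a check; the `check_tool` helper is my own illustrative name, not something that ships with Java, Maven, or Tika.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command
# is available on the PATH and returns non-zero when it is not.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
        return 0
    else
        echo "$1: MISSING"
        return 1
    fi
}

# Java and Maven are the two tools the Tika build depends on.
for tool in java mvn; do
    check_tool "$tool" || echo "  install $tool before building Tika"
done
```

If both lines say "found", `java -version` and `mvn -version` will tell you which versions you ended up with.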

Once you have the prereqs installed, run this from the root directory of the Tika source files:

mvn install

This will run for about three minutes and compile the .jar file that you’ll need.
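To confirm the build produced the jar, you can look for it under `tika-app/target`. The sketch below assumes the build layout matches the test command used later in this post (`tika-app/target/tika-app-<version>.jar`); `find_tika_jar` and the `TIKA_SRC` variable are hypothetical names I made up for illustration.

```shell
#!/bin/sh
# find_tika_jar is a hypothetical helper.
# $1: root of the Tika source tree (where mvn install was run).
# Prints the first runnable tika-app jar it finds, or returns 1.
find_tika_jar() {
    for candidate in "$1"/tika-app/target/tika-app-*.jar; do
        case "$candidate" in
            # Skip source and javadoc jars if the build produced any.
            *-sources.jar|*-javadoc.jar) continue ;;
        esac
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# TIKA_SRC is an assumed variable; it defaults to the current directory.
TIKA_SRC=${TIKA_SRC:-.}
if jar=$(find_tika_jar "$TIKA_SRC"); then
    echo "Built jar: $jar"
else
    echo "No tika-app jar under $TIKA_SRC; did mvn install finish?"
fi
```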

http://www.acquia.com/fr/blog/use-apache-solr-search-files – read this great article on how to install Tika specifically for Drupal.

Once the compile is complete, you can test the parser by running this command:

java -jar ./tika-app/target/tika-app-1.6.jar -t /var/www/html/sites/all/pdfs/test.pdf

**NOTE: THIS PATH REFERENCES THE LAYOUT OF MY SERVER, SO ADJUST IT FOR YOURS**

If all the text from the PDF goes scrolling past your screen, you have everything installed correctly from the Tika/OS standpoint.
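If you would rather not eyeball the scrolling text, the same check can be scripted by counting the words Tika extracts. This is only a sketch: `count_words`, `TIKA_JAR`, and `PDF` are names I picked for illustration, and the default paths are the ones from my server above.

```shell
#!/bin/sh
# count_words is a hypothetical helper: it prints the number of
# whitespace-separated words read from standard input.
count_words() {
    wc -w | tr -d ' '
}

# Assumed defaults, taken from the command above; override as needed.
TIKA_JAR=${TIKA_JAR:-./tika-app/target/tika-app-1.6.jar}
PDF=${PDF:-/var/www/html/sites/all/pdfs/test.pdf}

if [ -f "$TIKA_JAR" ] && [ -f "$PDF" ]; then
    # -t asks tika-app for plain-text output.
    words=$(java -jar "$TIKA_JAR" -t "$PDF" | count_words)
    if [ "$words" -gt 0 ]; then
        echo "OK: extracted $words words from $PDF"
    else
        echo "No text extracted from $PDF; check the Java/Tika install"
    fi
else
    echo "Skipping: set TIKA_JAR and PDF to files that exist on your server"
fi
```

A word count of zero can also mean the PDF is a scanned image with no text layer, so try a second file before blaming the install.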

Now, configure Drupal to use the Tika install and you’ll be rolling.