Views Bulk Operations and Taxonomies – school of hard knocks

Views Bulk Operations is a great tool for making batched changes to large amounts of data. I have been using it for the last couple of days to add taxonomy terms to around 17,000 new nodes that I added recently. And I just learned something the hard way.

Take care if you make bulk changes to taxonomies. It is not too difficult to inadvertently add the same term to a vocabulary over and over this way. Vocabularies are meant to be updated dynamically: as users add new terms to content, the terms are added to the vocabulary. On an individual basis, the autocomplete or dropdown widgets ensure that an existing term is reused if supplied, instead of a new term being created for the SAME TERM. Creating duplicates is exactly what I did by mistake yesterday. And it was a mess.
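
If you suspect duplicates have already crept in, a query along these lines will surface them. This is only a sketch, assuming Drupal 7’s taxonomy_term_data table (Drupal 6 uses term_data), with the vid of 5 standing in for your vocabulary’s actual ID:

-- List term names that appear more than once in a single vocabulary.
-- vid = 5 is a placeholder for the vocabulary ID you are checking.
SELECT name, COUNT(*) AS copies
FROM taxonomy_term_data
WHERE vid = 5
GROUP BY name
HAVING COUNT(*) > 1;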

Because I didn’t approach the VBO operation properly, I added the same term over and over again, to the tune of 8,000 times. Yuck. When I realized what I had done, I used VBO Delete to remove the extraneous terms. But I didn’t consider the overall impact of such a move: when I deleted all the individual terms that way, I deleted the nodes as well. 8,000 of them. Ouch.

A quick call to the BEST HOSTING COMPANY EVER – Blackmesh – and a restore was underway; the lesson was absorbed with tail between legs. I lost a whole day’s work, yes. But I gained a lot of insight into how not to make this mistake again.

This is a fairly high-level outline of what I did:

  • Created a vocabulary for the content type
  • Created all the terms that I plan to use for this project
  • Added a term reference field on the content type, linked to the vocabulary
  • Used the autocomplete widget for the field
  • In the VBO view for this, chose the field on that content type that will hold the taxonomy term
  • When running the VBO, made sure to choose the predefined taxonomy term. Since I used the autocomplete widget, I would type the first letters and wait for the completed choice to appear. This ensured that the existing term was being used
  • There was one node that needed its own term. I ran the VBO against this one node and added a new term to make sure that I was correct about new terms being added automatically. I was. This also confirmed that previously I had added what the system thought was a new term every time the VBO changed a node
  • As I went through this, I checked the vocabulary to make sure that the number of terms stayed consistent with the original terms I had added (the query sketched below does the same check)
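
That last check is easy to run straight against the database. Here is a rough sketch, assuming Drupal 7’s taxonomy_vocabulary and taxonomy_term_data tables (Drupal 6 table names differ):

-- Count the terms in each vocabulary; if a number creeps up
-- between VBO runs, something is creating duplicate terms.
SELECT v.name AS vocabulary, COUNT(t.tid) AS term_count
FROM taxonomy_vocabulary v
LEFT JOIN taxonomy_term_data t ON t.vid = v.vid
GROUP BY v.vid, v.name;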

Drupal Site indexing – MySQL Errors, CRON timeouts

Since I recently dumped almost 17,000 new nodes into my DB in a relatively short time, I have been keeping a close eye on how the back end is responding. The main concern I have is the indexing process for the new data. I began receiving errors from MySQL during the indexing runs.

In the Drupal admin UI, check the number of nodes that are indexed per CRON run. In my case, I had it set to 500, the maximum. This was overkill, and I ended up lowering the value. I also increased the PHP memory limit, which you can check under Status report.

From the status report (excerpt):

  • PHP: 5.3.27 – OK
  • PHP extensions: Enabled – OK
  • PHP memory limit: 512M – OK
  • PHP register globals: Disabled – OK

In the end I simply had to tweak the settings. I ended up at 100 nodes per run, one run per hour, and the PHP memory allocation shown above. From the Search Options UI, you can see the status of the indexing: how fast it is going, how many nodes are done, etc.
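
If you would rather watch the progress from the database side, a rough count of what is left to index can be had with a query like this, assuming Drupal 7’s node and search_dataset tables (reindex is non-zero for content flagged to be indexed again):

-- Nodes that have never been indexed, or are flagged for reindexing.
SELECT COUNT(*) AS remaining
FROM node n
LEFT JOIN search_dataset d ON d.sid = n.nid AND d.type = 'node'
WHERE d.sid IS NULL OR d.reindex <> 0;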

VI Editor and Cron

One thing about the last post: to edit the cron file, you will need to use the VI editor, which can be a little tricky if you’ve never used it.

On CentOS, from the command line, type:

  • crontab -e
  • Shift-i for INSERT mode – you will see a confirmation at the bottom left.
  • Type the characters you need.
  • When done, hit Esc.
  • Shift-zz (two capital Zs) to save the file and exit. The command line will reappear with a message about the crontab.

That’s it. VI is a goofy tool, but it has been around forever and isn’t going anywhere, so learn it. At least the basics.

Aggregators, CRON Jobs and Drupal cleanup

This was a really involved project. If you use the core Aggregator module a lot, take a look. I depend on Aggregator more than anything right now and have had to do some serious work with it. Read on:

I have aggregator needs that core Aggregator doesn’t quite meet, but it does work pretty well. Here is what I collect:

  • 50+ feeds from various newspapers, culled hourly, resulting in several hundred articles per day.
  • Each RSS source is categorized (automatically, by default in Drupal) as z-Uncategorized, which corresponds to a CID (in the Drupal DB) of 22.
  • As the articles come in, I review and categorize them. I have a shortcut to the z-Uncategorized category of items. That gives me all the new items, regardless of source, in one place where I can categorize them quickly by clicking on the Categorize tab provided by core. I keep about 10% of the stories that come in.
  • Because the newspapers keep articles in their RSS feeds for a period of time beyond my control, those articles are re-added to Drupal’s DB whenever the feed is pulled, but now listed with two categories. There are then two entries for each of these stories with the same IID but a different CID. It looks like the screenshot below: the default z-Uncategorized category plus the Juvenile category that I chose before the feed was queried again.
  • Even though this looks like one record, it is really two different records in the tables. If I look at the aggregator_category_item table, I can see two records for the one IID: one with a CID of 22 (the default, z-Uncategorized) and the other with whatever I assigned it to (see the query sketch after the screenshot). So I can run a query and delete everything with category 22. But until the newspaper removes a story from THEIR feed, it continues to come through.
  • I perform a nightly cleanup where I delete all the 22s. This happens at night, when the papers are slow and I have already categorized all the new items.
  • Eventually (after a few days for most news sources) the stories drop out of the papers’ RSS feeds and stop being repopulated in Drupal with the default CID of 22. Then I am left with a nice single record in the category I assigned. By cleaning up every night, I get rid of stale 22s as each newspaper removes stories from its feed, and I don’t have to think about whether they still have them or not.

[Screenshot: one aggregator item listed with both the z-Uncategorized and Juvenile categories]
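
Before automating the cleanup, you can eyeball the doubled-up rows with something like the query below; the table and columns are the core Aggregator ones described above:

-- Items carrying more than one category: the default CID 22
-- plus whatever category I assigned by hand.
SELECT iid, GROUP_CONCAT(cid) AS cids
FROM aggregator_category_item
GROUP BY iid
HAVING COUNT(*) > 1;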

This is the cron job that I use to do the cleanup:

0 22 * * * /usr/bin/mysql --defaults-file="/home/xxxx/.my.cnf_cron" -e "DELETE FROM drupal.aggregator_category_item WHERE aggregator_category_item.cid = 22" >>/dev/null 2>&1

The .my.cnf_cron file contains the authentication information:

[client]
host=localhost
user=crondel
password=*****

The user and password belong to a MySQL-specific account that I created just for this job.

The 0 22 * * * means that it will run at 10 PM every night (EST, because that is the server’s time zone).

Here are the specific rights for the crondel account on the Drupal database, named drupal:

GRANT USAGE ON *.* TO 'crondel'@'localhost' IDENTIFIED BY PASSWORD '*6E52D2AA6010C379DE1AE3BC559E2416A9A5C513'
GRANT SELECT, DELETE ON `drupal`.`aggregator_category_item` TO 'crondel'@'localhost'

The account needs SELECT rights to evaluate the WHERE condition of the SQL statement, in addition to DELETE on the specific table in the DB.
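
A quick way to double-check that the account ended up with only those rights is to ask MySQL directly:

-- Show every privilege granted to the cleanup account.
SHOW GRANTS FOR 'crondel'@'localhost';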

You might ask, why not do all this with Feeds? Well, I did try. I spent quite a bit of time with it. But Feeds pulls each RSS item in as a node, and I could not figure out an easy way to categorize the hundreds of stories per day when they all arrive as nodes. Since this DB will eventually be huge, with 100k+ stories in a searchable archive, I think it may be easier to keep it this way. I just had to figure out what to do with the extra 22s, and this solution seems to work.

Ugh. This was a pain. If you want to know more about the subject, or if I have been unclear, let me know and I’ll try to clarify.