Days 4 through 7: Long Live The Query Service!

This may be the last of the technical setup blogs for a while, fingers crossed.

I am pleased to report that I did in fact manage to get the Wikidata Query Service running for wikibase.plantdata.io, using addshore’s docker setup. The Query Service runs in docker; the associated wikibase instance does not.

Here’s a link with a sample query, which I assure you is extremely exciting.

This query service is the thing that will allow us to pull data out of the plantdata project, and while collaboratively organizing giant piles of data is loads of fun (yes, really!), data going into something isn’t any good unless you can get it out again. Having a working query service is extremely exciting, because now we can get the data out.

Setting this up was not easy. Fortunately, I had plenty of leave time to burn with nothing else to do, and some convenient insider-access to a couple of wonderfully generous people who do this work for the main Wikidata project. I remain convinced that without both an unrealistic abundance of time and personal connections to lean on, I never would have gotten this done unless someone was paying me to play around with it.

Hopefully, the following takeaways and ‘gotchas’ I ran into will help make it easier for the next round of people who would like to explore this kind of setup.

Recent Changes Feed Expiration

This one goes first, because this issue stands the biggest chance of being bad news for pre-existing wikibase instances that want to start using the query service after a lot of data has already been established in the system. It’s important to understand that the default way for the Wikidata Query Service to get its updates is more or less continuously, from the Recent Changes feed on your wikibase instance. If you are ever hoping to use the Query Service, the easiest way to make sure it can catch all the changes you make is to dramatically increase the amount of time those changes are kept on the feed before the feed is pruned.

With that in mind, I added this to my LocalSettings:

// Keep recent changes around for a very long time before they're pruned.
// The value is in seconds. So... 5 years * 365 * 24 * 3600 =
$wgRCMaxAge = 157680000;

https://www.mediawiki.org/wiki/Manual:$wgRCMaxAge

If you’re running a wikibase instance and have a lot of data that has already been pruned from your feed, you do have the option to go through what looks like a lengthy process of dumping your wikibase data, reformatting it, and importing it to the query service. The wikibase dump happens in a maintenance script, but I haven’t found any documentation better than looking at the files – and the reformatting and loading process looks pretty confusing too. With any luck, that Recent Changes MaxAge setting will mean I won’t need to look at any of that directly, at least for a while.

Additionally, and somewhat sneakily: I have reason to suspect that if you just… lightly edit your items to get them mentioned in your recent changes feed again, the query service will pick up the whole item and not just the specific thing about it that you changed. If you do some testing around something like this to avoid the dump/upload process, please do let me know whether it works.

HTTPS: Not Yet

I tried to start plantdata.io out defaulting to a secure connection for everything. Unfortunately, that decision made it functionally impossible to use the Query Service, in about a half dozen distinct ways. I understand that some patches are coming to address the issues I found with the query service and docker implementation, and indeed some patches have already been merged to improve the situation. However, there are still barriers that prevent the query service from getting any of my object or property data. In the end, in the interests of getting this off the ground in a timely manner, I decided to change the config to allow both http and https. Surprisingly, even that didn’t do it: One problem only went away after I removed the https from the $wgServer variable in LocalSettings. I’ll happily move back to HTTPS throughout the system later, but for now: http or give up.
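The $wgServer part of that, concretely, ended up as a one-line change in LocalSettings, something like this (with your own hostname, obviously):

## Plain http for now; https here was one of the things keeping the query service from reaching my data.
$wgServer = "http://wikibase.plantdata.io";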

.htaccess and other ShortURL changes

I usually don’t bother with rewriting URLs in personal mediawiki installs, but if you want the query service to be able to talk to your wikibase instance, it’s a requirement. There are a few rewritten URLs that have to work where the query service expects them to be.

Recent changes feed, accessed by API:
https://wikibase.plantdata.io/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title%7cids%7ctimestamp&rcnamespace=120%7c122&rclimit=100&continue=&rcstart=20170301000000

Object’s entity data:
wikibase.plantdata.io/wiki/Special:EntityData/Q4.ttl

Wikidata example for comparison:
https://www.wikidata.org/wiki/Special:EntityData/Q4.ttl

To get all this to work, I had to do three things:

  1. Move my mediawiki installation out of webroot, and into a /w/ directory.
  2. Steal the rewrite lines (so, everything) from the .htaccess file currently being used in the wikibase docker container (the core of those rules is sketched below): https://github.com/wmde/wikibase-docker/blob/master/wikibase/1.30/htaccess
  3. Add a couple things to LocalSettings, as per the ShortURL manual on mediawiki.org
## https://www.mediawiki.org/wiki/Manual:Short_URL
$wgScriptPath = "/w";        // this should already have been configured this way
$wgArticlePath = "/wiki/$1";
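For reference, the heart of the rewrite rules in that file is the standard MediaWiki short-URL pattern, which looks roughly like this; if my sketch and the linked .htaccess ever disagree, trust the linked file:

# .htaccess in the webroot: send /wiki/Whatever (and the bare domain) to the real entry point in /w/.
RewriteEngine On
RewriteRule ^/?wiki(/.*)?$ %{DOCUMENT_ROOT}/w/index.php [L]
RewriteRule ^/*$ %{DOCUMENT_ROOT}/w/index.php [L]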

Once those URLs work for your own site, you’re ready to fire up the docker containers.

WDQS Docker and Config

I’m using the docker images for the wikidata query service, here:
https://github.com/wmde/wikibase-docker/

This is the docker-compose.yml file that got everything to work for my instance.
https://phabricator.wikimedia.org/P6968

Once you have cloned the wikibase docker repo somewhere you’ve installed docker, replace the docker-compose file with something like mine, point it to your own instance everywhere it says WIKIBASE_HOST, and bring up the containers.
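To make “point it to your own instance” concrete: wherever the compose file sets a WIKIBASE_HOST environment variable, it should name your wiki rather than the example default. This is an illustrative fragment only, not the whole file; the Phabricator paste above is the real thing:

    environment:
      - WIKIBASE_HOST=wikibase.plantdata.io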

Note from a previous day: The Wikidata Query Service is an unbelievable memory hog. I couldn’t get the wdqs container to start properly until I sprang for a machine with 8 GB of memory, up from 4. So, if blazegraph just won’t start, consider… feeding it.

If you’ve never used docker before, you will eventually find that you need to be a little aggressive about removing containers sometimes. This is particularly true if you change things in the compose file, or one or more of your containers writes to a volume. You’ll want to remove everything you can before bringing the new stuff up, and make sure that the new containers don’t use some cached version of something you thought you removed.

This sequence of events is the one I’ve come to prefer, even if it does take a little longer:

docker-compose down --volumes
docker-compose rm -f

But that doesn’t actually take care of the volumes all the time. Always check:

docker volume ls

And if that returns anything:

docker volume rm volumename

Now, you can bring things back up.

docker-compose pull
docker-compose build --no-cache
docker-compose up

If you’re not particularly interested in the output, run that last up command with a -d to run it in the background.

If there’s a reasonable way to be more aggressive about making sure you’re not dealing with hangover bugs from a previous docker-compose command, I haven’t found it yet.

Reaching Into The Past

The containerized wdqs is configured not to reach very far into the past when importing your recent changes. If you already have data waiting, you probably want it to go back farther than it will by default.

So, in the same directory where you’ve just run all the docker-compose commands, while your containers are up, try something like this:

docker exec wikibasedocker_wdqs_1 ./runUpdate.sh -h http://wdqs.svc:9999 -- --wikibaseHost wikibase.plantdata.io --wikibaseScheme http --entityNamespaces 120,122 -s 20170301000000 --init

The host should again be yours, and the timestamp after -s is just four digits of year, followed by two digits each for month and day. Then, I assume, hours, minutes, and seconds, but I didn’t need to get terribly precise with that. Like the regular updater script, it will keep running every ten seconds until you kill it. You can probably kill it after one pass.
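Once that pass has run, one way to check whether anything actually landed is to hit the SPARQL endpoint directly with curl. The port and path here are assumptions based on my docker setup; point it at wherever your endpoint is actually exposed:

# Count the triples in the store; anything above zero means the import did something.
curl -G 'http://localhost:8989/bigdata/namespace/wdq/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }'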

Try Out Some Queries!

This is a good start that will let you know if anything landed that looks familiar:

SELECT * WHERE { ?x ?y ?z } LIMIT 50
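If that comes back with triples that look like yours, something slightly more readable is a nice next step. This one pulls English labels, leaning on the rdfs prefix the query service declares for you:

# The first 50 English labels the service knows about.
SELECT ?item ?itemLabel WHERE {
  ?item rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "en")
}
LIMIT 50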

“But,” you may say, “the typeahead on the query helper isn’t working!” Read on…

Unresolved Problems

The Query Service frontend makes some calls to the wikibase api endpoint expecting CirrusSearch to be installed and enabled, tacking a useCirrus parameter onto its wbsearchentities requests. Unfortunately, with that parameter present the calls don’t work on my instance, and those calls are what make the typeahead work in the query helper. I thought momentarily about spinning up an ElasticSearch server, but… no, I’ll probably just go hack something into my instance of mediawiki while I’m waiting for the wdqs patch to make it all the way to the docker image (because there is already a patch!). I would definitely like the helper working: I can use all the SPARQL help I can get at the moment.


Here’s the URL it’s currently trying to use:
http://wikibase.plantdata.io/w/api.php?action=wbsearchentities&format=json&limit=50&continue=0&language=en&uselang=en&useCirrus=1&search=species&type=property&callback=jQuery331001893678981134339_1523055846377&_=1523055846394


And, the same thing without the useCirrus parameter:
http://wikibase.plantdata.io/w/api.php?action=wbsearchentities&format=json&limit=50&continue=0&language=en&uselang=en&search=species&type=property&callback=jQuery331001893678981134339_1523055846377&_=1523055846394
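For what it’s worth, the “hack something into my instance” idea currently amounts to the untested sketch below: strip the offending parameter out of the incoming request in LocalSettings, so wbsearchentities behaves as if the frontend had never sent it. It’s an assumption on my part that this runs early enough to matter, and it should come straight back out once the frontend patch reaches the docker image:

## Untested stopgap: drop useCirrus from the incoming request so wbsearchentities
## falls back to its default (non-Cirrus) search.
if ( isset( $_GET['useCirrus'] ) ) {
    unset( $_GET['useCirrus'], $_REQUEST['useCirrus'] );
}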

Relatively small thing, considering.

From here on out, the plan appears to be a lot more fun:

  • Start populating data in the real plantdata wikibase instance (Data In)
  • Get comfortable with SPARQL and write some sample queries (Data Out)
  • Start thinking about what we want on the main plantdata.io site: standard ways to search for plants and environmental data in a friendly interface, maybe a lightweight editor to fill holes in the data, that kind of thing.

Major thanks to addshore and Stas for being available for troubleshooting, dispensing expert advice, and occasional patch writing throughout this process.
