Initial Thoughts on Data, a Start on a New Importer

Clearly, I haven’t been keeping up with the blog on a daily basis as I initially intended. This isn’t because I’m stuck, or not working on things; it’s more that I don’t want to stop working on the actual problems in order to distract myself with blog posts. This probably has something to do with the fact that I’ve finally gotten to the part where I get to do some coding, and, well, I get in the zone. So, from here on out, I’m going to aim for an update every week or so.

The last seven project days have largely been spent poring over plant species items in Wikidata, making some decisions about what I want to keep on the plantdata instance of wikibase, and getting a start on writing a wikibase data importer tool.

Data Thoughts

As a person who has spent many years designing and modifying data systems, I have a horrible allergic reaction whenever I see too much data duplicated when a reference would work. So, how much duplication is too much? In my experience: Most of it, unless not duplicating is preventing cool queries from being possible. For that reason, I’ve decided not to import much at all in the way of species-level property data from Wikidata. At the moment, the plan is to pull in five properties and the item ID for all plant species, which is something like 471,000 Wikidata items.

In the process of making that decision, I created a spreadsheet of common properties used on Wikidata plants, and what I wanted to do with them initially. Due to volume and my aforementioned allergic reaction to data duplication, I will not be pulling in any of the external db reference IDs, but wow: Some of them would make excellent sources for doing research and populating the data I do want to keep here. It would certainly be worthwhile to surface these as possible sources to reference when users discover and wish to fill in holes in the data.

Regarding the data I do want to keep: In another tab, I worked out what to store on plantdata.io for every species, variety, and cultivar, roughly grouped by something that’s almost (but not quite) use case.

A New Data Importer

Somewhere in there, I also got started writing the plantdata-importer script: A new command-line tool to import .tsv files to the wikibase instance of your choice. “Started” is the key word there, though: At present, it doesn’t do anything beyond verifying that it can do some simple item and property searches (to make sure you’re not about to duplicate thousands of items, for instance…), log in to the specified wikibase instance, and instantiate a WikibaseFactory object (defined in the wikibase-api library) which will eventually do the actual edits.
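
For the curious, the skeleton of that part looks roughly like the sketch below. To be clear: this is a hedged reconstruction based on the wikibase-api library’s documented setup, not the actual plantdata-importer code, and the bot credentials and the data value class list are placeholders.

<?php
// Rough sketch only, modeled on the wikibase-api (addwiki) README; not the
// real plantdata-importer. Credentials and the data value map are placeholders.
require_once __DIR__ . '/vendor/autoload.php';

use Mediawiki\Api\MediawikiApi;
use Mediawiki\Api\ApiUser;
use Wikibase\Api\WikibaseFactory;
use DataValues\Deserializers\DataValueDeserializer;
use DataValues\Serializers\DataValueSerializer;

// Log in to the target wikibase instance.
$api = new MediawikiApi( 'http://wikibase.plantdata.io/w/api.php' );
$api->login( new ApiUser( 'ImporterBot', 'bot-password' ) );

// The factory needs to know how to (de)serialize the data value types in use.
$factory = new WikibaseFactory(
    $api,
    new DataValueDeserializer( [
        'string'            => 'DataValues\StringValue',
        'monolingualtext'   => 'DataValues\MonolingualTextValue',
        'wikibase-entityid' => 'Wikibase\DataModel\Entity\EntityIdValue',
    ] ),
    new DataValueSerializer()
);

// Eventually, factory services like $factory->newRevisionSaver() will do the real edits.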

I’d very much like this tool to attempt some reasonable property mapping decisions around column headers in the import files, so that’s the part of the tool I’m focused on at the moment. As soon as this thing works, it’s import time.
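
To make that concrete, here’s one hypothetical way the mapping could start: run each column header through the standard wbsearchentities API action and print the candidate property IDs for a human to confirm before anything gets imported. This is a sketch of the idea, not the importer’s actual logic, and the column headers and hostname are just examples.

<?php
// Hypothetical sketch of header-to-property mapping, not the importer's actual
// logic: ask the wiki which properties match each TSV column header, and print
// the candidates for a human to confirm before any import happens.
require_once __DIR__ . '/vendor/autoload.php';

use Mediawiki\Api\MediawikiApi;
use Mediawiki\Api\SimpleRequest;

$api = new MediawikiApi( 'http://wikibase.plantdata.io/w/api.php' );

// Return candidate property IDs (e.g. [ 'P7', 'P12' ]) for one column header.
function findPropertyCandidates( MediawikiApi $api, $header ) {
    $result = $api->getRequest( new SimpleRequest( 'wbsearchentities', [
        'search'   => $header,
        'type'     => 'property',
        'language' => 'en',
        'limit'    => 5,
    ] ) );
    return array_map(
        function ( $match ) {
            return $match['id'];
        },
        isset( $result['search'] ) ? $result['search'] : []
    );
}

// Hypothetical column headers from a .tsv import file.
foreach ( [ 'taxon name', 'hardiness zone', 'soil ph' ] as $header ) {
    echo $header . ' => ' . implode( ', ', findPropertyCandidates( $api, $header ) ) . "\n";
}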

The plan:

  • Finish the plantdata-importer.
  • Start populating data in the real plantdata wikibase instance.
  • Get comfortable with SPARQL and write some sample queries.
  • Start thinking about what we want on the main plantdata.io site. Specific standard ways to search for plants and environmental data in a friendly interface, maybe a lightweight editor to fill holes in the data, that kind of thing.


Days 4 through 7: Long Live The Query Service!

This may be the last of the technical setup blogs for a while, fingers crossed.

I am pleased to report that I did in fact manage to get the Wikidata Query Service running for wikibase.plantdata.io, using addshore’s docker setup. The Query Service is in docker; the associated wikibase instance is not.

Here’s a link with a sample query, which I assure you is extremely exciting.

This query service is the thing that will allow us to pull data out of the plantdata project, and while collaboratively organizing giant piles of data is loads of fun (yes, really!), data going into something isn’t any good unless you can get it out again. Having a working query service is extremely exciting, because now we can get the data out.

Setting this up was not easy. Fortunately, I had plenty of leave time to burn with nothing else to do, and some convenient insider-access to a couple of wonderfully generous people who do this work for the main Wikidata project. I remain convinced that without both an unrealistic abundance of time and personal connections to lean on, I never would have gotten this done unless someone was paying me to play around with it.

Hopefully, the following takeaways and ‘gotchas’ I ran into will help make it easier for the next round of people who would like to explore this kind of setup.

Recent Changes Feed Expiration

This one goes first, because this issue stands the biggest chance of being bad news for pre-existing wikibase instances that want to start using the query service after a lot of data has already been established in the system. It’s important to understand that the default way for the Wikidata Query Service to get its updates is more or less continuously, from the Recent Changes feed on your wikibase instance. If you are ever hoping to use the Query Service, the easiest way to make sure it can catch all the changes you make is to dramatically increase the amount of time those changes are kept on the feed before the feed is pruned.

With that in mind, I added this to my LocalSettings:

//prevents recentChanges from being purged for a very long time.
//this is in seconds. So... 5 years * 365 * 24 * 3600 =
$wgRCMaxAge = 157680000;

https://www.mediawiki.org/wiki/Manual:$wgRCMaxAge

If you’re running a wikibase instance and have a lot of data that has already been pruned from your feed, you do have the option to go through what looks like a lengthy process of dumping your wikibase data, reformatting it, and importing it to the query service. The wikibase dump happens in a maintenance script, but I haven’t found any documentation better than looking at the files – and the reformatting and loading process looks pretty confusing too. With any luck, that Recent Changes MaxAge setting will mean I won’t need to look at any of that directly, at least for a while.

Additionally, and somewhat sneakily: I have reason to suspect that if you just… lightly edit your items to get them mentioned in your recent changes feed again, the query service will pick up the whole item and not just the specific thing about it that you change. If you do some testing around something like this to avoid the dump/upload process, please do let me know if it works or not.
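
For reference, here’s an untested sketch of what I mean by a “light edit”: bump an item’s description through the API so the item lands in the recent changes feed again. It uses the addwiki mediawiki-api library; the item ID, credentials, and description text are all placeholders, and whether this is enough to make the updater re-read the whole item is exactly the thing that needs testing.

<?php
// Untested sketch of a "light edit": bump an item's description so it shows up
// in recent changes again. Item ID, credentials, and wording are placeholders.
require_once __DIR__ . '/vendor/autoload.php';

use Mediawiki\Api\MediawikiApi;
use Mediawiki\Api\ApiUser;
use Mediawiki\Api\SimpleRequest;

$api = new MediawikiApi( 'http://wikibase.plantdata.io/w/api.php' );
$api->login( new ApiUser( 'TouchBot', 'bot-password' ) );

// wbsetdescription is a real edit, so the item should reappear in the feed.
$api->postRequest( new SimpleRequest( 'wbsetdescription', [
    'id'       => 'Q4',
    'language' => 'en',
    'value'    => 'lightly edited so the query service updater notices this item',
    'token'    => $api->getToken(),
] ) );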

HTTPS: Not Yet

I tried to start plantdata.io out defaulting to a secure connection for everything. Unfortunately, that decision made it functionally impossible to use the Query Service, in about a half dozen distinct ways. I understand that some patches are coming to address the issues I found with the query service and docker implementation, and indeed some patches have already been merged to improve the situation. However, there are still barriers that prevent the query service from getting any of my object or property data. In the end, in the interests of getting this off the ground in a timely manner, I decided to change the config to allow both http and https. Surprisingly, even that didn’t do it: One problem only went away after I removed the https from the $wgServer variable in LocalSettings. I’ll happily move back to HTTPS throughout the system later, but for now: http or give up.

.htaccess and other ShortURL changes

I usually don’t bother with rewriting URLs in personal mediawiki installs, but if you want the query service to be able to talk to your wikibase instance, it’s a requirement. There are a few rewritten URLs that have to work where the query service expects them to be.

Recent changes feed, accessed by API:
https://wikibase.plantdata.io/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title%7cids%7ctimestamp&rcnamespace=120%7c122&rclimit=100&continue=&rcstart=20170301000000

Object’s entity data:
wikibase.plantdata.io/wiki/Special:EntityData/Q4.ttl

Wikidata example for comparison:
https://www.wikidata.org/wiki/Special:EntityData/Q4.ttl

To get all this to work, I had to do three things:

  1. Move my mediawiki installation out of webroot, and into a /w/ directory.
  2. Steal the rewrite lines (so, everything) from the .htaccess file currently being used in the wikibase docker container: https://github.com/wmde/wikibase-docker/blob/master/wikibase/1.30/htaccess
  3. Add a couple of things to LocalSettings, as per the ShortURL manual on mediawiki.org:
## https://www.mediawiki.org/wiki/Manual:Short_URL
$wgScriptPath = "/w";        // this should already have been configured this way
$wgArticlePath = "/wiki/$1";

Once those URLs work for your own site, you’re ready to fire up the docker containers.
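
If you’d like a quick sanity check first, a throwaway script along these lines (the hostname and the Q4 example item are from my setup; swap in your own) will tell you whether both rewritten URL shapes resolve:

<?php
// Throwaway sanity check: do both rewritten URL shapes resolve? The hostname
// and the Q4 example item are placeholders for your own instance and data.
$urls = [
    // Recent changes via the API, under /w/
    'http://wikibase.plantdata.io/w/api.php?format=json&action=query&list=recentchanges&rclimit=1',
    // Entity data in turtle format, under /wiki/
    'http://wikibase.plantdata.io/wiki/Special:EntityData/Q4.ttl',
];

foreach ( $urls as $url ) {
    $headers = get_headers( $url );
    $ok = $headers && strpos( $headers[0], '200' ) !== false;
    echo ( $ok ? 'OK   ' : 'FAIL ' ) . $url . "\n";
}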

WDQS Docker and Config

I’m using the docker images for the Wikidata Query Service, here:
https://github.com/wmde/wikibase-docker/

This is the docker-compose.yml file that got everything to work for my instance.
https://phabricator.wikimedia.org/P6968

Once you have cloned the wikibase docker repo somewhere you’ve installed docker, replace the docker-compose file with something like mine, point it to your own instance everywhere it says WIKIBASE_HOST, and bring up the containers.

Note from a previous day: The Wikidata Query Service is an unbelievable memory hog. I couldn’t get the wdqs container to start properly until I sprang for a machine with 8 GB of memory, up from 4. So, if blazegraph just won’t start, consider… feeding it.

If you’ve never used docker before, you will eventually find that you sometimes need to be a little aggressive about removing containers. This is particularly true if you change things in the compose file, or if one or more of your containers writes to a volume. You’ll want to remove everything you can before bringing the new stuff up, to make sure the new containers don’t use some cached version of something you thought you removed.

This sequence of events is the one I’ve come to prefer, even if it does take a little longer:

docker-compose down --volumes
docker-compose rm -f

But that doesn’t actually take care of the volumes all the time. Always check:

docker volume ls

And if that returns anything:

docker volume rm volumename

Now, you can bring things back up.

docker-compose pull
docker-compose build --no-cache
docker-compose up

If you’re not particularly interested in the output, run that last up command with a -d to run it in the background.

If there’s a reasonable way to be more aggressive about making sure you’re not dealing with hangover bugs from a previous docker-compose command, I haven’t found it yet.

Reaching Into The Past

The containerized wdqs is configured not to reach very far back into the past when importing your recent changes. If you already have data ready to go, you probably want it to go farther back than it will by default.

So, in the same directory where you’ve just run all the docker-compose commands, while your containers are up, try something like this:

docker exec wikibasedocker_wdqs_1 ./runUpdate.sh -h http://wdqs.svc:9999 -- --wikibaseHost wikibase.plantdata.io --wikibaseScheme http --entityNamespaces 120,122 -s 20170301000000 --init

The host should again be yours, and the timestamp at the end is four digits of year, then two digits each for month, day, hours, minutes, and seconds (so 20170301000000 is midnight on March 1, 2017); I didn’t need to get terribly precise with that. Like the regular updater script, it will keep running every ten seconds until you kill it. You can probably kill it after the first pass.

Try Out Some Queries!

This is a good start that will let you know if anything landed that looks familiar:

SELECT * WHERE { ?x ?y ?z } LIMIT 50

“But,” you may say, “the typeahead on the query helper isn’t working!” Read on…

Unresolved Problems

The Query Service frontend makes some calls to the wikibase api endpoint with a useCirrus parameter, expecting CirrusSearch to be installed and enabled. Unfortunately, with that parameter present, the calls don’t work, and those calls are what make the typeahead work in the query helper. I thought momentarily about spinning up an ElasticSearch server, but… no, I’ll probably just go hack something into my instance of mediawiki while I’m waiting for the wdqs patch to make it all the way to the docker image (because there is already a patch!). I would definitely like the helper working: I can use all the SPARQL help I can get at the moment.


Here’s the URL it’s currently trying to use:
http://wikibase.plantdata.io/w/api.php?action=wbsearchentities&format=json&limit=50&continue=0&language=en&uselang=en&useCirrus=1&search=species&type=property&callback=jQuery331001893678981134339_1523055846377&_=1523055846394


And, the same thing without the useCirrus parameter:
http://wikibase.plantdata.io/w/api.php?action=wbsearchentities&format=json&limit=50&continue=0&language=en&uselang=en&search=species&type=property&callback=jQuery331001893678981134339_1523055846377&_=1523055846394

Relatively small thing, considering.
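
One possible hack (untested, and not necessarily what I’ll end up doing) would be to strip that useCirrus parameter on the wiki side before the API module ever sees it, so the frontend’s call behaves like the second, working URL. The ApiBeforeMain hook and the WebRequest methods below are real MediaWiki APIs, but whether this actually fixes the typeahead is an open question:

// Untested idea for LocalSettings.php: drop useCirrus before wbsearchentities
// runs, so the frontend's call behaves like the working URL above.
$wgHooks['ApiBeforeMain'][] = function ( ApiMain $main ) {
    $request = $main->getRequest();
    if ( $request->getVal( 'action' ) === 'wbsearchentities' ) {
        $request->unsetVal( 'useCirrus' );
    }
    return true;
};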

From here on out, the plan appears to be a lot more fun:

  • Start populating data in the real plantdata wikibase instance (Data In)
  • Get comfortable with SPARQL and write some sample queries (Data Out)
  • Start thinking about what we want on the main plantdata.io site. Specific standard ways to search for plants and environmental data in a friendly interface, maybe a lightweight editor to fill holes in the data, that kind of thing.

Major thanks to addshore and Stas for being available for troubleshooting, dispensing expert advice, and writing the occasional patch throughout this process.