Initial Thoughts on Data, a Start on a New Importer

Clearly, I haven’t been keeping up with the blog on a daily basis as I initially intended. This isn’t because I’m stuck, or not working on things; it’s more that I don’t want to stop working on the actual problems in order to distract myself with blog posts. This probably has something to do with the fact that I’ve finally gotten to the part where I get to do some coding, and, well, I get in the zone. So, from here on out, I’m going to aim for an update every week or so.

The last seven project days have largely been spent poring over plant species items in Wikidata, making some decisions about what I want to keep on the plantdata instance of wikibase, and getting a start on writing a wikibase data importer tool.

Data Thoughts

As a person who has spent many years designing and modifying data systems, I have a horrible allergic reaction whenever I see too much data duplicated when a reference would work. So, how much duplication is too much? In my experience: Most of it, unless not duplicating is preventing cool queries from being possible. For that reason, I’ve decided not to import much at all in the way of species-level property data from Wikidata. At the moment, the plan is to pull in five properties and the item ID for all plant species, which is something like 471,000 Wikidata items.

In the process of making that decision, I created a spreadsheet of common properties used on Wikidata plants, and what I wanted to do with them initially. Due to volume and my aforementioned allergic reactions to data duplication, I will not be pulling in any of the external db reference IDs, but Wow: Some of them would make excellent sources for doing research and populating the data I do want to keep here. It would certainly be worthwhile to surface these as possible sources to reference when users discover and wish to fill in holes in the data.

As for the data that I do want to keep: In another tab, I was working out what data I want to keep on plantdata.io for every species, variety, and cultivar, roughly grouped by something that’s almost (but not quite) use case.

A New Data Importer

Somewhere in there, I also got started writing the plantdata-importer script: A new command-line tool to import .tsv files to the wikibase instance of your choice. “Started” is the keyword there, though: At present, it doesn’t do anything beyond verifying that it can do some simple item and property searches (to make sure you’re not about to duplicate thousands of items, for instance…), log in to the specified wikibase instance, and instantiate a WikibaseFactory object (defined in the wikibase-api library) which will eventually do the actual edits.

I’d very much like this tool to attempt some reasonable property mapping decisions around column headers in the import files, so that’s the part of the tool I’m focused on at the moment. As soon as this thing works, it’s import time.
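
To make the header-mapping idea a little more concrete, here’s a minimal sketch of the very first step: pulling the column headers out of a .tsv file and normalizing them into candidate search terms. This is not the importer’s actual code. The function name and the sample rows are made up for illustration, and a real tool would still have to search the wikibase instance for a property matching each term.

```shell
# Hypothetical helper: read a .tsv on stdin and print one normalized
# column header per line (lowercased, outer spaces trimmed).
normalize_headers() {
  head -n 1 | tr '\t' '\n' | tr '[:upper:]' '[:lower:]' | sed -e 's/^ *//' -e 's/ *$//'
}

# Made-up sample data: one header row and one placeholder data row.
printf 'Taxon Name\tCommon Name \tWikidata ID\nexample\texample\tQ0\n' | normalize_headers
```

Each printed line is a term that could be fed to a property search, with anything that doesn’t match left for a human to map by hand.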

The plan:

  • Finish the plantdata-importer.
  • Start populating data in the real plantdata wikibase instance.
  • Get comfortable with SPARQL and write some sample queries.
  • Start thinking about what we want on the main plantdata.io site. Specific standard ways to search for plants and environmental data in a friendly interface, maybe a lightweight editor to fill holes in the data, that kind of thing.

Days 4 through 7: Long Live The Query Service!

This may be the last of the technical setup blogs for a while, fingers crossed.

I am pleased to report that I did in fact manage to get the Wikidata Query Service running for wikibase.plantdata.io, using addshore’s docker setup. The Query Service is in docker; the associated wikibase instance is not.

Here’s a link with a sample query, which I assure you is extremely exciting.

This query service is the thing that will allow us to pull data out of the plantdata project, and while collaboratively organizing giant piles of data is loads of fun (yes, really!), data going into something isn’t any good unless you can get it out again. Having a working query service is extremely exciting, because now we can get the data out.

Setting this up was not easy. Fortunately, I had plenty of leave time to burn with nothing else to do, and some convenient insider-access to a couple of wonderfully generous people who do this work for the main Wikidata project. I remain convinced that without both an unrealistic abundance of time and personal connections to lean on, I never would have gotten this done unless someone was paying me to play around with it.

Hopefully, the following takeaways and ‘gotchas’ I ran into will help make it easier for the next round of people who would like to explore this kind of setup.

Recent Changes Feed Expiration

This one goes first, because this issue stands the biggest chance of being bad news for pre-existing wikibase instances that want to start using the query service after a lot of data has already been established in the system. It’s important to understand that, by default, the Wikidata Query Service gets its updates more or less continuously from the Recent Changes feed on your wikibase instance. If you are ever hoping to use the Query Service, the easiest way to make sure it can catch all the changes you make is to dramatically increase the amount of time those changes are kept in the feed before the feed is pruned.

With that in mind, I added this to my LocalSettings:

// Prevents recent changes from being purged for a very long time.
// The value is in seconds: 5 years * 365 days * 24 hours * 3600 seconds = 157680000
$wgRCMaxAge = 157680000;

https://www.mediawiki.org/wiki/Manual:$wgRCMaxAge
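
If you’d rather pick a different retention window, the arithmetic is easy to check in a shell. This is only a sketch: rc_max_age_seconds is a made-up helper name, not anything MediaWiki provides, and it uses a flat 365-day year (no leap days), the same as the setting above.

```shell
# Hypothetical helper, not part of MediaWiki: convert a number of years
# into seconds, assuming 365-day years.
rc_max_age_seconds() {
  echo $(( $1 * 365 * 24 * 3600 ))
}

rc_max_age_seconds 5   # prints 157680000
```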

If you’re running a wikibase instance and have a lot of data that has already been pruned from your feed, you do have the option to go through what looks like a lengthy process of dumping your wikibase data, reformatting it, and importing it to the query service. The wikibase dump happens in a maintenance script, but I haven’t found any documentation better than looking at the files – and the reformatting and loading process looks pretty confusing too. With any luck, that Recent Changes MaxAge setting will mean I won’t need to look at any of that directly, at least for a while.

Additionally, and somewhat sneakily: I have reason to suspect that if you just… lightly edit your items to get them mentioned in your recent changes feed again, the query service will pick up the whole item and not just the specific thing about it that you change. If you do some testing around something like this to avoid the dump/upload process, please do let me know if it works or not.

HTTPS: Not Yet

I tried to start plantdata.io out defaulting to a secure connection for everything. Unfortunately, that decision made it functionally impossible to use the Query Service, in about a half dozen distinct ways. I understand that some patches are coming to address the issues I found with the query service and docker implementation, and indeed some patches have already been merged to improve the situation. However, there are still barriers that prevent the query service from getting any of my object or property data. In the end, in the interests of getting this off the ground in a timely manner, I decided to change the config to allow both http and https. Surprisingly, even that didn’t do it: One problem only went away after I removed the https from the $wgServer variable in LocalSettings. I’ll happily move back to HTTPS throughout the system later, but for now: http or give up.

.htaccess and other ShortURL changes

I usually don’t bother with rewriting URLs in personal mediawiki installs, but if you want the query service to be able to talk to your wikibase instance, it’s a requirement. There are a few re-written URLs that have to work where the query service expects them to be.

Recent changes feed, accessed by API:
https://wikibase.plantdata.io/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title%7cids%7ctimestamp&rcnamespace=120%7c122&rclimit=100&continue=&rcstart=20170301000000

Object’s entity data:
wikibase.plantdata.io/wiki/Special:EntityData/Q4.ttl

Wikidata example for comparison:
https://www.wikidata.org/wiki/Special:EntityData/Q4.ttl

To get all this to work, I had to do three things:

  1. Move my mediawiki installation out of webroot, and into a /w/ directory.
  2. Steal the rewrite lines (so, everything) from the .htaccess file currently being used in the wikibase docker container: https://github.com/wmde/wikibase-docker/blob/master/wikibase/1.30/htaccess
  3. Add a couple things to LocalSettings, as per the ShortURL manual on mediawiki.org
## https://www.mediawiki.org/wiki/Manual:Short_URL
$wgScriptPath = "/w";        // this should already have been configured this way
$wgArticlePath = "/wiki/$1";

Once those URLs work for your own site, you’re ready to fire up the docker containers.
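
A quick way to check those URLs before involving docker at all is to build them for your own host and request them. The sketch below only prints the URLs. The helper names are invented here, and the scheme, host, start timestamp, and item ID are placeholders to swap for your own.

```shell
# Hypothetical helpers: print the two URLs the query service needs to reach.
# Arguments: scheme (http or https), host, then a timestamp or an entity ID.
rc_feed_url() {
  printf '%s://%s/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title%%7cids%%7ctimestamp&rcnamespace=120%%7c122&rclimit=100&continue=&rcstart=%s\n' "$1" "$2" "$3"
}

entity_data_url() {
  printf '%s://%s/wiki/Special:EntityData/%s.ttl\n' "$1" "$2" "$3"
}

rc_feed_url http wikibase.plantdata.io 20170301000000
entity_data_url http wikibase.plantdata.io Q4
```

Passing either printed URL to something like curl -s -o /dev/null -w '%{http_code}\n' should ideally report a 200 if the rewrite rules are doing their job.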

WDQS Docker and Config

I’m using the docker images for the wikidata query service, here:
https://github.com/wmde/wikibase-docker/

This is the docker-compose.yml file that got everything to work for my instance.
https://phabricator.wikimedia.org/P6968

Once you have cloned the wikibase docker repo somewhere you’ve installed docker, replace the docker-compose file with something like mine, point it to your own instance everywhere it says WIKIBASE_HOST, and bring up the containers.

Note from a previous day: The Wikidata Query Service is an unbelievable memory hog. I couldn’t get the wdqs container to start properly until I sprang for a machine with 8 GB of memory, up from 4. So, if blazegraph just won’t start, consider… feeding it.

If you’ve never used docker before, you will eventually find that you need to be a little aggressive about removing containers sometimes. This is particularly true if you change things in the compose file, or one or more of your containers writes to a volume. You’ll want to remove everything you can before bringing the new stuff up, and make sure that the new containers don’t use some cached version of something you thought you removed.

This sequence of events is the one I’ve come to prefer, even if it does take a little longer:

docker-compose down --volumes
docker-compose rm -f

But that doesn’t actually take care of the volumes all the time. Always check:

docker volume ls

And if that returns anything:

docker volume rm volumename
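
If several volumes are left over, removing them one at a time gets old. Here’s a small sketch that filters a list of volume names down to one compose project. It assumes the volumes are prefixed with the project name and an underscore (wikibasedocker in my case, going by the container names). The function name and the sample volume names are made up.

```shell
# Hypothetical helper: keep only the volume names that start with
# "<project>_". Reads names on stdin, one per line; prints nothing
# (and still succeeds) when there is no match.
project_volumes() {
  grep "^${1}_" || true
}

# Made-up volume names, standing in for the output of "docker volume ls -q".
printf 'wikibasedocker_example-data\nsomething_else\n' | project_volumes wikibasedocker
```

With real data that becomes docker volume ls -q | project_volumes wikibasedocker | xargs docker volume rm, though it’s worth eyeballing the filtered list before piping it into rm.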

Now, you can bring things back up.

docker-compose pull
docker-compose build --no-cache
docker-compose up

If you’re not particularly interested in the output, run that last up command with a -d to run it in the background.

If there’s a reasonable way to be more aggressive about making sure you’re not dealing with hangover bugs from a previous docker-compose command, I haven’t found it yet.

Reaching Into The Past

The containerized wdqs is configured not to go too far into the past when importing your recent changes. If you already have data in your instance, you probably want it to go farther back than it will by default.

So, in the same directory where you’ve just run all the docker-compose commands, while your containers are up, try something like this:

docker exec wikibasedocker_wdqs_1 ./runUpdate.sh -h http://wdqs.svc:9999 -- --wikibaseHost wikibase.plantdata.io --wikibaseScheme http --entityNamespaces 120,122 -s 20170301000000 --init

The host should again be yours, and the timestamp after -s is just four digits of year, followed by two month digits and two day digits. Then, I assume hours, minutes, and seconds, but I didn’t need to get terribly precise with that. Like the regular updater script, it will keep running every ten seconds until you kill it. You can probably kill it after one pass.
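
For what it’s worth, MediaWiki timestamps in this format are YYYYMMDDHHMMSS, so the guess about hours, minutes, and seconds holds. Here is a tiny sketch for turning a calendar date into a value for -s. The name start_timestamp is made up, and it always uses midnight, which should be precise enough for this purpose.

```shell
# Hypothetical helper: turn a YYYY-MM-DD date into a 14-digit
# YYYYMMDDHHMMSS timestamp at midnight.
start_timestamp() {
  printf '%s000000\n' "$(printf '%s' "$1" | tr -d '-')"
}

start_timestamp 2017-03-01   # prints 20170301000000
```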

Try Out Some Queries!

This is a good start that will let you know if anything landed that looks familiar:

SELECT * WHERE { ?x ?y ?z } Limit 50

“But,” you may say, “the typeahead on the query helper isn’t working!” Read on…

Unresolved Problems

The Query Service frontend makes some calls to the wikibase API endpoint with a useCirrus parameter, expecting CirrusSearch to be installed and enabled. Unfortunately, with that parameter present, the calls don’t work on my instance, and this mechanism is what makes the typeahead work in the query helper. I thought momentarily about spinning up an ElasticSearch server, but… no, I’ll probably just go hack something into my instance of mediawiki while I’m waiting for the wdqs patch to make it all the way to the docker image (because there is already a patch!). I would definitely like the helper working: I can use all the SPARQL help I can get at the moment.

Here’s the URL it’s currently trying to use:
http://wikibase.plantdata.io/w/api.php?action=wbsearchentities&format=json&limit=50&continue=0&language=en&uselang=en&useCirrus=1&search=species&type=property&callback=jQuery331001893678981134339_1523055846377&_=1523055846394

And, the same thing without the useCirrus parameter:
http://wikibase.plantdata.io/w/api.php?action=wbsearchentities&format=json&limit=50&continue=0&language=en&uselang=en&search=species&type=property&callback=jQuery331001893678981134339_1523055846377&_=1523055846394

Relatively small thing, considering.
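
To compare the two calls on your own instance, it’s handy to strip the parameter mechanically rather than by hand. Here’s a rough sketch with sed. The name strip_use_cirrus is made up and the example.org URL is a placeholder.

```shell
# Hypothetical helper: remove a useCirrus=... parameter from URLs on stdin.
# Handles the parameter in the middle or at the start of the query string;
# a URL whose only parameter is useCirrus is not handled.
strip_use_cirrus() {
  sed -e 's/&useCirrus=[^&]*//' -e 's/?useCirrus=[^&]*&/?/'
}

printf '%s\n' 'http://example.org/w/api.php?action=wbsearchentities&useCirrus=1&search=species' | strip_use_cirrus
```

Requesting both the original and the stripped URL makes it fairly obvious whether useCirrus is the only thing standing between you and a working typeahead.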

From here on out, the plan appears to be a lot more fun:

  • Start populating data in the real plantdata wikibase instance (Data In)
  • Get comfortable with SPARQL and write some sample queries (Data Out)
  • Start thinking about what we want on the main plantdata.io site. Specific standard ways to search for plants and environmental data in a friendly interface, maybe a lightweight editor to fill holes in the data, that kind of thing.

Major thanks to addshore and Stas for being available for troubleshooting, dispensing of expert advice, and occasional patch writing throughout this process.

Day 3: Upgrades and Stack Traces

Today, I bought a virtual private server for everything plantdata.io that I’m not planning on running out of a container, including this blog. I had been running it all on shared hosting. It took a bit to move everything over to the VPS, but things should be moving noticeably faster as a result. I had hoped it would be a bit longer until I had to upgrade my hosting, but I noticed it was occasionally taking longer than 10 seconds to load simple pages or contact the mediawiki api. Not only was that getting on my nerves; with response times like that, it would certainly cruise straight past updater script timeouts. This should be far less of a problem with the new VPS.

Once that was completed, I got back into the minor explosion I created at the end of Day 2 (which was not technically yesterday). I have so far been unable to get the containerized wikidata query service to pull data correctly from plantdata’s wikibase instance, but I think I know why that is now: Docker is doing some things I wasn’t expecting. This makes sense, as my expectations for what Docker does and does not do are roughly three days old at this point. It would be pretty weird if I already knew all the surprises after three days.

I did manage to catch a pretty good ‘gotcha’ on behalf of the crowd hoping to do similar things with a containerized wdqs pointing to an existing wikibase instance. Running runUpdate.sh in verbose mode revealed that the script assumes you’ve configured your wikibase instance to run at [domain]/w/. So, if your mediawiki api is configured to live somewhere other than [domain]/w/api.php, you will either need to do a redirect on the wikibase end, or hack your configured directory structure into runUpdate.sh. After reading some documented reasons why you probably don’t want to run mediawiki straight from the web root (it was), I opted to change the mediawiki configuration to match the script.

Tomorrow’s Plan is identical to yesterday’s plan in every way, including the part where it’s clearly a multi-day plan that couldn’t fit into a single day, no matter how great of a day it ends up being:

  • Continue learning things about docker, presumably by exploding and unexploding all the test containers until I stop being surprised by its behavior.
  • Finish writing my own compose file to forego the containerized wikibase instance, and instead point all the query service containers to my real pre-existing plantdata wikibase install
  • Verify that the instances are communicating the way I think they should be, or learn enough to alter my expectations
  • Start populating data in the real plantdata wikibase instance (Data In)
  • Get comfortable with SPARQL and write some sample queries (Data Out)

Day 2: Test-driving the Tools

Today was spent largely playing with docker again, trying to debug the issues I was having with the wikibase and wikidata query service test instances I started playing with yesterday. I am pleased to report that I did manage to fix the problems I was having yesterday in getting the query service to import data from the adjacent wikibase install. Turns out the data wasn’t loading, because the timestamp of the newest data was older than a timestamp on the updater system. I got a lot of this in the updater container output:

20:51:47.216 [main] ERROR org.wikidata.query.rdf.tool.Update - Error during initialization.

wdqs-updater_1   | java.lang.IllegalStateException: RDF store reports the last update time is before the minimum safe poll time.  You will have to reload from scratch or you might have missing data.

Thanks again to the same very helpful expert from yesterday (Yay again, Stas!), I was able to fix this problem by running the runUpdate.sh script once with an additional -s parameter to reset the timestamp to something older. That way, it would rebuild more of the data, get a new timestamp from the latest update, and stop complaining in general.

To further complicate things, I quickly ran into another wall: I couldn’t reliably run a docker command on the updater container. Docker exec commands kept complaining that the container was in the process of restarting and advising me to try again, despite docker-compose ps telling me consistently that the container was up. I stopped and restarted that container and the whole set several times, and kept getting the same results. Instead of continuing to fight that one container, I ran the following command on the main wdqs container instance, which was stable and based on the same image as the updater container:

docker exec wikibasedocker_wdqs_1 ./runUpdate.sh -h http://wdqs.svc:9999 -- --wikibaseHost wikibase.svc --wikibaseScheme http --entityNamespaces 120,122 -s 20180301000000

Fixed! This enabled me to spend some time today basking in the utter confusion of trying to understand what on Earth SPARQL is all about, but with data behind it this time. The fact that this is only Day 2 means I am pretty far ahead of the general schedule of events I had vaguely guessed at before I actually started poking around with these tools. I may even be in a place where I can usefully start importing data to the production wikibase instance before this week is out.

It’s probably worth noting, though, that the updater container being flaky and restarting all the time is probably something I’ll have to revisit later. I can certainly imagine that the instability in that container was what caused the timestamps to go funny in the first place. Then again, this may be completely expected behavior for that container, but it certainly does seem suspicious.

I finished out the day starting to convince the compose file to have the query service look at the production location of the plantdata wikibase, and got just far enough to generate several pages of errors to dig into tomorrow. Nothing like finishing with a small explosion.

Tomorrow’s Plan:

  • Continue learning things about docker, presumably by exploding and unexploding all the test containers until I stop being surprised by its behavior.
  • Finish writing my own compose file to forego the containerized wikibase instance, and instead point all the query service containers to my real pre-existing plantdata wikibase install
  • Verify that the instances are communicating the way I think they should be, or learn enough to alter my expectations
  • Start populating data in the real plantdata wikibase instance (Data In)
  • Get comfortable with SPARQL and write some sample queries (Data Out)

Same as yesterday: This plan is clearly too big for one day. If I can manage to successfully import data from outside the docker setup to the containerized wikidata service, I will be positively delighted.

Today’s most useful link:

https://github.com/phusion/baseimage-docker/issues/319 – handily explained some warnings I kept running into while building things in docker. Looks like we’re extending a base image with the same issue being described in that thread, but they’re just warnings (which I believe strongly in addressing vs ignoring), they are pretty easy to fix with one line of code, and they didn’t seem to be breaking anything on their own. It’s just… you know. Less red in the compose output means fewer immediate mysteries to follow up on.

Day 1: The Beginning

Today was the first day of a two-month sabbatical that I am taking, in order to get this plantdata project off the ground. I have decided that every day I work on this, I will write a short blog post outlining what I accomplished during that day, things I learned, and next steps. I’m doing this for two reasons:

  1. While I have spent most of my adult life as a full-time coder, at some point not too far away that majority share will be in *managing* full-time coders. It’s been a long time since I’ve had the opportunity to focus on building something, and to put it bluntly, I am so rusty I can actually hear creaking noises sometimes. I’m hoping that keeping decent notes on a schedule will help me get back into the game.
  2. Shame, really. Shame as a motivator. I fully expect to spend about two weeks flailing wildly with very little to show for it, and having to tell everyone about the whole thing should force me to document mental progress better than I would if left entirely to my own unobserved flailing.

I have decided to start this journey with an investigation of the relatively recent docker images and compose work that’s been going on. I knew it wasn’t going to be entirely smooth sailing for me, as I have never before used docker, or successfully installed blazegraph on anything. Nevertheless, I was able to use docker-compose to spin up some instances in a couple hours.

One early takeaway: Good grief, the Wikidata Query Service needs a lot of memory to start up! It wouldn’t run cleanly until I upgraded my docker box from 4GB memory to 8GB. This also doubled the monthly cost of running this little experiment with my web host, but… /me shrugs

Once my test box had a sufficient amount of RAM, the compose command ran cleanly, and I could load the frontends of both the containerized wikibase instance, and the containerized wikidata query service. I did not expect to get that far before lunch. Unfortunately, after lunch it rapidly became clear to me that the wikidata query service wasn’t *quite* connected up to the wikibase instance: Confusingly, the typeahead in the query helper UI could get objects and properties, but no data was ever returned upon running an actual query.

SPARQL isn’t exactly something I’m comfortable with either at this exact moment. Not knowing if it was misconfigured machines or my own inability to write a well-formed SPARQL query, I called in an expert who was very helpfully watching his email (thanks, Stas!).

For the readers also uncomfortable with SPARQL, here’s an easy query to test if your wikidata query service is actually talking to anything or not:

SELECT * WHERE { ?x ?y ?z } Limit 10

Turns out that even though the query helper typeaheads work like everything is wired up correctly, my containerized wdqs instance isn’t loading data updates from the adjacent wikibase container. I destroyed those containers and remade them just for fun (isn’t that what containers are for?), and while it did not magically fix the issue, it was genuinely entertaining for a minute.

I did clear up a misconception I’d been carrying with me for a while: The query service gets its data by monitoring the recent changes feed on your target wiki. And here I’d been thinking there was some kind of db dump and import on a cron job I’d have to set up eventually. I’m honestly a little surprised that’s not the case, and now I’m wondering what options exist for recovery/rebuilding if your wdqs instance walks off into outer space…

Tomorrow’s Plan:

  • Learn everything about docker. Particularly, if there is a nice way to have containers land their logs somewhere easily accessible. But also everything else.
  • Get the test containers in the wikibase group to talk to each other the way they are supposed to
  • Write my own compose file to forego the containerized wikibase instance, and instead point all the query service containers to my real pre-existing plantdata wikibase install
  • Verify that the instances are communicating the way I think they are
  • Start populating data in the real plantdata wikibase instance (Data In)
  • Get comfortable with SPARQL and write some sample queries (Data Out)

I’ll be delighted if I accomplish two of those six things tomorrow.

Today’s most useful links: