Clearly, I haven’t been keeping up with the blog on a daily basis as I initially intended. This isn’t because I’m stuck, or not working on things; it’s more that I don’t want to stop working on the actual problems in order to distract myself with blog posts. This probably has something to do with the fact that I’ve finally gotten to the part where I get to do some coding, and, well, I get in the zone. So, from here on out, I’m going to aim for updates every week or so.
The last seven project days have largely been spent poring over plant species items in Wikidata, making some decisions about what I want to keep on the plantdata instance of wikibase, and getting a start on writing a wikibase data importer tool.
As a person who has spent many years designing and modifying data systems, I have a horrible allergic reaction whenever I see data duplicated where a reference would work. So, how much duplication is too much? In my experience: Most of it, unless not duplicating would make cool queries impossible. For that reason, I’ve decided not to import much at all in the way of species-level property data from Wikidata. At the moment, the plan is to pull in five properties and the item ID for all plant species, which is something like 471,000 Wikidata items.
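To get a feel for the scale of that species-level pull, here’s a sketch (not from the post) of the kind of SPARQL you’d run against Wikidata to count plant species items. The property and item IDs are real Wikidata ones — P31 (instance of), Q16521 (taxon), P105 (taxon rank), Q7432 (species), P171 (parent taxon), Q756 (plant) — but treat the query shape as illustrative; a transitive walk over P171 at this scale may well time out on the public endpoint.

```python
# Sketch: build a SPARQL query that counts plant species items on Wikidata.
from textwrap import dedent

def plant_species_count_query() -> str:
    """Return a SPARQL query counting items that are taxa, ranked as
    species, whose parent-taxon chain reaches 'plant' (Q756)."""
    return dedent("""\
        SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
          ?item wdt:P31 wd:Q16521 ;      # instance of: taxon
                wdt:P105 wd:Q7432 ;      # taxon rank: species
                wdt:P171* wd:Q756 .      # parent taxon (transitive): plant
        }""")

# To actually run it against the public endpoint (network required, and it
# may time out for a result set this large):
# import requests
# r = requests.get("https://query.wikidata.org/sparql",
#                  params={"query": plant_species_count_query(),
#                          "format": "json"})
# print(r.json()["results"]["bindings"][0]["count"]["value"])
```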
In the process of making that decision, I created a spreadsheet of common properties used on Wikidata plants and what I wanted to do with them initially. Due to volume and my aforementioned allergic reaction to data duplication, I will not be pulling in any of the external db reference IDs, but wow: Some of them would make excellent sources for researching and populating the data I do want to keep here. It would certainly be worthwhile to surface these as possible sources to reference when users discover and wish to fill in holes in the data.
As for the data that I do want to keep: In another tab, I worked out what I want to keep on plantdata.io for every species, variety, and cultivar, roughly grouped by something that’s almost (but not quite) use case.
A New Data Importer
Somewhere in there, I also got started writing the plantdata-importer script: A new command-line tool to import .tsv files to the wikibase instance of your choice. “Started” is the keyword there, though: At present, it doesn’t do anything beyond verifying that it can do some simple item and property searches (to make sure you’re not about to duplicate thousands of items, for instance…), log in to the specified wikibase instance, and instantiate a WikibaseFactory object (defined in the wikibase-api library) which will eventually do the actual edits.
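That duplicate-check step can be sketched independently of any particular API. The real tool sits on the wikibase-api library’s WikibaseFactory; this is just an illustrative Python sketch where an injected `search` callback stands in for the actual item-search call, and all names and IDs are made up:

```python
# Sketch of the "don't duplicate thousands of items" guard: split candidate
# labels into new vs. already-present, given any search function.
from typing import Callable, Iterable

def items_to_create(labels: Iterable[str],
                    search: Callable[[str], list]) -> tuple[list, list]:
    """Return (new, already_present); `search` returns matching item IDs
    for a label on the target wikibase instance."""
    new, present = [], []
    for label in labels:
        (present if search(label) else new).append(label)
    return new, present

# Usage with a stand-in search function (made-up data):
fake_index = {"Quercus robur": ["Q123"]}
new, present = items_to_create(["Quercus robur", "Imaginarius plantus"],
                               search=lambda l: fake_index.get(l, []))
# new == ["Imaginarius plantus"]; present == ["Quercus robur"]
```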
I’d very much like this tool to attempt some reasonable property mapping decisions around column headers in the import files, so that’s the part of the tool I’m focused on at the moment. As soon as this thing works, it’s import time.
Next Steps
- Finish the plantdata-importer.
- Start populating data in the real plantdata wikibase instance.
- Get comfortable with SPARQL and write some sample queries.
- Start thinking about what we want on the main plantdata.io site: standard ways to search for plants and environmental data in a friendly interface, maybe a lightweight editor to fill holes in the data, that kind of thing.