I’ve been quiet for a while now with no updates, but I haven’t been sitting on my thumbs. As is often the case, a lot is happening that you don’t see. First, I’ve been working with Krzysztof, my Polish superhero friend (seriously, he speaks and writes Polish, and there isn’t anything more badass than that), on getting the website translation functionality working, and there is now a Polish translation of the QW website.
Second, I’ve been looking into LanguageTool, a Java toolkit for analyzing and correcting text, and trying to work out whether I could or should use it for QW. It supports a number of languages and offers rules and spellchecking for most of them. After doing some research I determined that yes I could use it for QW, if only for the spellchecking, and yes I probably should use it since it offers other benefits like word tagging and proper sentence tokenizing (both are important for the Problem Finder and accurate word counts).
But as is often the case, once you start to scratch beneath the surface of a thing you find problems. I’m going to get technical from here and start talking about code implementation and Java concepts specifically. So if you aren’t interested in that, the tl;dr is that I’ve been busy modifying LanguageTool so I can use it in QW, but I’m still a ways off completing this task due to the complexity and volume of work. Also, and this is a shameless plug, I self-published a story to Draft2Digital. See https://www.books2read.com/b/bwYLE0 if you fancy a quick read; it’s only about 800 words. All feedback is welcome.
So, moving on. All code is built on assumptions, and QW is no different to LanguageTool in this regard. For example, QW assumes you are using a Windows desktop and have access to a file system. This is why it won’t work on mobile devices and doesn’t have cloud saves (yet, watch this space).
LanguageTool has its own assumptions and a large part of my time lately has been spent removing those assumptions so that it can be used in contexts outside of those the developers had in mind.
The biggest of these assumptions is that all languages you’ll ever use will be available to you at all times. This shows up as LanguageTool’s assumption that it will be running in a server environment where storage is not an issue and download size isn’t a problem. That is, LanguageTool in a server environment can happily have all the languages it supports, and all the files for those languages, available to it. But for that assumption to hold for QW, the .exe download would bloat up to around 200MB, with over 150MB of that devoted to language information. In other words, every QW user would have every language in their QW installation regardless of how many languages they would ever use.
Setting aside the download issues this would bloat my server costs by a factor of 2/3 per month. Also how many of you would need a spellchecker for Esperanto or Asturian? It would be nice to be able to offer spellchecking for those languages but should every user have to download the spellchecker for them? Interestingly my Chrome spellchecker is telling me that Asturian is spelled wrong which gives you an idea of how widely used that language is…
Now I’ve modified LanguageTool to remove the “I’m operating on a server” assumption and I can now load language “packs” on an ad-hoc basis. However LanguageTool assumes a number of other things that I felt also needed to change (here I am talking about my use case, I assume that the developers of LanguageTool are happy with their current code and assumptions and I’m not going to tell them otherwise). For instance LanguageTool assumes that all information for a language will exist in a jar file. A jar file is Java’s version of a zip file. It has a number of special features that allow it to be a repository of both code and data. This is convenient when you want to package everything up into a single source and don’t want to worry about the details of where things are but it makes changing anything or using different sources a real pain in the backside.
For instance, if I want to add words, phrases or rules to LanguageTool, I can’t just modify a file on my file system. I have to download LanguageTool, modify the file in their distribution, package things up and then replace the jar file for that language. This is time consuming and annoying, so I’ve been changing LanguageTool to be agnostic about where it gets its information from. Data is now retrieved from a “data broker”, which is itself an interface and thus doesn’t care where the data comes from. So when a language asks for dictionaries, it asks its data broker implementation, which then knows to get the data from “somewhere”.
This decouples the data from the jar file, so the data can come from wherever your data broker implementation decides. This could be the file system, a jar file, a database, binary objects, a remote server, the moon, Donald Trump, whatever.
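To make the data broker idea a bit more concrete, here is a minimal sketch of what such an interface could look like. To be clear, this is not the code from my LanguageTool changes: the names `DataBroker`, `FileSystemBroker`, `ClasspathBroker`, `InMemoryBroker` and `Language`, and the `pl/words.txt` resource, are all made up for illustration. The point is only that the language talks to the interface and never to a jar or a file directly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class DataBrokerSketch {

    /** Hypothetical interface: ask for a named resource, get a stream back. */
    interface DataBroker {
        InputStream open(String resourceName) throws IOException;
    }

    /** Reads resources from a directory on the file system. */
    static final class FileSystemBroker implements DataBroker {
        private final Path root;
        FileSystemBroker(Path root) { this.root = root; }
        @Override
        public InputStream open(String resourceName) throws IOException {
            return Files.newInputStream(root.resolve(resourceName));
        }
    }

    /** Reads resources from the classpath, which is where jar-packaged data lives. */
    static final class ClasspathBroker implements DataBroker {
        @Override
        public InputStream open(String resourceName) throws IOException {
            InputStream in = ClasspathBroker.class.getResourceAsStream("/" + resourceName);
            if (in == null) {
                throw new IOException("Resource not found: " + resourceName);
            }
            return in;
        }
    }

    /** Keeps resources in memory; handy for tests. */
    static final class InMemoryBroker implements DataBroker {
        private final Map<String, byte[]> data = new HashMap<>();
        InMemoryBroker put(String name, String content) {
            data.put(name, content.getBytes(StandardCharsets.UTF_8));
            return this;
        }
        @Override
        public InputStream open(String resourceName) throws IOException {
            byte[] bytes = data.get(resourceName);
            if (bytes == null) {
                throw new IOException("Resource not found: " + resourceName);
            }
            return new ByteArrayInputStream(bytes);
        }
    }

    /** Stand-in for a language: it only knows about the broker, not the source. */
    static final class Language {
        private final DataBroker broker;
        Language(DataBroker broker) { this.broker = broker; }
        String loadDictionary(String name) {
            try (InputStream in = broker.open(name)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Same Language code, two different data sources.
        DataBroker memory = new InMemoryBroker().put("pl/words.txt", "kot\npies\n");
        System.out.print(new Language(memory).loadDictionary("pl/words.txt"));

        Path dir = Files.createTempDirectory("langpack");
        Files.createDirectories(dir.resolve("pl"));
        Files.write(dir.resolve("pl/words.txt"), "dom\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(new Language(new FileSystemBroker(dir)).loadDictionary("pl/words.txt"));
    }
}
```

Swapping the source is then just a matter of handing the language a different broker, which is also what makes editing a word list on disk possible without repackaging a jar.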
There are other assumptions that I also want to change, such as the way that languages in LanguageTool create and return rules. LanguageTool assumes that you want all rules for a language and then you’ll “switch off” the ones you aren’t interested in or switch them back on when you want them. This means that all the rules are created when you first want to use the language. Some of these rules are extremely complex and require a lot of setup so creating everything is a time consuming thing to do (especially when you may not ever use those rules). So I’ve also been removing this assumption so you can cherry pick the rules you want and can ignore those you don’t.
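As a rough illustration of the cherry-picking idea, here is a small sketch in which rules are registered as factories and only built when somebody actually asks for them. Again, this is an assumption-laden toy and not LanguageTool’s API or my actual changes: `Rule`, `SimpleRule`, `RuleRegistry` and the rule ids are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class LazyRulesSketch {

    /** Hypothetical rule type; a real rule would also check text. */
    interface Rule {
        String id();
    }

    static final class SimpleRule implements Rule {
        private final String id;
        SimpleRule(String id) { this.id = id; }
        @Override
        public String id() { return id; }
    }

    /** Holds a factory per rule id and builds a rule only on first request. */
    static final class RuleRegistry {
        private final Map<String, Supplier<Rule>> factories = new LinkedHashMap<>();
        private final Map<String, Rule> created = new LinkedHashMap<>();

        void register(String id, Supplier<Rule> factory) {
            factories.put(id, factory);
        }

        Rule get(String id) {
            Supplier<Rule> factory = factories.get(id);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown rule: " + id);
            }
            // The expensive setup runs at most once per rule, and only if requested.
            return created.computeIfAbsent(id, key -> factory.get());
        }

        /** Cherry-pick: build and return only the named rules. */
        List<Rule> select(String... ids) {
            List<Rule> result = new ArrayList<>();
            for (String id : ids) {
                result.add(get(id));
            }
            return result;
        }

        int createdCount() {
            return created.size();
        }
    }

    public static void main(String[] args) {
        RuleRegistry registry = new RuleRegistry();
        registry.register("SPELLING", () -> new SimpleRule("SPELLING"));
        registry.register("DOUBLE_SPACE", () -> new SimpleRule("DOUBLE_SPACE"));
        registry.register("COMPLEX_GRAMMAR", () -> new SimpleRule("COMPLEX_GRAMMAR"));

        List<Rule> wanted = registry.select("SPELLING");
        System.out.println("Selected: " + wanted.get(0).id());
        System.out.println("Rules built so far: " + registry.createdCount());
    }
}
```

The difference from the “create everything, then switch things off” approach is that the unused factories are never invoked, so their setup cost is never paid.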
The trouble with all this change is that it takes time. LanguageTool has been around for over a decade and has support for over 20 languages, with some languages having multiple dialects. Changing things over to remove these assumptions has been a big task that I’ve nearly completed, which is why I’m writing this now. I’ve been trying to minimize the impact that these changes will have on existing users of LanguageTool (if they want to switch over) and trying to ensure that everything still works. However, there are some major issues that have been difficult to work around, such as some languages missing important files or tests being out of date and no longer working. But I’m nearly there. Ultimately I’d like users of QW to have a better spellchecker (a problem I’ve been trying to solve for years now, and I still remember the angry email I got from one user who lamented the removal of English dialects from the spellchecker), and I’d also like to be able to add more rules for checking text.
So lots of activity is happening but as ever with software development you don’t see much of it.
For those interested in my changes to LanguageTool, see my repository on GitHub.