A Flexible, Configurable, Extensible Open Source Package for

Mass AI System Evaluation (currently for Machine Translation)


Omar F. Zaidan

Johns Hopkins University


Department of Computer Science


The Center for Language and Speech Processing


Latest Release: August 23rd, 2011 (v0.20).

Latest Webpage Update: August 23rd, 2011.


1. Overview

MAISE is a package that allows researchers to evaluate the output of their AI system(s) using human judgments collected via Amazon Mechanical Turk. MAISE is open source, easy to run, and platform-independent. Most importantly, it has been proven to be completely bug-free. :-)


2. Description

Amazon's Mechanical Turk (MTurk) is a virtual marketplace that allows anyone to create and post tasks to be completed by human workers around the globe. Each instance of those tasks, called a Human Intelligence Task (HIT) in MTurk lingo, typically requires human understanding and perception that machines have yet to achieve, hence making MTurk an example of "artificial artificial intelligence," as the developers of MTurk aptly put it. Arguably, the most attractive features of MTurk are the low cost of completing HITs and the speed at which they are completed. Having discovered this venue, many researchers in the fields of artificial intelligence and machine learning (and other fields such as psychology) are using MTurk as a valuable and effective source of annotations, labels, and data, namely the kind requiring human input.


One such kind of data is human evaluation of systems that are attempting to do what humans are good at. For instance, if you construct several systems that perform automatic speech transcription (i.e. converting speech to text), and would like to know how well each of the systems performs, you could create HITs on MTurk that 'showcase' the transcriptions obtained by the different systems, and ask workers to tell you which ones they like and which ones they find inferior. Such human feedback would also be valuable because it would help identify systematic errors and guide future development of your system(s).


The same can be applied to a variety of tasks besides speech transcription, such as machine translation, object recognition, emotion detection, etc.


The aim of the MAISE package is to streamline the process of creating those evaluation tasks and uploading the relevant content to MTurk to be judged, without requiring you to familiarize yourself with the mechanics, if you will, of Mechanical Turk. This allows you to spend more time improving your system rather than dealing with file input and output and MTurk's sometimes finicky interface.


Note: At the moment, MAISE is designed to aid the evaluation of machine translation (MT) systems. However, it can be used for other AI/ML tasks as well. Please see the FAQ.


3. Download, Licensing, and Citation

MAISE's source code, instructions, documentation, and a tutorial are all included in the distribution.

  • MAISE v0.20, released August 23rd, 2011.
  • MAISE v0.10, released November 1st, 2010.


MAISE is an open-source tool, licensed under the terms of the GNU Lesser General Public License (LGPL). Therefore, it is free for personal and scientific use by individuals and/or research groups. It may not be modified or redistributed, publicly or privately, unless the licensing terms are observed. If in doubt, contact the author for clarification and/or explicit permission.


If you use MAISE in your work, please cite the software package and include the URL in your paper.


4. The Mechanics of MAISE (Abbreviated Version)

MAISE is quite easy to use. There are a couple of Java programs to compile, but there is no need to install anything, mess with environment variables, etc. Whenever MAISE needs to communicate with MTurk, it will rely on MTurk's Java API, which you can download for free (it too requires no installation, as you will read in MAISE's documentation). Once you create your evaluation tasks and upload the necessary content to MTurk, workers will begin to complete the corresponding HITs. On a regular (e.g. daily) basis, you will tell MAISE to retrieve the new judgments that workers provided since the last time MAISE checked. The process continues until either all your tasks are completed, or you decide you have enough judgments.
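The retrieve-until-done cycle described above can be sketched in Java. This is a sketch under stated assumptions, not MAISE's own retrieval code: the class `JudgmentLog`, its methods, and the idea of identifying each judgment by a string ID are all hypothetical, and no MTurk call is made. It also assumes, for simplicity, one judgment per task. It only shows the bookkeeping: remember what has already been retrieved, count only what is new, and stop once every task is done or a chosen target has been reached.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical bookkeeping class; not part of MAISE or the MTurk API.
class JudgmentLog {
    private final Set<String> seenIds = new HashSet<>();
    private final int totalTasks; // number of tasks uploaded (assumes one judgment per task)
    private final int enough;     // number of judgments you consider sufficient

    JudgmentLog(int totalTasks, int enough) {
        this.totalTasks = totalTasks;
        this.enough = enough;
    }

    // Records one retrieved batch and returns how many judgments were new.
    int addBatch(List<String> judgmentIds) {
        int added = 0;
        for (String id : judgmentIds) {
            if (seenIds.add(id)) { // add() returns false for IDs already seen
                added++;
            }
        }
        return added;
    }

    // True once all tasks are judged or the chosen target has been reached.
    boolean shouldStop() {
        return seenIds.size() >= totalTasks || seenIds.size() >= enough;
    }

    public static void main(String[] args) {
        JudgmentLog log = new JudgmentLog(5, 3);
        // Two retrievals whose results overlap on "hit-2".
        System.out.println("new: " + log.addBatch(List.of("hit-1", "hit-2")));
        System.out.println("new: " + log.addBatch(List.of("hit-2", "hit-3")));
        System.out.println("stop: " + log.shouldStop());
    }
}
```

Tracking IDs in a set means a retrieval that returns judgments already seen does no harm, which suits a job that is rerun on a daily schedule.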


You can use MAISE with any evaluation setup you like, as long as you design the user interface for it. Currently, MAISE comes with existing support for a particular evaluation setup that asks annotators to rank the outputs of different systems relative to each other. When we say "existing support" we mean the user interface is included, and so is an analysis tool that can make sense of the judgments. This way, you don't need to do anything extra to obtain rankings of the systems. You can read more about this evaluation setup in the overview papers of the Workshop on Statistical Machine Translation (WMT) for the past two years.
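To make the idea of turning rank judgments into a ranking of systems concrete, here is a minimal Java sketch. It is not MAISE's analysis tool: the class name `RankingSketch`, the method `winRates`, the system names, and the scoring rule (the fraction of pairwise comparisons a system wins, with ties ignored) are assumptions chosen for illustration. Each judgment maps a system name to the rank an annotator gave its output, with 1 being best.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not MAISE's analysis tool.
class RankingSketch {

    // For each system, returns wins / (wins + losses) over all pairwise
    // comparisons implied by the judgments. Tied pairs are ignored, so a
    // system that only ever ties does not appear in the result.
    static Map<String, Double> winRates(List<Map<String, Integer>> judgments) {
        Map<String, int[]> tally = new HashMap<>(); // system -> {wins, losses}
        for (Map<String, Integer> judgment : judgments) {
            List<String> systems = new ArrayList<>(judgment.keySet());
            for (int i = 0; i < systems.size(); i++) {
                for (int j = i + 1; j < systems.size(); j++) {
                    String a = systems.get(i);
                    String b = systems.get(j);
                    int rankA = judgment.get(a);
                    int rankB = judgment.get(b);
                    if (rankA == rankB) {
                        continue; // tie: neither system wins this pair
                    }
                    String winner = rankA < rankB ? a : b; // lower rank is better
                    String loser = rankA < rankB ? b : a;
                    tally.computeIfAbsent(winner, k -> new int[2])[0]++;
                    tally.computeIfAbsent(loser, k -> new int[2])[1]++;
                }
            }
        }
        Map<String, Double> rates = new HashMap<>();
        for (Map.Entry<String, int[]> entry : tally.entrySet()) {
            int wins = entry.getValue()[0];
            int losses = entry.getValue()[1];
            rates.put(entry.getKey(), (double) wins / (wins + losses));
        }
        return rates;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> judgments = new ArrayList<>();
        judgments.add(Map.of("sysA", 1, "sysB", 2, "sysC", 3));
        judgments.add(Map.of("sysA", 1, "sysB", 1, "sysC", 2)); // sysA and sysB tied
        // Print systems from highest to lowest win rate.
        winRates(judgments).entrySet().stream()
            .sorted((x, y) -> Double.compare(y.getValue(), x.getValue()))
            .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```

Breaking each ranking into pairwise comparisons lets judgments that cover different subsets of systems be pooled into one score per system. How ties are counted is a design choice that changes the scores, so treat the rule used here as one option among several.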


5. FAQ

Q: It looks like MAISE was written for machine translation, not general ML/AI tasks. Can I really use MAISE for my task?

A: True, when MAISE was being written (before it was even named MAISE), it was meant to aid evaluating MT systems. However, it can be used for other ML/AI tasks as well, but some of the supported features may not make sense for your task, and you'll have to pretend they're not there. Basically, you'll also have to trick MAISE, telling it silly things like which languages you're translating to and from, only because it expects you to tell it. It's fairly simple to do so, but I would be more than willing to help you get started if you're not sure how. (This is the main reason why MAISE is not in version 1.0 yet...I promise you MAISE v1.0 will not need to be tricked into thinking it's doing MT!)


Q: Why did you develop MAISE?

A: My advisor at JHU, Chris Callison-Burch, recruited me to help him run the manual evaluation component of WMT10. The data processing components of MAISE mirror a previous Perl implementation by Josh Schroeder, who helped with the manual evaluation before I got on board. I then extended the code so it can communicate with MTurk, hence allowing us to utilize workers from around the globe, rather than having to personally recruit annotators to use an internal evaluation setup, which was previously the case. Following WMT10, I cleaned up my code, added a couple of features, made it (much) more user friendly, and wrote documentation for it, to aid whoever runs future WMT manual evaluations, and so that other ML/AI researchers can use it to obtain human judgments.


Q: The WMT manual evaluation involved a couple of other tasks. Are they supported in MAISE?

A: Not yet. I first focused on getting the ranking task supported, since it's the main evaluation setup we used in WMT, and in all likelihood is the evaluation setup other researchers would be most interested in, out of the tasks we had in WMT. I hope to get the other tasks supported in the future, though I don't have a particular idea of when that might happen.


Q: I have a feeling that you'd like to thank some people. So I'm going to let you...

A: Thanks for letting me thank some people. (See what I did there?) For starters, I developed MAISE while I was funded by DARPA's GALE Program, and in part by the EuroMatrixPlus Project. I would like to thank Chris Callison-Burch, Ondrej Bojar, and everybody who gave feedback during the WMT10 evaluation campaign. More importantly, many thanks go to Josh Schroeder, who helped me navigate his code when I first started reimplementing the data processing components.


Q: I'd like to offer you a job. Are you interested?

A: Yes! Check my website for my CV and publications, and drop me a line.


Q: But you didn't ask me what kind of job it is...

A: I have questionable morals.


6. History

Note: Version changes in the first decimal place (e.g. v1.05 to v1.10) reflect significant changes, such as changes in functionality or use. Changes in the second decimal place (e.g. v1.23 to v1.24) reflect minor changes in the documentation, instructions, output, etc.


v0.20 (8/23/11)

       Added more MTurk functions.

       Added more options to existing functions, particularly the Retriever module.


v0.10 (11/1/10)

       Initial release!


7. References

(coming soon)