Tuesday, April 18, 2017

A better March Madness script?

Last year, I wrote an article for Linux Journal describing how to create a Bash script to build your NCAA "March Madness" brackets. I don't really follow basketball, but I have friends who do, so by filling out a bracket I at least have a stake in the games.

Since then, I realized my script had a bug that prevented any rank 16 team from winning over a rank 1 team. So this year, I wrote another article for Linux Journal with an improved Bash script to build a better NCAA "March Madness" bracket. In brief, the updated script builds a custom random "die roll" based on the relative strength of each team. My "predictions" this year are included in the Linux Journal article.
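
If you're curious how a weighted "die roll" can work, here's a minimal sketch of the idea. This is not the script from the article; the weights here (17 minus the seed) are placeholders for the real relative-strength values:

#!/bin/bash
# pick_winner.sh - minimal sketch of a weighted "die roll" between two seeds.
# Usage: pick_winner.sh seed1 seed2
# The weights (17 minus the seed) are placeholders, not the article's formula.

seed1=$1
seed2=$2

w1=$(( 17 - seed1 ))   # a stronger (lower) seed gets a bigger weight
w2=$(( 17 - seed2 ))

roll=$(( RANDOM % (w1 + w2) ))   # roll a "die" with w1+w2 sides

if [ $roll -lt $w1 ]; then
  echo "seed $seed1 wins"
else
  echo "seed $seed2 wins"
fi

With this sketch, pick_winner.sh 1 16 picks the rank 1 team about 94% of the time, but the rank 16 team can still pull off the upset, which is exactly what the buggy script couldn't do.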

Since the games are now over, I figured this was a great time to see how my bracket performed. If you followed the games, you know that there were a lot of upsets this year. No one really predicted the final two teams for the championship. So maybe I shouldn't be too surprised if my brackets didn't do well either. Next year might be a better comparison.

In the first round of NCAA March Madness, you start with teams seeded 1–16 in each of four regions, so that's 64 teams competing in 32 games. In that "round of 64," my shell script correctly predicted 21 outcomes. That's not a bad start.

March Madness is single-elimination, so for the second round, you have 32 teams competing in 16 games. My shell script correctly guessed 7 of those games. So just under half were predicted correctly. Not great, but not bad.

In the third round, my brackets suffered. This is the "Sweet Sixteen," where 16 teams compete in 8 games, but my script correctly predicted only 1 of those games.

And in the fourth round, the "Elite Eight" round, my script didn't predict any of the winners. And that wrapped up my brackets.

Following the standard method for scoring "March Madness" brackets, each round has 320 possible points. In round one, assign 10 points for each correctly predicted outcome. In round two, assign 20 points for each correct outcome. And so on, doubling the points per game in each round. From that, the math is pretty simple.

round one:    21 × 10 = 210
round two:     7 × 20 = 140
round three:   1 × 40 =  40
round four:    0 × 80 =   0
total:                  390
My total score this year is 390 points. As a comparison, last year's script (the one with the bug) scored 530 in one instance, and 490 in another. But remember that there were a lot of upsets this year, so everyone's brackets fared poorly.
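
If you want to double-check the math, the whole score fits in one line of shell arithmetic, using the per-round counts and point values from the table above:

echo $(( 21*10 + 7*20 + 1*40 + 0*80 ))   # prints 390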

Maybe next year will be better.

Did you use the Bash script to help fill out your "March Madness" brackets? How did you do?

Monday, April 3, 2017

How many testers do you need?

When you start a usability test, the first question you may ask is "how many testers do I need?" The standard go-to article on this is Nielsen's "Why You Only Need to Test with 5 Users" which gives the answer right there in the title: you need five testers.

But it's important to understand why Nielsen picks five as the magic number. MeasuringU has a good explanation, but I think I can provide my own.

The core assumption is that each tester will uncover a certain number of issues in a usability test, assuming good test design and well-crafted scenario tasks. The next tester will uncover about the same number of usability issues, but not exactly the same issues. So there's some overlap, and some new issues too.

If you've done usability testing before, you've observed this yourself. Some testers will find certain issues, other testers will find different issues. There's overlap, but each tester is on their own journey of discovery.

Exactly how many usability issues each tester uncovers is up for some debate. Nielsen uses his own research and asserts that a single tester can uncover about 31% of the usability issues. Again, that assumes good test design and scenario tasks. So one tester finds 31% of the issues, the next tester finds 31% but not the same 31%, and so on. With each tester, there's some overlap, but you discover some new issues too.

In his article, Nielsen describes a function to demonstrate the number of usability issues found vs the number of testers in your test, for a traditional formal usability test:
1 - (1 - L)^n

…where L is the proportion of issues one tester can uncover (Nielsen assumes L = 31%) and n is the number of testers.

I encourage you to run the numbers here. A simple spreadsheet will help you see how the value changes for increasing numbers of testers. What you'll find is a curve that grows quickly then slowly approaches 100%.
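
If you'd rather run the numbers at the command line than in a spreadsheet, a short script produces the same curve. This is just the formula above, assuming Nielsen's L = 31%:

#!/bin/bash
# nielsen.sh - proportion of usability issues found by n testers,
# using 1 - (1 - L)^n with Nielsen's estimate of L = 0.31

awk 'BEGIN {
  L = 0.31
  for (n = 1; n <= 15; n++) {
    printf "n=%d\t%.0f%%\n", n, (1 - (1 - L)^n) * 100
  }
}'

At n = 5 the output is about 84 percent, in line with the roughly 85% that Nielsen cites.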


Note that at five testers, you have uncovered about 85% of the issues. Nielsen's curve suggests diminishing returns at higher numbers of testers. As you add testers, you'll certainly discover more usability issues, but the increment gets smaller each time. Hence Nielsen's recommendation for five testers.

Again, the reason that five is a good number is because of overlap of results. Each tester will help you identify a certain number of usability issues, given a good test design and high quality scenario tasks. The next tester will identify some of the same issues, plus a few others. And as you add testers, you'll continue to have some overlap, and continue to expand into new territory.

Let me help you visualize this. We can create a simple program to show this overlap. I wrote a Bash script to generate SVG files with varying numbers of overlapping red squares. Each red square covers about 31% of the gray background.
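
A minimal sketch of the idea looks something like this (not my exact script, but close enough to show the approach; the 400×400 canvas, the semi-transparent fill, and the 223-pixel squares are choices for this sketch, since 223×223 is about 31% of a 400×400 background):

#!/bin/bash
# overlap.sh - draw n red squares at random positions over a gray background.
# Each 223x223 square covers about 31% of the 400x400 background.
# Usage: overlap.sh n > overlap.svg

n=$1
size=223
max=$(( 400 - size ))   # keep each square inside the background

echo '<svg xmlns="http://www.w3.org/2000/svg" width="400" height="400">'
echo '<rect x="0" y="0" width="400" height="400" fill="gray"/>'

for i in $(seq 1 $n); do
  x=$(( RANDOM % (max + 1) ))
  y=$(( RANDOM % (max + 1) ))
  echo "<rect x=\"$x\" y=\"$y\" width=\"$size\" height=\"$size\" fill=\"red\" fill-opacity=\"0.5\"/>"
done

echo '</svg>'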


If you run this script, you should see output that looks something like this, for different values of n. Each image starts over; the iterations are not additive:

n=1

n=2

n=3

n=4

n=5

n=10

n=15

As you increase the number of testers, you cover more of the gray background, and you also have more overlap. The increase in coverage is quite dramatic from one to five, but compare five to ten: certainly there's more coverage (and more overlap) at ten than at five, but not significantly more. And the same going from ten to fifteen.

These visuals aren't meant to be an exact representation of the Nielsen iteration curve, but they do help show how adding more testers gives significant return up to a point, and then adding more testers doesn't really get you much more.

The core takeaway is that it doesn't take many testers to get results that are "good enough" to improve your design. The key idea is that you should do usability testing iteratively with your design process. I think every usability researcher would agree. Ellen Francik, writing for Human Factors, refers to this process as the Rapid Iterative Testing and Evaluation (RITE) method, arguing "small tests are intended to deliver design guidance in a timely way throughout development." (emphasis mine)

Don't wait until the end to do your usability tests. By then, it's probably too late to make substantive changes to your design, anyway. Instead, test your design as you go: create (or update) your design, do a usability test, tweak the design based on the results, test it again, tweak it again, and so on. After a few iterations, you will have a design that works well for most users.

Sunday, April 2, 2017

A throwback theme for gedit

This isn't exactly about usability, but I wanted to share it with you anyway.

I've been involved in a lot of open source software projects since about 1993. You may know that I'm also the founder and coordinator of the FreeDOS Project; I started that project in 1994 to write a free version of DOS that anyone could use.

DOS is an old operating system. It runs entirely in text mode. So anyone who was a DOS user "back in the day" should remember text mode and the prevalence of white-on-blue text.

For April 1, we used a new "throwback" theme on the FreeDOS website. We rendered the site using old-style DOS colors, with a monospace DOS VGA font.

Even though the redesign was meant only for a day, I sort of loved the new design. This made me nostalgic for using the DOS console: editing text in that white-on-blue, without the "distraction" of other fonts or the glare of modern black-on-white text.

So I decided to create a new theme for gedit, based on the DOS throwback theme. Here's a screenshot of gedit editing a Bash script, and editing the XML theme file itself:



The theme uses the same sixteen-color palette as DOS. You can find the explanation of why DOS has sixteen colors at the FreeDOS blog. I find the white-on-blue text to be calming, and easy on the eyes.

Of course, to make this a true callback to earlier days of computing, I used a custom font. On my computer, I used Mateusz Viste's DOSEGA font. Mateusz created this font by redrawing each glyph in Fontforge, using the original DOS CPI files as a model. I think it's really easy to read. (Download DOSEGA here: dosega.zip)

Want to create this on your own system? Here's the XML source to the theme file. Save this in ~/.local/share/gtksourceview-3.0/styles/dosedit.xml and gedit should find it as a new theme.
<?xml version="1.0" encoding="UTF-8"?>
<!--
  reference: https://developer.gnome.org/gtksourceview/stable/style-reference.html
-->
<style-scheme id="dos-edit" name="DOS Edit" version="1.0">
<author>Jim Hall</author>
<description>Color scheme using DOS Edit color palette</description>
<!--
  Emulate colors used in a DOS Editor. For best results, use a monospaced font
  like DOSEGA.
-->

<!-- Color Palette -->

<color name="black"           value="#000"/>
<color name="blue"            value="#00A"/>
<color name="green"           value="#0A0"/>
<color name="cyan"            value="#0AA"/>
<color name="red"             value="#A00"/>
<color name="magenta"         value="#A0A"/>
<color name="brown"           value="#A50"/>
<color name="white"           value="#AAA"/>
<color name="brightblack"     value="#555"/>
<color name="brightblue"      value="#55F"/>
<color name="brightgreen"     value="#5F5"/>
<color name="brightcyan"      value="#5FF"/>
<color name="brightred"       value="#F55"/>
<color name="brightmagenta"   value="#F5F"/>
<color name="brightyellow"    value="#FF5"/>
<color name="brightwhite"     value="#FFF"/>

<!-- Settings -->

<style name="text"                 foreground="white" background="blue"/>
<style name="selection"            foreground="blue" background="white"/>
<style name="selection-unfocused"  foreground="black" background="white"/>

<style name="cursor"               foreground="brown"/>
<style name="secondary-cursor"     foreground="magenta"/>

<style name="current-line"         background="black"/>
<style name="line-numbers"         foreground="black" background="white"/>
<style name="current-line-number"  background="cyan"/>

<style name="bracket-match"        foreground="brightwhite" background="cyan"/>
<style name="bracket-mismatch"     foreground="brightyellow" background="red"/>

<style name="right-margin"         foreground="white" background="blue"/>
<style name="draw-spaces"          foreground="green"/>
<style name="background-pattern"   background="black"/>

<!-- Extra Settings -->

<style name="def:base-n-integer"   foreground="cyan"/>
<style name="def:boolean"          foreground="cyan"/>
<style name="def:builtin"          foreground="brightwhite"/>
<style name="def:character"        foreground="red"/>
<style name="def:comment"          foreground="green"/>
<style name="def:complex"          foreground="cyan"/>
<style name="def:constant"         foreground="cyan"/>
<style name="def:decimal"          foreground="cyan"/>
<style name="def:doc-comment"      foreground="green"/>
<style name="def:doc-comment-element" foreground="green"/>
<style name="def:error"            foreground="brightwhite" background="red"/>
<style name="def:floating-point"   foreground="cyan"/>
<style name="def:function"         foreground="cyan"/>
<style name="def:heading0"         foreground="brightyellow"/>
<style name="def:heading1"         foreground="brightyellow"/>
<style name="def:heading2"         foreground="brightyellow"/>
<style name="def:heading3"         foreground="brightyellow"/>
<style name="def:heading4"         foreground="brightyellow"/>
<style name="def:heading5"         foreground="brightyellow"/>
<style name="def:heading6"         foreground="brightyellow"/>
<style name="def:identifier"       foreground="brightyellow"/>
<style name="def:keyword"          foreground="brightyellow"/>
<style name="def:net-address-in-comment" foreground="brightgreen"/>
<style name="def:note"             foreground="green"/>
<style name="def:number"           foreground="cyan"/>
<style name="def:operator"         foreground="brightwhite"/>
<style name="def:preprocessor"     foreground="brightcyan"/>
<style name="def:shebang"          foreground="brightgreen"/>
<style name="def:special-char"     foreground="brightred"/>
<style name="def:special-constant" foreground="brightred"/>
<style name="def:specials"         foreground="brightmagenta"/>
<style name="def:statement"        foreground="brightmagenta"/>
<style name="def:string"           foreground="brightred"/>
<style name="def:type"             foreground="cyan"/>
<style name="def:underlined"       foreground="brightgreen"/>
<style name="def:variable"         foreground="cyan"/>
<style name="def:warning"          foreground="brightwhite" background="brown"/>

</style-scheme>
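
If you prefer to set everything up from a terminal, the steps look roughly like this. I'm assuming you saved the theme above as dosedit.xml in the current directory and downloaded dosega.zip, and that the zip contains font files that fontconfig can use:

# install the gedit (GtkSourceView) theme for the current user
mkdir -p ~/.local/share/gtksourceview-3.0/styles
cp dosedit.xml ~/.local/share/gtksourceview-3.0/styles/

# install the DOSEGA font for the current user, then refresh the font cache
mkdir -p ~/.local/share/fonts
unzip -o dosega.zip -d ~/.local/share/fonts/
fc-cache -f ~/.local/share/fonts

After that, restart gedit and select the "DOS Edit" color scheme and the DOSEGA font in gedit's preferences.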

Friday, March 31, 2017

Screencasts for usability testing

There's nothing like watching a real person use your software to finally understand the usability issues your software might have. It's hard to get that kind of feedback through surveys or other indirect methods. I find it's best to moderate a usability test with a few testers who run through a set of scenario tasks. By observing how they attempt to complete the scenario tasks, you can learn a lot about how real people use your software to do real tasks.

Armed with that information, you can tweak the user interface to make it easier to use. Through iteration (design, test, tweak, test, tweak, etc) you can quickly find a design that works well for everyone.

The simple way to moderate a usability test is to watch what the user is doing, and take notes about what they do. I recommend the "think aloud" protocol, where you ask the tester to talk about what they are doing: if they're looking for a Print button, they should say "I'm looking for a Print button" so I can make a note of it, and move the mouse to where they're looking, so I can see what they're doing and where they're looking. In my experience, testers adapt to this fairly quickly.

In addition to taking your own notes, you might try recording the test session. That allows you to go back to the recording later to see exactly what the tester was doing. And you can share the video with other developers in your project, so they can watch the usability test sessions.

Screencasts are surprisingly easy to do, at least under Linux. The GNOME desktop has a built-in screencast function, to capture a video of the computer's screen.

But if you're like me, you may not have known this feature existed. It's kind of hard to get to. Press Ctrl+Alt+Shift+R to start recording, then press Ctrl+Alt+Shift+R again to stop recording.

If that's hard for you to remember, there's also a GNOME Extension called EasyScreenCast that, as the name implies, makes screencasts really easy. Once you install the extension, you get a little menu that lets you start and stop recording, as well as set options. It's very straightforward. You can select a sound input to narrate what you are doing, and you can include webcam video for a picture-in-picture view.

Here's a sample video I recorded as part of the class that I'm teaching. I needed a way to walk students through the steps to activate Notebookbar View in LibreOffice 5.3. I also provided written steps, but there's nothing like showing rather than just explaining.



With screencasts, you can extend your usability testing. At the beginning of your session, before the tester begins the first task, start recording a screencast. Capture the audio from the laptop's microphone, too.

If you ask your tester to follow the "think aloud" protocol, the screencast will show you the mouse cursor, indicating where the tester is looking, and it will capture the audio, allowing you to hear what the tester was thinking. That provides invaluable evidence for your usability test.

I admit I haven't experimented with screencasts for usability testing yet, but I definitely want to do this the next time I mentor usability testing for Outreachy. I find a typical usability test can last upwards of forty-five minutes to an hour, depending on the scenario tasks. But if you have the disk space to hold the recording, I don't see why you couldn't use the screencast to record each tester in your usability test. Give it a try!

Monday, March 27, 2017

Testing LibreOffice 5.3 Notebookbar

I teach an online CSCI class about usability. The course is "The Usability of Open Source Software" and provides a background on free software and open source software, and uses that as a basis to teach usability. The rest of the class is a pretty standard CSCI usability class. We explore a few interesting cases in open source software as part of our discussion. And using open source software makes it really easy for the students to pick a program to study for their usability test final project.

I structured the class so that we learn about usability in the first half of the semester, then we practice usability in the second half. And now we are just past the halfway point.

Last week, my students worked on a usability test "mini-project." This is a usability test with one tester. By itself, that's not very useful. But the intention is for the students to experience what it's like to moderate their own usability test before they work on their usability test final project. In this way, the one-person usability test is intended to be a "dry run."

For the one-person usability test, every student moderates the same usability test on the same program. We are using LibreOffice 5.3 in Notebookbar View in Contextual Groups mode. (And LibreOffice released version 5.3.1 just before we started the usability test, but fortunately the user interface didn't change, at least in Notebookbar-Contextual Groups.) Students worked together to write scenario tasks for the usability test, and I selected eight of those scenario tasks.

By using the same scenario tasks on the same program, with one tester each, we can combine results to build an overall picture of LibreOffice's usability with the new user interface. Because the test was run by different moderators, this isn't statistically useful if you are writing an academic paper, and it's of questionable value as a qualitative measure. But I thought it would be interesting to share the results.

First, let's look at the scenario tasks. We started with one persona: an undergraduate student at a liberal arts university. Each student in my class contributed two use scenarios for LibreOffice 5.3, and three scenario tasks for each scenario. That gave a wide field of scenario tasks. There was quite a bit of overlap. And there was some variation on quality, with some great scenario tasks and some not-so-great scenario tasks.

I grouped the scenario tasks into themes, and selected eight scenario tasks that suited a "story" of a student working on a paper: a simple lab write-up for an Introduction to Physics class. I did minimal editing of the scenario tasks; I tried to leave them as-is. Most of the scenario tasks were of high quality. I included a few not-great scenario tasks so students could see how the quality of the scenario task can impact the quality of your results. So keep that in mind.

These are the scenario tasks we used. In addition to these tasks, students provided a sample lab report (every tester started with the same document) and a sample image. Every test was run in LibreOffice 5.3 or 5.3.1, which was already set to use Notebookbar View in Contextual Groups mode:
1. You’re writing a lab report for your Introduction to Physics class, but you need to change it to meet your professors formatting requirements. Change your text to use Times New Roman 12 pt. and center your title

2. There is a requirement of double spaced lines in MLA. The paper defaults to single spaced and needs to be adjusted. Change paper to double spaced.

3. After going through the paragraphs, you would like to add your drawn image at the top of your paper. Add the image stored at velocitydiagram.jpg to the top of the paper.

4. Proper header in the Document. Name, class, and date are needed to receive a grade for the week.

5. You've just finished a physics lab and have all of your data written out in a table in your notebook. The data measures the final velocity of a car going down a 1 meter ramp at 5, 10, 15, 20, and 25 degrees. Your professor wants your lab report to consist of a table of this data rather than hand-written notes. There’s a note in the document that says where to add the table.

[task also provided a 2×5 table of sample lab data]

6. You are reviewing your paper one last time before turning it into your professor. You notice some spelling errors which should not be in a professional paper. Correct the multiple spelling errors.

7. You want to save your notes so that you can look back on them when studying for the upcoming test. Save the document.

8. The report is all done! It is time to turn it in. However, the professor won’t accept Word documents and requires a PDF. Export the document as a PDF.
If those don't seem very groundbreaking, remember the point of the usability test "mini-project" was for the students to experience moderating their own usability test. I'd rather they make mistakes here, so they can learn from them before their final project.

Since each usability test was run with one tester, and we all used the same scenario tasks on the same version of LibreOffice, we can collate the results. I prefer to use a heat map to display the results of a usability test. The heat map doesn't replace the prose description of the usability test (what worked vs. what the challenges were) but it does provide a quick overview that allows focused discussion of the results.

In a heat map, each scenario task is on a separate row, and each tester is in a separate column. At each cell, if the tester was able to complete the task with little or no difficulty, you add a green block. Use yellow for some difficulty, and orange for greater difficulty. If the tester really struggled to complete the task, use a red block. Use black if the task was so difficult the tester was unable to complete the task.
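
If you don't want to color the cells by hand in a spreadsheet, a small script can build the heat map for you. This is only a sketch; it assumes a results.csv with one row per scenario task and one comma-separated rating per tester, on a 0 (little or no difficulty) to 4 (unable to complete) scale:

#!/bin/bash
# heatmap.sh - convert a CSV of task ratings into an HTML heat map.
# Each row is a scenario task, each column is a tester.
# Ratings: 0=green 1=yellow 2=orange 3=red 4=black
# Usage: heatmap.sh results.csv > heatmap.html

echo '<table border="1" style="border-collapse:collapse">'

awk -F, '{
  printf "<tr>"
  for (i = 1; i <= NF; i++) {
    color = "white"
    if ($i == 0) color = "green"
    if ($i == 1) color = "yellow"
    if ($i == 2) color = "orange"
    if ($i == 3) color = "red"
    if ($i == 4) color = "black"
    printf "<td style=\"background:%s\">&nbsp;&nbsp;&nbsp;</td>", color
  }
  printf "</tr>\n"
}' "$1"

echo '</table>'

Open the HTML file in a browser and you can read it the same way as the heat map below: look across each row for "hot" colors.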

Here's our heat map, based on fourteen students each moderating a one-person usability test (a "dry run" test) using the same scenario tasks for LibreOffice 5.3 or 5.3.1:


A few things about this heat map:

Hot rows show you where to focus

Since scenario tasks are on rows, and testers are on columns, you read a heat map by looking across each row and looking for lots of "hot" items. Look for lots of black, red, or orange. Those are your "hot" rows. And rows that have a lot of green and maybe a little yellow are "cool" rows.

In this heat map, I'm seeing the most "hot" items in setting double space (#2), adding a table (#5) and checking spelling (#6). Maybe there's something in adding a header (#4) but this scenario task wasn't worded very well, so the problems here might be because of the scenario task.

So if I were a LibreOffice developer, and I did this usability test to examine the usability of MUFFIN, I would probably put most of my focus on making it easier to set double spacing, add tables, and check spelling. I wouldn't worry too much about adding an image, since that row is mostly green. Same for saving, and exporting to PDF.

The heat map doesn't replace prose description of themes

What's behind the "hot" rows? What were the testers trying to do, when they were working on these tasks? The heat map doesn't tell you that. The heat map isn't a replacement for prose text. Most usability results need to include a section about "What worked well" and "What needs improvement." The heat map doesn't replace that prose section. But it does help you to identify the areas that worked well vs the areas that need further refinement.

That discussion of themes is where you would identify that task 4 (Add a header) wasn't really a "hot" row. It looks interesting on the heat map, but this wasn't a problem area for LibreOffice. Instead, testers had problems understanding the scenario task. "Did the task want me to just put the text at the start of the document, or at the top of each page?" So results were inconsistent here. (That was expected, as this "dry run" test was a learning experience for my students. I intentionally included some scenario tasks that weren't great, so they would see for themselves how the quality of their scenario tasks can influence their test.)

Different versions are grouped together

LibreOffice released version 5.3.1 right before we started our usability test. Some students had already downloaded 5.3, and some ended up with 5.3.1. I didn't notice any user interface changes for the UI paths exercised by our scenario tasks, but did the new version have an impact?

I've sorted the results so the 5.3.1 columns are off to the right; see the column headers to tell which columns represent LibreOffice 5.3 and which are 5.3.1. I don't see any substantial difference between them. The "hot" rows from 5.3 are still "hot" in 5.3.1, and the "cool" rows are still "cool."

You might use a similar method to compare different iterations of a user interface. As your program progresses from 1.0 to 1.1 to 1.2, etc, you can compare the same scenario tasks by organizing your data in this way.

You could also group different testers together

The heat map also lets you discuss testers. What happened with tester #7? There's a lot of orange and yellow in that column, even for tasks (rows) that fared well with other testers. In this case, the interview revealed that tester was having a bad day, and came into the test feeling "grumpy" and likely was impatient about any problems encountered in the test.

You can use these columns to your advantage. In this test, all testers were drawn from the same demographic: a university student around 18-22 years old, who had some to "moderate" experience with Word or Google Docs, but not LibreOffice.

But if your usability test intentionally included a variety of experience levels (a group of "beginner" users, "moderate" users, and "experienced" users) you might group these columns appropriately in the heat map. So rather than grouping by version (as above) you could have one set of columns for "beginner" testers, another set of columns for "moderate" testers and a third group for "experienced" testers.

Tuesday, March 21, 2017

LibreOffice 5.3.1 is out

Last week, LibreOffice released version 5.3.1. This seems to be an incremental release over 5.3 and doesn't seem to change the new user interface in any noticeable way.

This is both good and bad news for me. As you know, I have been experimenting with LibreOffice 5.3 since LibreOffice updated the user interface. Version 5.3 introduced the "MUFFIN" interface. MUFFIN stands for My User Friendly Flexible INterface. Because someone clearly wanted that acronym to spell "MUFFIN." The new interface is still experimental, so you'll need to activate it through Settings→Advanced. When you restart LibreOffice, you can use the View menu to change modes.

So on the one hand, I'm very excited for the new release!

But on the other hand, the timing is not great. Next week would have been better. Clearly, LibreOffice did not have my interests in mind when they made this release.

You see, I teach an online CSCI class about the Usability of Open Source Software. Really, it's just a standard CSCI usability class. The topic is open source software because there are some interesting usability cases there that bear discussion. And it allows students to pick their own favorite open source software project that they use in a real usability test for their final project.

This week, we are doing a usability test "mini-project." This is a "dry run" for the students to do their own usability test for the first time. Each student is doing the test with one participant, but all using the same program. We're testing the new user interface in LibreOffice 5.3, using Notebookbar in Contextual Groups mode.

So we did all this work to prep for the usability test "mini-project" using LibreOffice 5.3, only for the project to release version 5.3.1 right before we do the test. So that's great timing, there.

But I kid. And the new version 5.3.1 seems to have the same user interface paths in Notebookbar-Contextual Groups, so our test should yield the same results in 5.3 or 5.3.1.

This is an undergraduate class project, and will not generate statistically significant results like a formal usability test in academic research. But the results of our test may be useful, nonetheless. I'll share an overview of our results next week.

Saturday, March 18, 2017

Will miss GUADEC 2017

Registration is now open for GUADEC 2017! This year, the GNOME Users And Developers European Conference (GUADEC) will be hosted in beautiful Manchester, UK between 28th July and 2nd August.

Unfortunately, I can't make it.

I work in local government, and just like last year, GUADEC falls during our budget time at the county. Our county budget is on a biennium. That means during an "on" year, we make our budget proposals for the next two years. In the "off" year, we share a budget status.

I missed GUADEC last year because I was giving a budget status in our "off" year. And guess what? This year, department budget presentations again happen during GUADEC.

During GUADEC, I'll be making our budget proposal for IT. This is our one opportunity to share with the Board our budget priorities for the next two years, and to defend any budget adjustment. I can't miss this meeting.