Better tests
What makes for a good test? I feel like there is a dearth of useful literature on this subject, perhaps because much of the content that ostensibly aims to answer it ends up disappearing down one of the myriad rabbit holes beloved by people who are passionate about testing — whether it's TDD or BDD or property-based testing or whatever. I don't want to do that. I want to write a few bullet points instead.
The target audience: junior engineers who understand the why and the what of tests, but have not seen enough of them in mature engineering organizations to really grasp the how.
Tests should be as short as possible
More specifically: tests should be as short as possible relative to the uniqueness of the behavior they're trying to test.
Let me give you a simple example. Say we want to test that patching Buttondown's API with an empty subject fails if and only if the status of the email is not draft. Let's pretend we haven't written any tests for this API before. You start off by covering the base case:
def test_patching_email_with_empty_subject_fails_for_sent_email():
    user = User.objects.create(username="test")
    api_key = APIKey.objects.create(user=user)
    client = APIClient()
    client.credentials(HTTP_AUTHORIZATION=f"Token {api_key.key}")
    email = Email.objects.create(
        owner=user,
        subject="Hello",
        status=EmailStatus.SENT,
    )
    response = client.patch(
        f"/api/v1/emails/{email.id}",
        {"subject": ""},
    )
    assert response.status_code == 400
Then your coworker mentions that this test would still pass if you always threw that error — you need to verify the draft case too. So you copy and paste it, change the status, and check the other side of the boundary:
def test_patching_email_with_empty_subject_succeeds_for_draft_email():
    user = User.objects.create(username="test")
    api_key = APIKey.objects.create(user=user)
    client = APIClient()
    client.credentials(HTTP_AUTHORIZATION=f"Token {api_key.key}")
    email = Email.objects.create(
        owner=user,
        subject="Hello",
        status=EmailStatus.DRAFT,
    )
    response = client.patch(
        f"/api/v1/emails/{email.id}",
        {"subject": ""},
    )
    assert response.status_code == 200
Maybe you add two more for SENDING and IMPORTED to cover all the bases. Now you have four near-identical tests. I'd argue that this is bad — for two reasons that map to two audiences:
- The reader. At some point, someone will introduce a change that causes one of these tests to fail. They will begin by reading the test in full — or, I suppose, asking their LLM to do the same — at which point they will have to spend at least one nanosecond parsing out the client and authentication logic to figure out whether it's germane to the behavior at hand. Obviously, this is not the end of the world. But nanoseconds are important. And while this use case is trivial, it becomes much less so when the banal setup goes from four lines of code to twenty.
- The writer. I'm rounding up only a little when I say that literally every software developer in the world starts writing a test by copying and pasting the closest one to it. When you open this file and see that every single test starts with the same setup, you therefore assume that all subsequent tests should too — whether out of laziness or out of misplaced virtue. And now the problem persists. And when you eventually change a bit of that authentication logic, you have to change it in every single test.
Your testing framework might have an abstraction tailor-made for this. I'm fond of pytest, which uses fixtures — just pull the common functionality out and let the framework inject it:
@pytest.fixture
def authenticated_client():
    user = User.objects.create(username="test")
    api_key = APIKey.objects.create(user=user)
    client = APIClient()
    client.credentials(HTTP_AUTHORIZATION=f"Token {api_key.key}")
    return client, user
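(A quick housekeeping note: in a typical pytest layout, a fixture like this lives in a conftest.py, which pytest discovers automatically; every test module below that directory can then request it without any imports. A minimal sketch, with illustrative import paths, assuming the DRF setup implied above:)

# tests/conftest.py
# Fixtures defined here are injectable into any test module below this
# directory, with no imports required.
import pytest
from rest_framework.test import APIClient

from emails.models import APIKey  # illustrative path
from django.contrib.auth.models import User  # or your custom user model


@pytest.fixture
def authenticated_client():
    user = User.objects.create(username="test")
    api_key = APIKey.objects.create(user=user)
    client = APIClient()
    client.credentials(HTTP_AUTHORIZATION=f"Token {api_key.key}")
    return client, user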
Now that original test collapses down to just the behavior under test — the fixture name in the function signature is all pytest needs to wire it up:
def test_patching_email_with_empty_subject_fails_for_sent_email(
    authenticated_client,
):
    client, user = authenticated_client
    email = Email.objects.create(
        owner=user,
        subject="Hello",
        status=EmailStatus.SENT,
    )
    response = client.patch(
        f"/api/v1/emails/{email.id}",
        {"subject": ""},
    )
    assert response.status_code == 400
Better! But we can go further. A pattern that I'm particularly fond of is parameterization, in which you create a single metatest and feed it a table of inputs and expected outputs, isolating the variation as specifically as possible:
@pytest.mark.parametrize(
    "status, expected_code",
    [
        (EmailStatus.DRAFT, 200),
        (EmailStatus.SENT, 400),
        (EmailStatus.SENDING, 400),
        (EmailStatus.IMPORTED, 400),
    ],
)
def test_patching_email_with_empty_subject(
    authenticated_client, status, expected_code
):
    client, user = authenticated_client
    email = Email.objects.create(
        owner=user,
        subject="Hello",
        status=status,
    )
    response = client.patch(
        f"/api/v1/emails/{email.id}",
        {"subject": ""},
    )
    assert response.status_code == expected_code
Now the reader can see, at a glance, exactly what varies and exactly what doesn't. The authentication is someone else's problem. The behavioral boundary — the thing we actually care about — is front and center.
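One refinement worth knowing about while we're here: pytest auto-generates each case's test id from the parameter values, and when those ids get noisy, pytest.param lets you attach a readable one, so a failure names the scenario rather than the tuple. A sketch (the id strings are whatever you find descriptive):

@pytest.mark.parametrize(
    "status, expected_code",
    [
        pytest.param(EmailStatus.DRAFT, 200, id="draft-is-editable"),
        pytest.param(EmailStatus.SENT, 400, id="sent-is-locked"),
        pytest.param(EmailStatus.SENDING, 400, id="sending-is-locked"),
        pytest.param(EmailStatus.IMPORTED, 400, id="imported-is-locked"),
    ],
)
def test_patching_email_with_empty_subject(
    authenticated_client, status, expected_code
):
    ...  # body identical to the version above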
Use snapshots for full-context assertions
Snapshots are increasingly in vogue as a way to assert an entire response, rather than one characteristic of it, and then fail the test on any change to that response. Snapshot libraries also come bundled with tooling for reviewing and regenerating these snapshots.
These are nice, but not wholly necessary in every situation. The main idea is capturing the full context of the behavior under assertion: snapshot tests are great for outputs where looking at any one facet is necessary but insufficient for understanding the behavior as a whole. Consider the difference between:
def test_empty_subject_error(authenticated_client):
    client, user = authenticated_client
    email = Email.objects.create(
        owner=user,
        subject="Hello",
        status=EmailStatus.SENT,
    )
    response = client.patch(
        f"/api/v1/emails/{email.id}",
        {"subject": ""},
    )
    assert response.status_code == 400
    assert response.json()["code"] == "invalid_subject"
This tells you the endpoint returns a 400 with the right error code. But it tells you nothing about the rest of the response — the human-readable message, the field name, the error structure. If someone changes the error message from "Subject cannot be empty for sent emails" to "lol no", this test still passes. Compare:
def test_empty_subject_error(authenticated_client, snapshot):
    client, user = authenticated_client
    email = Email.objects.create(
        owner=user,
        subject="Hello",
        status=EmailStatus.SENT,
    )
    response = client.patch(
        f"/api/v1/emails/{email.id}",
        {"subject": ""},
    )
    assert response.status_code == 400
    assert response.json() == snapshot
Now the snapshot file captures the entire JSON blob — the code, the message, the field, the structure — and any change to any of it will surface in your diff. API responses, rendered HTML, and standalone components are all great candidates for snapshot testing.
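(Concretely: the snapshot fixture above matches the shape of syrupy's API, one such library; I'm assuming it here, but others work similarly. With syrupy, pytest --snapshot-update serializes the response into a __snapshots__ directory next to the test, producing a file roughly like the following, where the response fields are made up for illustration:)

# __snapshots__/test_emails.ambr
# name: test_empty_subject_error
  dict({
    'code': 'invalid_subject',
    'detail': 'Subject cannot be empty for sent emails.',
    'field': 'subject',
  })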
Only test meaningful differences
Referring back to the example I led with: if you fed that behavior to an LLM and asked it to write some tests, chances are it would not stop at the four tests that we outlined — it would also create a test for a status of IN_FLIGHT, and a test to see if the behavior remains the same when you also pass a body alongside the empty subject:
# Please don't do this.
@pytest.mark.parametrize(
    "status, subject, body, expected_code",
    [
        (EmailStatus.DRAFT, "", None, 200),
        (EmailStatus.DRAFT, "", "Some body", 200),
        (EmailStatus.DRAFT, "Hi", None, 200),
        (EmailStatus.DRAFT, "Hi", "Some body", 200),
        (EmailStatus.SENT, "", None, 400),
        (EmailStatus.SENT, "", "Some body", 400),
        (EmailStatus.SENT, "Hi", None, 200),
        (EmailStatus.SENDING, "", None, 400),
        (EmailStatus.SENDING, "", "Some body", 400),
        (EmailStatus.IMPORTED, "", None, 400),
        (EmailStatus.IMPORTED, "", "Some body", 400),
        # ... and on and on and on ...
    ],
)
def test_patching_email(
    authenticated_client, status, subject, body, expected_code
):
    ...
This frankly drives me nuts, because it adds noise and volume to your test suite without any real value. If you were to break the behavior in a subsequent commit, you'd suddenly see a dozen tests fail instead of one — and the signal-to-noise ratio of your failure output craters accordingly.
A good ideal to strive towards is the minimum number of test cases that truly explores the boundary space of the behavior. The best bug-fix PRs come with exactly one regression test: one that fails before the fix and passes after it.
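To make that concrete: if the parametrized test from earlier already exists, the ideal bug-fix PR often adds nothing but a single new row to its table. A sketch, using a hypothetical PAUSED status as the imagined bug:

@pytest.mark.parametrize(
    "status, expected_code",
    [
        (EmailStatus.DRAFT, 200),
        (EmailStatus.SENT, 400),
        (EmailStatus.SENDING, 400),
        (EmailStatus.IMPORTED, 400),
        # Regression: PAUSED (hypothetical) emails once slipped past the
        # empty-subject check. This row fails before the fix, passes after.
        (EmailStatus.PAUSED, 400),
    ],
)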
Tests should look good
The aesthetics of code are obviously subjective. But I don't think it's controversial to say that some code is messier than other code, even when none of it violates any principle of code organization. Let me pick on Python by way of example.
Django's default TestCase, like the Python unittest module it's built on, uses Java-style camelCase assertions (a relic of unittest's JUnit ancestry):
class TestEmail(TestCase):
    def test_something(self):
        self.assertEqual(response.status_code, 200)
        self.assertIsNone(email.subject)
        self.assertContains(response, "Hello")
        self.assertQuerysetEqual(
            Email.objects.all(),
            [self.email],
        )
This drives me nuts. I think it is actively painful to read assertQuerysetEqual in the context of a Python codebase. Instead, pytest just offers simple, bare assertions that are exactly as flexible as you would expect them to be:
def test_something():
    assert response.status_code == 200
    assert email.subject is None
    assert "Hello" in response.content.decode()
    assert list(Email.objects.all()) == [email]
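(Bare asserts give up nothing in failure output, either: pytest rewrites assertions at import time so that a failing one prints both sides of the comparison, roughly like so:)

>       assert response.status_code == 200
E       assert 400 == 200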
It's useful to take a little bit of initial time to make sure your tests are clean, because whatever conventions and prior art you establish get calcified into a suite very, very quickly. A junior engineer's first test will look like whatever test they copied — and that's not a flaw in the junior engineer, it's a feature of how codebases work. Make sure the thing they're copying is worth copying.
I know that it's increasingly in vogue to treat your codebase with a hint of nihilism — a sense that every line is, with enough patience and GPUs, destined to be subsumed into the void. I encourage you to resist that urge, if only for selfish reasons. The speed and efficacy of your application are going to be increasingly capped not by the quality of your implementation code, but by the quality of your verification code.