Using the DHIS2 pyramid in ODK forms

Jonathan Niles

14 Jan 2025 — 8 min read

One of the hardest parts of data collection, particularly in low-resource environments, is ensuring the accuracy of the information being recorded. Often, not enough time is dedicated to validating data collection instruments, training data collectors, and thinking through data analysis workflows before the data collection tools are deployed. Since many aspects of data collection campaigns are out of your control (timing, recruitment of personnel, etc.), it is extremely important that the elements within your control are optimized towards generating the best data possible.

There are several dimensions of data quality (see the WHO's handbook on Data Quality Assurance for a more in-depth treatise), including timeliness, completeness and accuracy. In the case of ODK surveys, the main issues regarding data quality are completeness and accuracy. If a survey takes too long to complete, data collectors may simply skip sections. Similarly, if a survey is too reliant on manual text input, the data accuracy may fall due to typos, use of shorthand, lack of consistent capitalization/punctuation, or inability to faithfully record data with non-English characters.

Therefore, we want to automate as much of the data entry as possible. Instead of asking a user to type in the health facility name in a survey, you should pre-populate a list of health facilities and let them choose one! Drop down selects are perhaps the most powerful tool in the data collection arsenal. However, they often need to be accompanied by an option to type a value if the sought value is not contained within the option list. Data can often become stale, or be incorrect to begin with, so giving a user the ability to record the correct information rather than forcing them to choose an incorrect option is always the best approach.

In this post, we'll look at creating an ODK form for collecting geographic data in the Democratic Republic of the Congo (DRC) using the DHIS2 health pyramid as the geographic reference. As discussed previously, the geographic boundaries in the DRC are Country > Province > Health Zone > Health Area. Health facilities are nested within health areas. We are often interested in gathering data at the health area level in the DRC, since these boundaries are often the de facto reporting boundaries for other public health statistics through the national health information system (SNIS).

Here is what we hope to accomplish by the end of this post:

Build an ODK form with nested select elements representing geographic boundaries in the DRC.
Data collectors should be able to complete the form by choosing the largest geographic boundary first and then being presented with only the boundaries it contains. That is to say, the form should be filtered such that only the health zones in the previously selected province are presented, and so on down the health pyramid, to the health facility level.
Data collectors should have the option to indicate "Not shown" on the ODK form to type in their own health facility if the option is not shown.
All selected options should be linked to their respective DHIS2 identifiers so that we can quickly link data across different datasets.

Setting up the ODK Form

The preambles to all my ODK forms look identical. I always add the start date, end date, and device identifier for reference. These values are part of the XLSForm metadata and are silently added in the background. Here is what the preamble looks like:

type	name	label

start	start	The start date of data collection.
end	end	The end date of data collection.
deviceid	device_id	Device identifier for our records.

These allow me to better track when the data was collected and who collected it, in case there needs to be some follow up clarification.

Configuring the geolocation dropdowns

Next, we'll add a section for the geolocation information. This is where part of the magic happens, since we are going to use the search() directive, which reads the choices from an external CSV file that we can update later as needed. Here is the complete form, showing both the survey sheet and the choices sheet:

type	name	label	relevant	appearance
start	start	The start date of data collection.
end	end	The end date of data collection.
deviceid	device_id	Device identifier for our records.

begin group	geolocation	Lorsque vous êtes devant la structure
select_one provinces	province	Choisissez la province		search('health-facilities')
select_one health_zones	health_zone	Choisissez la zone de santé		search('health-facilities', 'matches', 'province', ${province})
select_one health_areas	health_area	Choisissez l'aire de santé		search('health-facilities', 'matches', 'province', ${province}, 'health_zone', ${health_zone})
select_one health_facility	health_facility	Choisissez la structure		search('health-facilities', 'matches', 'health_zone', ${health_zone}, 'health_area', ${health_area})
text	health_facility_other	Saisissez le nom de la structure	${health_facility} = 'hf_other'
end group	geolocation

The Survey Worksheet

list_name	name	label
provinces	province	province
health_zones	health_zone	health_zone
health_areas	health_area	health_area
villages	village	village

The Choices Worksheet

Let's walk through this form.

I always put identification information in its own block, usually colored with a bright background color to make sure I can easily identify it later. The geolocation block is also given its own background color for visual reference.

Each select_one directive looks like the typical ODK select_one input, except that it has a search() directive in the appearance field, and there is only one entry in the choices worksheet where the typical list of options would appear. This is because the dropdown list choices are not pulled from the choices worksheet, but instead from the filename specified as the first parameter in the search() function. In this case, the filename is health-facilities.csv (the .csv suffix is implied). This is a file we'll have to provide when we upload the form to ODK Central. I prefer to reuse a single file, since it is easier to ensure consistency (i.e. so that I don't make a mistake) with a single file than with multiple files for provinces, health zones, health areas, etc.
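To make the single-file layout concrete, here is a minimal sketch in R of what such a file could look like. The column names are assumptions inferred from the search() expressions in the form, the place names are invented, and the file is written to a temporary directory rather than uploaded anywhere.

```r
# Toy stand-in for health-facilities.csv: one row per health facility.
# Column names are assumed from the search() expressions in the form;
# the place names are invented for illustration.
facilities <- data.frame(
  province        = c("Province A", "Province A", "Province A", "Province B"),
  health_zone     = c("Zone 1", "Zone 1", "Zone 2", "Zone 3"),
  health_area     = c("Aire 1", "Aire 1", "Aire 2", "Aire 3"),
  health_facility = c("CS Alpha", "PS Beta", "CS Gamma", "HGR Delta"),
  stringsAsFactors = FALSE
)

# Write the file to a temporary location, as a stand-in for the attachment
# that would be uploaded alongside the form.
csv_path <- file.path(tempdir(), "health-facilities.csv")
write.csv(facilities, csv_path, row.names = FALSE)

# Reading it back shows the flat, one-row-per-facility layout.
print(read.csv(csv_path))
```

Every row repeats its province, health zone, and health area, which is what lets a single file serve all four dropdowns.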

Using a single file means we need to tell ODK which column to treat as the label (to display to the user), which column to treat as the value (to record upon user selection), and which values to use as filters (to narrow down which row to select). The search() documentation is a good reference for these parameters, but I'll cover what is happening above.

In the first case, select_one provinces, we are using search('health-facilities'). This tells ODK to get the drop down values from the health-facilities.csv file. But which values should be displayed? Those are given in the choices worksheet for the list_name corresponding to the keyword provinces. That row tells ODK to display the province CSV column as the drop down label and also to use the province CSV column as the value for the selection.

But, what about repeat provinces? Well, thankfully, when duplicate values are encountered, they are ignored. This means that we can have a CSV file with thousands of rows for just a few provinces, and only the unique provinces will be displayed to the data collector. Neat! Incidentally, you cannot use select_one_from_file for this reason - it does not ignore or collapse duplicates.

In the second case, select_one health_zones, we are using search('health-facilities', 'matches', 'province', ${province}). As before, we specify the file in the first parameter, and ODK uses the choices worksheet to figure out what values to display as choices to the user. In addition to those settings, we include parameters to filter the choices presented to the user. The 'matches', 'province', ${province} portion of the search() expression tells ODK to use an exact match between the province column of the CSV and the value of the stored ODK variable ${province} that was set in the previous question. In plain words, this means that the only choices presented to the data collector will be the health zones found in the previously selected province.
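The behavior described in the first two cases can be imitated in a few lines of R. This is only an approximation of what the form does, not ODK's implementation: the column names are assumed from the search() expressions, and the data is invented.

```r
# Invented rows standing in for health-facilities.csv (hypothetical names).
facilities <- data.frame(
  province    = c("Province A", "Province A", "Province A", "Province B"),
  health_zone = c("Zone 1", "Zone 1", "Zone 2", "Zone 3"),
  stringsAsFactors = FALSE
)

# Roughly what search('health-facilities') offers for the province question:
# every value in the province column, with duplicates collapsed.
province_choices <- unique(facilities$province)

# Roughly what search('health-facilities', 'matches', 'province', ${province})
# offers for the health zone question once a province has been chosen.
selected_province <- "Province A"
zone_choices <- unique(
  facilities$health_zone[facilities$province == selected_province]
)

print(province_choices)
print(zone_choices)
```

With this toy data, the province list contains two entries even though the file has four rows, and choosing "Province A" leaves only "Zone 1" and "Zone 2" as health zone choices.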

The third and fourth cases for health_area and health_facility are simply variations of the second case, though with other filter options specified. In a real data collection, you probably would not want to use the same CSV column as both the label and the value of the dropdown menu, to avoid collisions due to duplicates. In a future blog post, I'll describe how to modify the above syntax to account for that.

Finally, we tackle how to handle "other" values, or options where the data collector may need to add a value that they didn't find in the list. It's always important to provide this option. Sometimes the data that the form was built with is outdated or incomplete, or the data collector may have accidentally selected a different health zone and become confused about why the expected facilities were not showing up in the drop down. In this case, they should input a new value. Not having this option will cause them to either not record the information or, worse, record incorrect information.

In the example form, the "other" category has a value of hf_other. So, if the health_facility selection is hf_other, we'll add a text box to allow a user to type in the health facility name. Notice, however, that the health facility choices have been filtered by province, health zone, and health area. This means we'll need to add a hf_other health facility line to our health-facilities.csv for each province, health zone, and health area in the CSV dataset to ensure that the hf_other option is always displayed. We'll tackle that in the data step.

Getting the DHIS2 Pyramid

We'll use the previous code we've written in R to save the DHIS2 pyramid to a CSV. For reference, here is the code to do that:

# Pull in the DHIS2 org units
OrgUnits <- getDHIS2OrgUnitTree()

str(OrgUnits)
# 'data.frame':	23618 obs. of  10 variables:
# $ country_code: chr  "ymGeqzoPhN3" "ymGeqzoPhN3" "ymGeqzoPhN3" "ymGeqzoPhN3" ...
# $ country_name: chr  "République Démocratique du Congo" "République Démocratique du Congo" "République Démocratique du Congo" "République Démocratique du Congo" ...
# $ prov_code   : chr  "an1cK6GbbVw" "an1cK6GbbVw" "an1cK6GbbVw" "an1cK6GbbVw" ...
# $ prov_name   : chr  "lm Lomami Province" "lm Lomami Province" "lm Lomami Province" "lm Lomami Province" ...
# $ hz_code     : chr  "rg471npPLBo" "rg471npPLBo" "LREYTGTn3sk" "TQFaajAiBvD" ...
# $ hz_name     : chr  "lm Kamiji Zone de Santé" "lm Kamiji Zone de Santé" "lm Kanda Kanda Zone de Santé" "lm Ngandajika Zone de Santé" ...
# $ ha_code     : chr  "h1ZXAC2Rrmc" "qxTct8xFVSU" "KUknv6WSlWs" "Ag8APS0tLKf" ...
# $ ha_name     : chr  "lm Malenga Aire de Santé" "lm Kamiji Aire de Santé" "lm Mbala Cotongo Aire de Santé" "lm Heritage Aire de Santé" ...
# $ hf_name     : chr  "lm Malenga Centre de Santé" "lm Kamiji Hôpital Général de Référence" "lm Mbala Cotongo Centre de Santé" "lm Heritage Poste de Santé" ...

At this point, we might be tempted to write this structure to a CSV file called health-facilities.csv and be done. However, we still need to handle the case of hf_other. There are multiple ways of doing this, and here is just one:

library(dplyr)
library(readr)

# Create one "other" option for each health area: dropping the facility
# columns and de-duplicating leaves one row per health area.
hf_other <- OrgUnits |>
  select(-c(hf_name, hf_code)) |>
  distinct() |>
  mutate(
    hf_code = "hf_other",
    hf_name = "Autre structure.  Le nom n'est pas affiché."
  )

# Add the "other" options to the main dataset, matching columns by name.
OrgUnits <- OrgUnits |>
  bind_rows(hf_other)

# Finally, write to disk as a CSV.
write_csv(OrgUnits, "~/health-facilities.csv")
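Before uploading the file, it can be worth confirming that every health area really does have its hf_other row. The following is a minimal sketch of such a check on invented data. It assumes the ha_code and hf_code columns from the str() output above, and the toy data frame deliberately leaves one health area without an "other" row.

```r
# Toy stand-in for OrgUnits after the "other" rows were appended.
# The codes are invented; ha_3 is deliberately missing its hf_other row.
org_units <- data.frame(
  ha_code = c("ha_1", "ha_1", "ha_2", "ha_2", "ha_3"),
  hf_code = c("hf_a", "hf_other", "hf_b", "hf_other", "hf_c"),
  stringsAsFactors = FALSE
)

# Count the hf_other rows within each health area.
other_counts <- vapply(
  split(org_units$hf_code == "hf_other", org_units$ha_code),
  sum,
  integer(1)
)

# Any health area without exactly one hf_other row needs attention.
problem_areas <- names(other_counts)[other_counts != 1]
print(problem_areas)
```

Here the check flags "ha_3". On the real dataset, an empty result would suggest that the "other" option will appear no matter which health area the data collector selects.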

This brings us to the end of the form design process.

Closing thoughts

In my view, the search() API is an extremely powerful tool in the ODK toolbox that allows form designers to easily collect high-quality data. However, it seems to be deprecated in favor of select_one_from_file and list lookups. The main ODK documentation doesn't even mention search() among the form question types. Even so, to me, the main advantages of search() are:

A single file to generate, update, and maintain. Not only is this more compact, but keeping several files (e.g. one for provinces, one for health zones, etc.) in sync is tricky in practice.
For organizations managing multiple surveys, it is easy to copy and paste previous code to get the same result using a standard CSV backend.
An unbeatable user experience for data collectors. The options load quickly, and provide a natural flow through the geographic "pyramid" structure.

If you are able to control your ODK Central infrastructure and have a good testing process to ensure this functionality continues to work with each ODK Central update, I highly encourage you to take advantage of it.