Core data science skills: filling the gaps with community-developed workshops

By Miranda Prynne, 2 February, 2022
The scientific research apprenticeship model lacks training in core data science skills. Community-developed workshops delivered by academic peers can help to fill this gap, Alison Meynert and Edward Wallace explain

Core data science skills are needed for all kinds of scientific research. While many excellent resources are available, putting together a skills training programme suitable for your research institute is a challenge. Interdisciplinary research programmes attract students and staff with a wide range of background knowledge and skills. Graduate students are funded by a hodgepodge of schemes with different training requirements and support. Training of postdocs and early career researchers can be neglected, and many struggle to build the skills they need to progress their research and careers. Students and staff alike can start any time of year, though there is often a cohort of new students each autumn.

We are leading a UKRI-funded programme called Ed-DaSH to develop new workshop materials that are available to the whole research community. We are working with The Carpentries, an inclusive community teaching coding and data skills, whose pedagogical model of collaborative hands-on learning we have adopted in our workshops. The workshop topics include statistics, the FAIR (findability, accessibility, interoperability and reuse) principles of data management, and workflow management systems. Starting in autumn 2021, our institutes began using these new materials in core data science training programmes, focused initially on the PhD student intake but available to all staff and students.

Identify unmet training needs

What does your audience need to learn to fulfil their potential as researchers? Surveys are a good start, especially if these are short and easy to complete. For example, a survey of the MRC Human Genetics Unit’s parent Institute of Genetics and Cancer regarding statistical training needs and support found high demand for both one-to-one training and workshops.

However, a survey can only capture what you ask about and what people know they need right now. Future needs must also be anticipated, especially for early career researchers. Factor in observations from experts: what is the cutting edge doing? We observed a sea change towards workflow management systems in health and bioscience research, and a lack of training in how to incorporate them into everyday use.

Collect local training

What is your institute offering? And how is that related to the training needs you’ve identified? Your survey can tell you what training has helped in the past, and it is also helpful to collate existing post-training surveys. If training is offered via your institute’s graduate research programme, is it open only to the students on that programme, or could it be made more widely available? We had a number of locally developed workshops, such as an introduction to our university computer cluster and an introduction to genome browsers, that were well received and in high demand.

Look for community-developed training

What training and resources have other people developed that you can use? Don’t waste time reinventing the wheel. Training material developed by others in the research community is often freely available, adaptable and high quality. Even better, academic research communities will generally welcome contributions and feedback.

We believe that to foster a living curriculum, it is worth letting go of some control over its content. Data science is in the fortunate position of having access to the open source Carpentries workshop material. We use lessons from the Software Carpentry and Data Carpentry suites, covering the basics of the Unix shell, Python and R.

Don’t be afraid to adapt: we previously offered an internally developed genomics workshop, but we have replaced this with a more up-to-date Carpentries lesson. Lessons developed with the input of the wider research community are tested and updated by hundreds of instructors worldwide, making them easier to share across institutes. Our local Edinburgh Carpentries community facilitates collaboration.

Fill in the training gaps

Looking back at your training needs, what is missing? In our case, we felt that statistics, data management and workflow management systems most needed new material. If you have the capacity to begin right away and are interested in making your new material a community effort, talk to the Carpentries about their Incubator scheme, and take a look at their Curriculum Development Handbook. For funding, in addition to UKRI, the Software Sustainability Institute and, for the biosciences specifically, Elixir may have relevant schemes.

Timing and audience

When, where and how do you want to deliver your new training programme?

  1. Audience: are you primarily targeting a single intake of PhD students or a broader group of researchers?
  2. Intensity: a week or two of full-day sessions can be efficient to schedule, but our feedback has been that it can overload the learners. A mix of half- and full-day sessions spread over a longer period may improve retention and engagement.
  3. Timing: researchers are most engaged with training when they know they need it; for example, when they have data to analyse. Advanced topics may be more valuable later in a programme than loaded on to new PhD students’ first few weeks.
  4. Sequencing: what order to teach? Some workshops will have obvious prerequisites. For example, an introduction to the R programming language must happen before a workshop that teaches statistics through programming in R.
  5. External restrictions: what other requirements do your learners have? Look at the student handbook to work around anything already scheduled. Avoid deadlines for first-year reports, and big conferences in the field.
  6. Resource: who is available to teach the workshop? How much time do they have? How is that time funded? The Carpentries’ community approach helps by bringing back past learners as helpers, who then strengthen their own data skills by training others.
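
The sequencing point above can be treated as an ordering problem: list each workshop with its prerequisites, then derive an order in which no workshop comes before something it depends on. The following is a minimal sketch using Python's standard-library `graphlib` module (Python 3.9 or later). The workshop names and most of the dependencies are hypothetical, chosen for illustration; only the rule that an introduction to R precedes statistics in R comes from our own programme.

```python
from graphlib import TopologicalSorter

# Hypothetical workshop catalogue: each workshop maps to the set of
# workshops that must be taught before it. Names and dependencies are
# illustrative assumptions, not a real curriculum.
prerequisites = {
    "Unix shell": set(),
    "Introduction to R": set(),
    "Statistics in R": {"Introduction to R"},
    "Workflow management systems": {"Unix shell"},
}


def teaching_order(prereqs):
    """Return one valid order in which every workshop follows its prerequisites.

    Raises graphlib.CycleError if the prerequisites are circular.
    """
    return list(TopologicalSorter(prereqs).static_order())


order = teaching_order(prerequisites)
for position, workshop in enumerate(order, start=1):
    print(position, workshop)
```

Several orders may satisfy the same prerequisites, so the result is a starting point to adjust around the timing and external restrictions listed above, not a finished timetable.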

Training for the long term

Data science is here for the long term, and your programme will need to evolve with changing needs. Collecting feedback and, more importantly, acting on it will keep your programme relevant and effective. Community-developed materials help to spread the burden of keeping your lessons current, and you can pay it forward by contributing fixes and updates based on your experiences.

Alison Meynert is a senior research fellow and bioinformatics analysis core manager in the MRC Human Genetics Unit, and Edward Wallace is a Sir Henry Dale fellow in the School of Biological Sciences, both at the University of Edinburgh.


