Ten recommendations for software engineering in research

Hastings, Janna; Haug, Kenneth; Steinbeck, Christoph

doi:10.1186/2047-217X-3-31

Commentary
Open access
Published: 04 December 2014

Ten recommendations for software engineering in research

Janna Hastings¹,
Kenneth Haug¹ &
Christoph Steinbeck¹

GigaScience volume 3, Article number: 31 (2014) Cite this article

4980 Accesses
9 Citations
37 Altmetric
Metrics details

Abstract

Research in the context of data-driven science requires a backbone of well-written software, but scientific researchers are typically not trained at length in software engineering, the principles for creating better software products. To address this gap, in particular for young researchers new to programming, we give ten recommendations to ensure the usability, sustainability and practicality of research software.

Peer Review reports

Background

Scientific research increasingly harnesses computing as a platform [1], and the size, complexity, diversity and relatively high availability of research datasets in a variety of formats is a strong driver to deliver well-designed, efficient and maintainable software and tools. As the frontier of science evolves, new tools constantly need to be written; however scientists, in particular early-career researchers, might not have received training in software engineering [2], thus their code is in jeopardy of being difficult and costly to maintain and re-use.

To address this gap, we have compiled ten brief software engineering recommendations.

Recommendations

1. Keep it simple

Every software project starts somewhere. A rule of thumb is to start as simply as you possibly can. Significantly more problems are created by over-engineering than under-engineering. Simplicity starts with design: a clean and elegant data model is a kind of simplicity that leads naturally to efficient algorithms.

Do the simplest thing that could possibly work, and then double-check it really does work.

2. Test, test, test

For objectivity, large software development efforts assign different people to test software than those who develop it. This is a luxury not available in most research labs, but there are robust testing strategies available to even the smallest project.

Unit tests are software tests which are executed automatically on a regular basis. In test driven development, the tests are written first, serving as a specification and checking every aspect of the intended functionality as it is developed [3]. One must make sure that unit tests exhaustively simulate all possible – not only that which seems reasonable – inputs to each method.

3. Do not repeat yourself

Do not be tempted to use the copy-paste-modify coding technique when you encounter similar requirements. Even though this seems to be the simplest approach, it will not remain simple, because important lines of code will end up duplicated. When making changes, you will have to do them twice, taking twice as long, and you may forget an obscure place to which you copied that code, leaving a bug.

Automated tools, such as Simian [4], can help to detect and fix duplication in existing codebases. To fix duplications or bugs, consider writing a library with methods that can be called when needed.

4. Use a modular design

Modules act as building blocks that can be glued together to achieve overall system functionality. They hide the details of their implementation behind a public interface, which provides all the methods that should be used. Users should code – and test – to the interface rather than the implementation [5]. Thus, concrete implementation details can change without impacting downstream users of the module. Application programming interfaces (APIs) can be shared between different implementation providers.

Scrutinise modules and libraries that already exist for the functionality you need. Do not rewrite what you can profitably re-use – and do not be put off if the best candidate third-party library contains more functionality than you need (now).

5. Involve your users

Users know what they need software to do. Let them try the software as early as possible, and make it easy for them to give feedback, via a mailing list or an issue tracker. In an open source software development paradigm, your users can become co-developers. In closed-source and commercial paradigms, you can offer early-access beta releases to a trusted group.

Many sophisticated methods have been developed for user experience analysis. For example, you could hold an interactive workshop [6].

6. Resist gold plating

Sometimes, users ask for too much, leading to feature creep or “gold plating”. Learn to tell the difference between essential features and the long list of wishes users may have. Prioritise aggressively with as broad a collection of stakeholders as possible, perhaps using “game-storming” techniques [7].

Gold plating is a challenge in all phases of development, not only in the early stages of requirements analysis. In its most mischievous disguise, just a little something is added in every iterative project meeting. Those little somethings add up.

7. Document everything

Comprehensive documentation helps other developers who may take over your code, and will also help you in the future. Use code comments for in-line documentation, especially for any technically challenging blocks, and public interface methods. However, there is no need for comments that mirror the exact detail of code line-by-line.

It is better to have two or three lines of code that are easy to understand than to have one incomprehensible line, for example see Figure 1.

Write clean code [8] that you would want to maintain long-term (Figure 2). Meaningful, readable variable and method names are a form of documentation.

Write an easily accessible module guide for each module, explaining the higher level view: what is the purpose of this module? How does it fit together with other modules? How does one get started using it?

8. Avoid spaghetti

Since GOTO-like commands fell justifiably out of favour several decades ago [9], you might believe that spaghetti code is a thing of the past. However, a similar phenomenon may be observed in inter-method and inter-module relationships (see Figures 3 and 4). Debugging – stepping through your code as it executes line by line – can help you diagnose modern-day spaghetti code. Beware of module designs where for every unit of functionality you have to step through several different modules to discover where the error is, and along the way you have long lost the record of what the original method was actually doing or what the erroneous input was. The use of effective and granular logging is another way to trace and diagnose problems with the flow through code modules.

9. Optimise last

Beware of optimising too early. Although research applications are often performance-critical, until you truly encounter the wide range of inputs that your software will eventually run against in the production environment, it may not be possible to anticipate where the real bottlenecks will lie. Develop the correct functionality first, deploy it and then continuously improve it using repeated evaluation of the system running time as a guide (while your unit tests keep checking that the system is doing what it should).

10. Evolution, not revolution

Maintenance becomes harder as a system gets older. Take time on a regular basis to revisit the codebase, and consider whether it can be renovated and improved [10]. However, the urge to rewrite an entire system from the beginning should be avoided, unless it is really the only option or the system is very small. Be pragmatic [11] – you may never finish the rewrite [12]. This is especially true for systems that were written without following the preceding recommendations.

Use a good version control system (e.g., Git [13]) and a central repository (e.g., GitHub [14]). In general, commit early and commit often, and not only when refactoring.

Conclusion

Effective software engineering is a challenge in any enterprise, but may be even more so in the research context. Among other reasons, the research context can encourage a rapid turnover of staff, with the result that knowledge about legacy systems is lost. There can be a shortage of software engineering-specific training, and the “publish or perish” culture may incentivise taking shortcuts.

The recommendations above give a brief introduction to established best practices in software engineering that may serve as a useful reference. Some of these recommendations may be debated in some contexts, but nevertheless are important to understand and master. To learn more, Table 1 lists some additional online and educational resources.

Table 1 Further reading

Full size table

References

Goble C: Better software, better research. IEEE Internet Comput. 2014, 18: 4-8.
Article Google Scholar
Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, Haddock SHD, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P: Best practices for scientific computing. PLoS Biol. 2014, 12: e1001745-10.1371/journal.pbio.1001745.
Article PubMed PubMed Central Google Scholar
Beck K: Test Driven Development. 2002, New York: Addison-Wesley
Google Scholar
Simian: Similarity Analyzer. [http://www.harukizaemon.com/simian/]
Bloch J: Effective Java. 2008, New York: Addison-Wesley
Google Scholar
Pavelin K, Pundir S, Cham JA: Ten simple rules for running interactive workshops. PLoS Comput Biol. 2014, 10: e1003485-10.1371/journal.pcbi.1003485.
Article PubMed PubMed Central Google Scholar
Gray D, Brown S, Macanufo J: Game-storming: A playbook for innovators rulebreakers and changemakers. 2010, Sebastopol, CA: O’Reilly
Google Scholar
Martin RC: Clean Code: A Handbook of Agile Software Craftsmanship. 2008, New York: Prentice Hall
Google Scholar
Dijkstra EW: Letters to the editor: go to statement considered harmful. Commun ACM. 1968, 11 (3): 147-148. 10.1145/362929.362947.
Article Google Scholar
Fowler M: Refactoring: Improving the Design of Existing Code. 1999, Reading, Massachusetts: Addison Wesley
Google Scholar
Hunt A, Thomas D: The Pragmatic Programmer. 2000, Reading, Massachusetts: Addison Wesley Longman
Google Scholar
Brooks FP: The Mythical Man Month. 1975, New York: Addison-Wesley
Book Google Scholar
The Git distributed version control management system. [http://git-scm.com/]
GitHub: An online collaboration platform for Git version-controlled projects. [http://github.com/]

Download references

Acknowledgements

This commentary is based on a presentation given by JH at a workshop on Software Engineering held at the 2014 annual Metabolomics conference in Tsuruoka, Japan. The authors would like to thank Saravanan Dayalan for organising the workshop and giving JH the opportunity to present. We would furthermore like to thank Robert P. Davey and Chris Mungall for their careful and helpful reviews of an earlier version of this manuscript.

Author information

Authors and Affiliations

Cheminformatics and Metabolism, European Molecular Biology Laboratory – European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD, Hinxton, UK
Janna Hastings, Kenneth Haug & Christoph Steinbeck

Authors

Janna Hastings
View author publications
You can also search for this author in PubMed Google Scholar
Kenneth Haug
View author publications
You can also search for this author in PubMed Google Scholar
Christoph Steinbeck
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Janna Hastings.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JH prepared the initial draft. All authors contributed to, and have read and approved, the final version.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Hastings, J., Haug, K. & Steinbeck, C. Ten recommendations for software engineering in research. GigaSci 3, 31 (2014). https://doi.org/10.1186/2047-217X-3-31

Download citation

Received: 25 September 2014
Accepted: 19 November 2014
Published: 04 December 2014
DOI: https://doi.org/10.1186/2047-217X-3-31

Ten recommendations for software engineering in research

Abstract

Background