Ten recommendations for software engineering in research
GigaScience volume 3, Article number: 31 (2014)
Research in the context of data-driven science requires a backbone of well-written software, but scientific researchers are typically not trained at length in software engineering, the principles for creating better software products. To address this gap, in particular for young researchers new to programming, we give ten recommendations to ensure the usability, sustainability and practicality of research software.
Scientific research increasingly harnesses computing as a platform , and the size, complexity, diversity and relatively high availability of research datasets in a variety of formats is a strong driver to deliver well-designed, efficient and maintainable software and tools. As the frontier of science evolves, new tools constantly need to be written; however scientists, in particular early-career researchers, might not have received training in software engineering , thus their code is in jeopardy of being difficult and costly to maintain and re-use.
To address this gap, we have compiled ten brief software engineering recommendations.
1. Keep it simple
Every software project starts somewhere. A rule of thumb is to start as simply as you possibly can. Significantly more problems are created by over-engineering than under-engineering. Simplicity starts with design: a clean and elegant data model is a kind of simplicity that leads naturally to efficient algorithms.
Do the simplest thing that could possibly work, and then double-check it really does work.
2. Test, test, test
For objectivity, large software development efforts assign different people to test software than those who develop it. This is a luxury not available in most research labs, but there are robust testing strategies available to even the smallest project.
Unit tests are software tests which are executed automatically on a regular basis. In test driven development, the tests are written first, serving as a specification and checking every aspect of the intended functionality as it is developed . One must make sure that unit tests exhaustively simulate all possible – not only that which seems reasonable – inputs to each method.
3. Do not repeat yourself
Do not be tempted to use the copy-paste-modify coding technique when you encounter similar requirements. Even though this seems to be the simplest approach, it will not remain simple, because important lines of code will end up duplicated. When making changes, you will have to do them twice, taking twice as long, and you may forget an obscure place to which you copied that code, leaving a bug.
Automated tools, such as Simian , can help to detect and fix duplication in existing codebases. To fix duplications or bugs, consider writing a library with methods that can be called when needed.
4. Use a modular design
Modules act as building blocks that can be glued together to achieve overall system functionality. They hide the details of their implementation behind a public interface, which provides all the methods that should be used. Users should code – and test – to the interface rather than the implementation . Thus, concrete implementation details can change without impacting downstream users of the module. Application programming interfaces (APIs) can be shared between different implementation providers.
Scrutinise modules and libraries that already exist for the functionality you need. Do not rewrite what you can profitably re-use – and do not be put off if the best candidate third-party library contains more functionality than you need (now).
5. Involve your users
Users know what they need software to do. Let them try the software as early as possible, and make it easy for them to give feedback, via a mailing list or an issue tracker. In an open source software development paradigm, your users can become co-developers. In closed-source and commercial paradigms, you can offer early-access beta releases to a trusted group.
Many sophisticated methods have been developed for user experience analysis. For example, you could hold an interactive workshop .
6. Resist gold plating
Sometimes, users ask for too much, leading to feature creep or “gold plating”. Learn to tell the difference between essential features and the long list of wishes users may have. Prioritise aggressively with as broad a collection of stakeholders as possible, perhaps using “game-storming” techniques .
Gold plating is a challenge in all phases of development, not only in the early stages of requirements analysis. In its most mischievous disguise, just a little something is added in every iterative project meeting. Those little somethings add up.
7. Document everything
Comprehensive documentation helps other developers who may take over your code, and will also help you in the future. Use code comments for in-line documentation, especially for any technically challenging blocks, and public interface methods. However, there is no need for comments that mirror the exact detail of code line-by-line.
It is better to have two or three lines of code that are easy to understand than to have one incomprehensible line, for example see Figure 1.
Write an easily accessible module guide for each module, explaining the higher level view: what is the purpose of this module? How does it fit together with other modules? How does one get started using it?
8. Avoid spaghetti
Since GOTO-like commands fell justifiably out of favour several decades ago , you might believe that spaghetti code is a thing of the past. However, a similar phenomenon may be observed in inter-method and inter-module relationships (see Figures 3 and 4). Debugging – stepping through your code as it executes line by line – can help you diagnose modern-day spaghetti code. Beware of module designs where for every unit of functionality you have to step through several different modules to discover where the error is, and along the way you have long lost the record of what the original method was actually doing or what the erroneous input was. The use of effective and granular logging is another way to trace and diagnose problems with the flow through code modules.
9. Optimise last
Beware of optimising too early. Although research applications are often performance-critical, until you truly encounter the wide range of inputs that your software will eventually run against in the production environment, it may not be possible to anticipate where the real bottlenecks will lie. Develop the correct functionality first, deploy it and then continuously improve it using repeated evaluation of the system running time as a guide (while your unit tests keep checking that the system is doing what it should).
10. Evolution, not revolution
Maintenance becomes harder as a system gets older. Take time on a regular basis to revisit the codebase, and consider whether it can be renovated and improved . However, the urge to rewrite an entire system from the beginning should be avoided, unless it is really the only option or the system is very small. Be pragmatic  – you may never finish the rewrite . This is especially true for systems that were written without following the preceding recommendations.
Effective software engineering is a challenge in any enterprise, but may be even more so in the research context. Among other reasons, the research context can encourage a rapid turnover of staff, with the result that knowledge about legacy systems is lost. There can be a shortage of software engineering-specific training, and the “publish or perish” culture may incentivise taking shortcuts.
The recommendations above give a brief introduction to established best practices in software engineering that may serve as a useful reference. Some of these recommendations may be debated in some contexts, but nevertheless are important to understand and master. To learn more, Table 1 lists some additional online and educational resources.
Goble C: Better software, better research. IEEE Internet Comput. 2014, 18: 4-8.
Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, Haddock SHD, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P: Best practices for scientific computing. PLoS Biol. 2014, 12: e1001745-10.1371/journal.pbio.1001745.
Beck K: Test Driven Development. 2002, New York: Addison-Wesley
Simian: Similarity Analyzer. [http://www.harukizaemon.com/simian/]
Bloch J: Effective Java. 2008, New York: Addison-Wesley
Pavelin K, Pundir S, Cham JA: Ten simple rules for running interactive workshops. PLoS Comput Biol. 2014, 10: e1003485-10.1371/journal.pcbi.1003485.
Gray D, Brown S, Macanufo J: Game-storming: A playbook for innovators rulebreakers and changemakers. 2010, Sebastopol, CA: O’Reilly
Martin RC: Clean Code: A Handbook of Agile Software Craftsmanship. 2008, New York: Prentice Hall
Dijkstra EW: Letters to the editor: go to statement considered harmful. Commun ACM. 1968, 11 (3): 147-148. 10.1145/362929.362947.
Fowler M: Refactoring: Improving the Design of Existing Code. 1999, Reading, Massachusetts: Addison Wesley
Hunt A, Thomas D: The Pragmatic Programmer. 2000, Reading, Massachusetts: Addison Wesley Longman
Brooks FP: The Mythical Man Month. 1975, New York: Addison-Wesley
The Git distributed version control management system. [http://git-scm.com/]
GitHub: An online collaboration platform for Git version-controlled projects. [http://github.com/]
This commentary is based on a presentation given by JH at a workshop on Software Engineering held at the 2014 annual Metabolomics conference in Tsuruoka, Japan. The authors would like to thank Saravanan Dayalan for organising the workshop and giving JH the opportunity to present. We would furthermore like to thank Robert P. Davey and Chris Mungall for their careful and helpful reviews of an earlier version of this manuscript.
The authors declare that they have no competing interests.
JH prepared the initial draft. All authors contributed to, and have read and approved, the final version.