The ten commandments of data science
- 3 January 2024
- #datascience
Here are the ten commandments of data science, encapsulating what I’ve learned in my career so far.
(1) Make friends with the users. You will have to understand the problem that the business wants to solve. To model it correctly, you must map out precisely what the assumptions are, what the goal is and how the results will be used. This requires the good graces of domain experts and future users. If they do not understand or want your model, or fear being replaced by it, they will resist it and the project might be doomed from the start—so be a good friend and listen to their concerns.
(2) Ask a lot of questions. Now that you are friendly with the users, you must ask them a lot of questions. Along the way, you might also have to teach them rudimentary data science. Do they need accurate predictions, or an inferential model? How important is model interpretability? Is it even a machine learning problem? It might not be; I’ve solved more real-world problems with bipartite graphs than neural networks. What loss do they want to minimize? It’s often not RMSE. How will the system be deployed, and how often will it run? A good checklist can help keep track of questions and answers.
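To make the bipartite-graph point concrete, here is a minimal sketch (the cost matrix is invented for illustration): assigning workers to tasks at minimum total cost with SciPy, no machine learning involved.

```python
# A sketch of a non-ML solution: matching on a bipartite graph.
# The cost matrix below is made up purely for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = cost of assigning worker i to task j (hypothetical numbers)
cost = np.array([
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
])

rows, cols = linear_sum_assignment(cost)  # optimal assignment
for worker, task in zip(rows, cols):
    print(f"worker {worker} -> task {task} (cost {cost[worker, task]})")
print("total cost:", cost[rows, cols].sum())
```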
(3) Make friends with the developers. You might not consider yourself a software developer, but you can learn a lot from them. Version control, clean code principles and testing are basic requirements for anyone working professionally with code. Learning about object orientation, big-O analysis, data structures and algorithms will certainly be worth your while. Befriend developers and learn from the way they work. Besides, you will likely have to deploy your model, and getting developer help at that stage makes life easier.
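As a small illustration of the testing habit, here is a sketch in the pytest style; `clip_outliers` is a hypothetical preprocessing function, not taken from any particular project.

```python
# A sketch of developer-style testing applied to data science code.
# `clip_outliers` is a hypothetical preprocessing function, invented here.
import numpy as np

def clip_outliers(x, lower=1.0, upper=99.0):
    """Clip values to the [lower, upper] percentile range."""
    lo, hi = np.percentile(x, [lower, upper])
    return np.clip(x, lo, hi)

def test_clip_outliers_bounds():
    x = np.array([0.0, 1.0, 2.0, 100.0])
    clipped = clip_outliers(x)
    assert clipped.min() >= np.percentile(x, 1.0)
    assert clipped.max() <= np.percentile(x, 99.0)
```

Run it with pytest and a regression in the preprocessing shows up as a failing test instead of a mystery in the model results.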
(4) Master the basics. Master linear algebra, calculus and probability. Then study algorithms, statistics, optimization and programming. These subjects are taught in universities worldwide for a reason—they form the foundation needed to solve real-world problems. Until you master the basics, ignore hot research papers, podcasts and Medium articles. Instead, buy the influential books, read them thoroughly and solve the problems. This will take years of study, but in the end you’ll have a solid foundation from which to learn almost anything, and most real-world problems will succumb to your diverse toolbox of solution techniques.
(5) Keep it simple. If in doubt, always favor simplicity. Reduce complex problems to simple parts, understand those, then build up complexity again as needed. I’ve seen neural networks with thousands of variables used in cases where a linear model with ten variables would’ve been perfectly competitive. Start with simple models and a basic IT architecture for deployment. Use NumPy, SciPy and scikit-learn instead of implementing algorithms. Get a simple model out the door, get feedback and iterate. This creates a tight feedback loop and maximizes the probability that your future efforts are well spent.
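A minimal sketch of such a starting point, assuming a standard regression setup on synthetic data: a regularized linear baseline from scikit-learn, out the door in a dozen lines.

```python
# A sketch of the 'keep it simple' starting point: a regularized linear
# baseline from scikit-learn, on synthetic data invented for illustration.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```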
(6) Perfect the pipeline. Build a close-to-perfect pipeline before worrying about model details. This forces you to think about the loss function, metrics, cross-validation strategy and data preprocessing before you start experimenting with models. This way the models become directly comparable. If you do it the other way around, models might be evaluated on different metrics or different cross-validation strategies—rendering the comparison meaningless.
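One way to arrange this, sketched with scikit-learn on placeholder data and models: fix the preprocessing, the cross-validation splits and the metric once, then swap estimators in and out so every comparison is apples to apples.

```python
# A sketch: fix preprocessing, CV strategy and metric first, then swap models.
# The dataset and the two models are placeholders for illustration.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # one CV strategy
scoring = "neg_root_mean_squared_error"                # one metric

for model in [Ridge(alpha=1.0), RandomForestRegressor(random_state=0)]:
    pipe = make_pipeline(StandardScaler(), model)      # one preprocessing step
    scores = cross_val_score(pipe, X, y, cv=cv, scoring=scoring)
    print(type(model).__name__, -scores.mean())
```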
(7) Be forthright. Be honest with yourself and with others. Decide on the test statistic and significance level before you run the analysis. Never peek at the test set. Within statistics there are countless ways of fooling yourself—and you are the easiest person to fool. Don’t exaggerate results when you present them. Many who lack statistical knowledge will have opinions about what you ought to do, so you must be firm yet humble.
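A sketch of what pre-committing looks like in practice, on simulated data: the test statistic and the significance level are fixed before the analysis runs, and the result is reported as-is.

```python
# A sketch of pre-committing: the test statistic and significance level
# are chosen *before* the analysis runs. The two samples are simulated.
import numpy as np
from scipy.stats import ttest_ind

ALPHA = 0.05  # decided up front, not after seeing the p-value

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=200)
treatment = rng.normal(loc=0.2, scale=1.0, size=200)

stat, p_value = ttest_ind(treatment, control)
print(f"t = {stat:.2f}, p = {p_value:.3f}, reject H0: {p_value < ALPHA}")
```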
(8) Balance practicality and purity. Research a problem before trying to solve it—almost all problems have been solved before. Your homemade solution will likely be orders of magnitude worse than the established algorithms. Discovering a clever paper can save you months or years of work. This is the argument in favor of research and theoretical purity. On the other hand, trying to perfectly model every imaginable aspect of a real-world problem is infeasible. Using a less-than-perfect solution beats researching some theoretical marvel for years if it never gets off the ground. You will have to find the balance between reading theory and writing code.
(9) Don’t be blinded by metrics. Machine learning competitions are all about optimizing some metric, such as RMSE, MAE or relative error. In the real world, model performance as measured by the error metric must be expertly balanced against the number of lines of code, the training time, the interpretability of the model, the complexity of deployment, the data preprocessing required, and so forth. Every project is different, and you must weigh improvements in error metrics against these other business concerns.
(10) Stay positive and search for problems. There are plenty of reasons to be skeptical of how data science is used. Surprisingly many corporate initiatives start with the solution and go looking for a problem. The importance of clean, relevant and accessible data is rarely understood. There’s an argument to be made that recommendation engines do little good, and there’s something eerie about predicting social outcomes for individuals. However, for every misapplication of data science there are plenty of sound and interesting problems waiting to be discovered and solved. Almost every business, government agency and institution is trying to synthesize information, understand uncertainty or make better decisions. Judicious use of mathematical models is almost always of help, and there is plenty of low-hanging fruit out there.