Recently I ran into some strange defects. They could not be reproduced consistently and seemed to appear at random. So what could I log in our defect management tool? “Every now and then the software seems to be unstable and gives these errors.” After some discussion with the development team, we decided to log the errors and assign them to me. I set out to find the root cause; the basic question was: “What went wrong?” Below I describe my search for the strange errors, which all turned out to have the same root cause.
Bug: This process sometimes terminates unexpectedly
So how could I reproduce the termination of the process? First, execute the process and see whether it works. After running the process 10 times without problems, I got bored with testing it manually and decided to automate the execution. A loop that executed the process 1000 times still did not produce a single abnormal termination. A positive side effect was that we could now easily generate test data, since this process was the starting point for some other processes. When other people saw how easy data generation had become, they wanted to use my script too. From that point on, the defect started to reappear more frequently.
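A loop like the one described above could look roughly like this. It is only a sketch: the real process and the way it was started are not shown here, so `run_repeatedly` and the stand-in command are hypothetical names I use purely for illustration.

```python
import subprocess
import sys


def run_repeatedly(command, runs):
    """Run `command` `runs` times and collect every abnormal termination."""
    failures = []
    for attempt in range(1, runs + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep the attempt number, exit code and error output for the defect report.
            failures.append((attempt, result.returncode, result.stderr.strip()))
    return failures


if __name__ == "__main__":
    # Hypothetical stand-in for the real process: a child Python that exits cleanly.
    command = [sys.executable, "-c", "print('record inserted')"]
    runs = 20  # my loop ran 1000 times; kept small here so the sketch finishes quickly
    failures = run_repeatedly(command, runs)
    print(f"{len(failures)} abnormal terminations out of {runs} runs")
```

Recording the attempt number and the error output for each failed run gives you something concrete to put in the defect report, instead of “every now and then it fails”.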
With more occurrences of the defect, it was time to check the logging again. The log stated: “Could not insert record into …”. This was the same as with the previous occurrences, but when the same process was restarted, it would work. So why couldn’t it insert the record in the first place? Suddenly I remembered a course I had attended on concurrency. The database sometimes locks a table while writing to it, to make sure there is only one edit at a time. Once I had distributed my script to more people, the process started to run in parallel instead of in a single thread. This was the cause of the problem! It proved to be the cause of all the inconsistent errors, and could now easily be resolved by correctly implementing a lock-key mechanism.
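The effect can be reproduced on a small scale. The sketch below uses SQLite purely as a stand-in, since the actual database, table, and locking mechanism of our system are not shown here; the `records` table and the function names are made up for the example. One writer holds a write transaction open while a second writer tries to insert. The second insert should fail with a lock error, and the same insert should succeed once the first writer has committed, which matches the "restart and it works" behaviour.

```python
import os
import sqlite3
import tempfile


def open_connection(path):
    # timeout=0: fail immediately instead of waiting for the lock to be released.
    # isolation_level=None: autocommit mode, so transactions are started explicitly.
    return sqlite3.connect(path, timeout=0, isolation_level=None)


def demo_lock_conflict():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    writer_a = open_connection(path)
    writer_b = open_connection(path)
    writer_a.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, source TEXT)")

    # Writer A starts a write transaction and keeps the write lock.
    writer_a.execute("BEGIN IMMEDIATE")
    writer_a.execute("INSERT INTO records (source) VALUES ('A')")

    # Writer B tries to insert while A still holds the lock.
    try:
        writer_b.execute("INSERT INTO records (source) VALUES ('B')")
        first_attempt = "ok"
    except sqlite3.OperationalError as error:
        first_attempt = str(error)

    # Once A commits, the lock is released and B's retry goes through.
    writer_a.execute("COMMIT")
    writer_b.execute("INSERT INTO records (source) VALUES ('B')")
    count = writer_b.execute("SELECT COUNT(*) FROM records").fetchone()[0]

    writer_a.close()
    writer_b.close()
    return first_attempt, count


if __name__ == "__main__":
    first_attempt, count = demo_lock_conflict()
    print("first attempt by writer B:", first_attempt)
    print("records after retry:", count)
```

In a single-threaded loop, writer A and writer B never overlap, which would explain why 1000 sequential runs showed nothing. The lock conflict only becomes visible when several people run the script at the same time.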