What went wrong?

Recently I discovered some strange defects. They could not be reproduced consistently and seemed to appear at random. So what could I log in our defect management tool? “Every now and then the software seems to be unstable and gives these errors.” After some discussion with the development team, we decided to log the errors and assign them to me. I tried to find the root cause of the problems; basically, the question was: “What went wrong?” Below I describe my search for the strange errors, which all turned out to have the same root cause.

Bug: This process sometimes terminates unexpectedly

So how could I reproduce the termination of the process? First, execute the process and see if it works. After running it 10 times without problems, I got bored with testing manually and decided to automate the execution. Even a loop that executed the process 1000 times did not produce an abnormally terminated process. A positive side effect was that we could now easily generate test data, since this process was the starting point for some other processes. When other people saw how easy data generation had become, they wanted to use my script too. From that point on, the defect started to reappear more frequently.

With more occurrences of the defect, it was time to check the logging again. The logging stated: “Could not insert record into …”. OK, this was the same as with the previous occurrences, but when the same process was restarted, it would work. So why couldn’t it insert the record in the first place? Suddenly I had to think of a course I attended on concurrency. The database sometimes locks a table while writing to it, to make sure there is only one edit at a time. When I distributed my script to more people, the process started to run in parallel instead of in a single thread. This proved to be the cause of all the inconsistent errors, and it could now easily be resolved by correctly implementing a lock-key mechanism.
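
To illustrate the idea, here is a minimal Python sketch of the situation. It is not the actual system: the Table class, its error message, and the run_process and run_parallel helpers are all hypothetical stand-ins. The table rejects a write while another write is in progress, and an optional guard lock makes callers wait their turn instead of failing.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Table:
    """Hypothetical stand-in for a database table that rejects overlapping writes."""

    def __init__(self):
        self._busy = threading.Lock()  # models the table lock held during a write
        self.rows = []

    def insert(self, row):
        # Non-blocking acquire: if another write is in progress, fail immediately.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError("Could not insert record")  # assumed message
        try:
            time.sleep(0.001)  # pretend the write takes a little while
            self.rows.append(row)
        finally:
            self._busy.release()


def run_process(table, row, guard=None):
    """Run one 'process'; return True if the insert succeeded, False otherwise."""
    try:
        if guard is None:
            table.insert(row)
        else:
            with guard:  # wait for our turn instead of colliding
                table.insert(row)
        return True
    except RuntimeError:
        return False


def run_parallel(n_runs, workers, guard=None):
    """Run the process n_runs times on a pool; return (failures, rows written)."""
    table = Table()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda i: run_process(table, i, guard), range(n_runs)))
    return results.count(False), len(table.rows)


if __name__ == "__main__":
    print("single worker :", run_parallel(50, workers=1))
    print("8 workers     :", run_parallel(50, workers=8))
    print("8 + guard lock:", run_parallel(50, workers=8, guard=threading.Lock()))
```

With a single worker the runs never overlap, which matches the loop of 1000 executions that showed nothing. With several workers and no guard, some inserts will typically fail, and how many differs from run to run. With the guard lock in place, every insert waits for the previous one to finish, so none of them fail.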


Test environments

What to do when you are completely dependent on other companies who maintain parts of your environments? For a couple of clients I have been in this situation, and not being able to execute important tests is a pain. I was able to do parts of the tests, but the aim was an end-to-end test. At first I thought about mocking the missing part, but it provides complex functionality which is almost impossible to mimic. Creating my own XML files as answers to the requests allowed us to check some of the end-to-end functionality, but I could not recreate all the XMLs that were needed.
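
As a rough sketch of what answering requests with hand-made XML can look like, the Python snippet below returns a canned reply based on the root element of the incoming request. The request and response names (GetOrderStatus, OrderStatusResponse) and the stub_reply function are invented for illustration; the real messages were far more complex.

```python
import xml.etree.ElementTree as ET

# Hypothetical canned answers, keyed by the root element of the request.
CANNED_RESPONSES = {
    "GetOrderStatus": (
        "<OrderStatusResponse><status>SHIPPED</status></OrderStatusResponse>"
    ),
}


def stub_reply(request_xml):
    """Return the canned XML answer for a request, based on its root tag."""
    root = ET.fromstring(request_xml)
    try:
        return CANNED_RESPONSES[root.tag]
    except KeyError:
        # Make gaps in the stub visible instead of silently returning nothing.
        raise LookupError(f"no canned response for request type {root.tag!r}")


if __name__ == "__main__":
    print(stub_reply("<GetOrderStatus><id>1</id></GetOrderStatus>"))
```

A stub like this only covers the requests someone has written an answer for, which is exactly where the approach ran out of steam: every missing XML is a request type the end-to-end test cannot exercise.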
My work changed from testing functionality to creating and updating documentation. This actually gave me extra knowledge about some parts that had already been tested, so I went back to my testing role and updated the test scripts. Three days later there was finally a working test environment, so I executed the tests and tried to make up for some lost time. That was fun while it lasted… After two hours of testing, the environment was broken again and support did not have time to look at it. Does anyone recognise this situation? I’m wondering how companies can get away with this and how their clients can get them to do their work. Simply switching to another company to supply the environment is not an option; the vendor lock-in is real…