Database Reliability- Part 2 (Nov 28)

When writing new code for our estimating and accounting software, we have a rule of thumb. When it first works, it’s 1/3 done. When it works for everything the programmer can think of to break it, it’s 2/3 done.  After it survives testing by people actually using it, it (eventually) reaches 100%.

We built the new database for Goldenseal Pro more than a year ago, but it was bare bones, and really just the first 1/3 of the work. It was sufficient for initial interface testing, and to make sure the basic approach was valid. The past couple weeks we added the rest of what a database needs, so it’s now at the 2/3 mark. The final 1/3 will be polishing, tweaking and testing, done gradually over the next few months. The final test will be using it to handle our TurtleSoft business accounting for a while.

Our primary goal for the new database code is to make it extremely reliable. We suspect that there are still a few subtle bugs remaining in the NeoAccess database that runs Goldenseal 4.9x, and we want to avoid that for Goldenseal Pro.

One specific way to increase reliability is to simplify the basic structure of the database.

NeoAccess used something called a binary tree (or b-tree) to store the location of each record within the file. B-trees are optimized to access records via binary search, which is the most efficient way to find anything. It only takes log-2 time to find any record in a sorted list. That means a binary search in a thousand records only takes 10 steps, a million records only takes 20 steps, and a billion takes 30. Binary search is amazingly fast when you are indexing, say, all humans on Earth.

The problem is, b-trees are complicated. The branches need to be balanced and trimmed, and there are many ways to go wrong. If there are still bugs left in our modified version of NeoAccess, they are probably buried in the tree management code. It’s just too complex, so we never touched it.

For Goldenseal Pro we use simple lists, rather than b-trees. They are slower to search, but on our scale it’s only a matter of microseconds.  B-trees are gross overkill for small businesses with mere hundreds or thousands of records. Simple linear searches are easy to debug, and much less fragile. Even when a b-tree is appropriate, it’s far better to use a well-tested library, rather than “roll your own” code as in NeoAccess.

For large numbers of records, we still use a tree structure. But it is a very “broad” tree that rarely grows more than two levels deep. It’s more like a shrub or low-maintenance ground cover. Theoretically, the new database won’t be as efficient as a true binary tree. However, we much prefer increased simplicity and reliability, over a tiny performance gain.

It’s possible to do binary searches without a b-tree, and binary search is even built into the C++ standard library that we use. However, binary searches are harder to debug, so we will only add them in places where linear search is visibly slow. That’s part of the tweaking process.

************************

Another part of making code reliable is to write cautious code. That means we frequently “sanity check”, and give an error message if anything is amiss. It makes our development process much easier, since most bugs give a message that takes us right to the problem. If any bugs survive into the final version, users will get a warning, rather than a hidden problem that bites them later.

The main reason that Goldenseal 4.9x rarely crashes, is because it sanity checks first. There are 6,700 places in our code where we confirm that something really exists, and 1,600 places where we check for a reasonable value. There are more thousands of sanity checks built into the code logic, but we can’t count those easily.

Along with our usual cautious coding, Goldenseal Pro includes several layers of sanity checking to make sure the database stays healthy. I’ll talk more about that in a future post.

Dennis Kolva
Programming Director
TurtleSoft.com

 

Database Reliability (Nov 20)

TurtleSoft started out in 1987 with MacNail, a set of Excel templates for construction estimating. Later we added accounting and scheduling features. The MacNail software was extremely popular, and it still has a few hundred die-hard users. Unfortunately, we could only do so much with Excel macros and spreadsheets. MacNail was complicated to use, and difficult to support.

In 1989 we released a second estimating program, built atop Apple’s HyperCard. It had a much nicer human interface, but severe limitations on the back end. Among other problems, HyperCard introduced us to “race conditions”. That’s where the app has intermittent bugs, depending on which message path completes first (unpredictably). They are extremely hard to track down.

In the early 90s, Microsoft replaced the Excel macro language with Visual Basic, and Apple halted development on HyperCard. It was clearly time for us to move on.

For the next software generation, we looked at many database programs to build upon: FileMaker, Omnis, Double Helix, Sirius Developer, Prograph. Each was better in some ways than what we had already, but in some ways each was worse. Rather than build a mediocre app on top of some other software, we finally decided to write a “real” application in C++. Coding from the ground up takes longer, but allows full control. As it happens, it also prevented us from riding a platform to its death, since all but FileMaker are now long gone.

We were too inexperienced to write our own database code, so we tested several C++ database libraries, and settled on one called NeoAccess. It was quite popular in the mid 1990s. The database performed well, and wasn’t too difficult to build upon. Unfortunately, when Goldenseal was almost completed, we discovered that NeoAccess had serious bugs. Files eventually became corrupted, and all data was lost. The developers at NeoLogic never fixed the bugs, and eventually stopped answering emails. 

AOL 4.0 and Netscape Communicator also used NeoAccess. AOL abandoned it in their version 5.0. Netscape retired their software prematurely (and was soon swallowed by AOL). Dozens of other companies went through similar drama, thanks to an unreliable database engine.

We also considered abandoning NeoAccess, but the alternatives were just as buggy. So we put about 2 programmer-years into rewriting their code to make it more reliable. We learned a lot in the process, which is why we decided to build our own database for Goldenseal Pro. You might say that bugs that you create yourself are much easier to fix, than ones made by other people.

The first step in writing reliable software is to make the code maintainable. That means simple logic, clear comments, and easily readable code that anyone can understand. NeoAccess was completely the opposite, and that was its primary flaw. To fix it, we stepped through their code hundreds of times, and gradually rewrote it to make more sense. A few of the bugs were simple logic errors that finally stood out when the code was less cryptic. Some of the bugs were in code that was so convoluted that we just gave up, and rewrote it entirely.

BTW most open-source code is equally unreadable. Whenever you hear in the news about a major security breach, the root cause was probably some important open-source code that nobody understands. If other programmers can’t read it, then they can’t fix it or improve it.

There are more specific methods we are using to make the database more reliable in Goldenseal Pro. NeoAccess taught us a lot about database design, both from its good parts and its bad. I’ll cover them in more detail, in a future post.

Meanwhile, the past couple weeks we have been working on the new database code for Goldenseal Pro. It is mostly working, but probably still needs another week to finish.

Dennis Kolva
Programming Director
TurtleSoft.com

 

Goldenseal Pro Progress Report (Nov 10)

Last week, Goldenseal Pro saved record changes to disk for the first time. It was very exciting to enter data, quit, reopen the file, and see the changes still there. This may not seem like much, but it is a major milestone. It’s the last big connection needed between the new human interface code, and the business logic that runs our estimating and accounting software.

Since then, we have been working on the database code, which manages how records are stored on the hard drive.  The new code we wrote last year has performed well while importing existing Goldenseal files, and then viewing records. However, it needs more bells and whistles to properly manage changes, deletions, and other daily usage.

We often complain about the NeoAccess database library used in Goldenseal, but it did have some very good features. We had to fix a few serious bugs in the late 1990s, but since then it has run thousands of company’s files, with few problems.  As we write the code that replaces it, our goal is to keep the good features from the old database, but make it more reliable, and easier to maintain.

When Goldenseal 4.x opens any record, it reads the data from hard drive to RAM. It then stays “cached”, either until memory got low, or until our code deliberately purges the cache. Reducing disk accesses makes many operations run much faster.

Goldenseal Pro uses a similar cache system. The main difference is that we simplified the list of cached records, so it is easier to troubleshoot. The same list also handles “dirty” records, as described in the previous post.

Every database has to decide where to put new and changed records during a save. The easiest place is at the end of the file. Unfortunately, that eventually results in a huge file that is mostly holes. For example, FileMaker used to save everything at the end (and maybe still does). Files can grow to 10s or 100s of megabytes, and eventually need a compression step to get back to a reasonable size.

NeoAccess solved that problem with a list of gaps between records. That way, new or changed records could go into the empty spaces. Unfortunately, their code was hard to debug, and we suspected that it sometimes wrote records on top of existing ones (extremely bad news). In 2002 we replaced their code with a File Manager to track every record and every empty space. That system worked well, but it used an extremely large record that sometimes crashed, if memory was too low.

For Goldenseal Pro, we’ve split the original file manager into separate, smaller Sector Managers. Each tracks records in just one portion of the file (currently 32,000 records). There is also a separate Gap Manager that tracks all empty spaces. It decides where to put each record, when it is saved. The sector managers have been working great for over a year, and we are refining the gap manager now. It’s tricky, because the manager needs to combine gaps that touch, and some records work better with empty space before or after.

Goldenseal Pro stores the location of every record in two different places: in the main indexing system, and in the sector managers. That will allow the new database to recover from file corruption, by fixing the one that is wrong.  There are places in Goldenseal 4.x files where a single bit change can totally trash the file. In Pro, recovery will be possible from almost all types of “bit rot”.

Dennis Kolva
Programming Director
TurtleSoft.com

 

 

 

 

 

 

Dirty Records (Nov 2)

At heart, Goldenseal is a database program. It stores your company data on a hard drive so you can use it later. One important part of that process is knowing which data is “dirty”. In programmer talk, that means a record has changed, so it needs to be saved. After it’s saved, it becomes “clean” again. It’s hard to say whether the origins of the term are from laundry, or religion!

The old NeoAccess database had a true/false “dirty bit” in each record, to mark whether it had changed. It also had a “busy bit” for items currently in use. At every save, it looked through every record stored in RAM, and saved anything that was dirty but not busy. The system usually worked OK, but it was extremely hard to troubleshoot. If something didn’t work right, it meant a half an hour in the debugger, stepping through many hundreds of code lines.

We ended up adding internal lists of busy and dirty records, just so we could see what was going on more easily. Those lists worked so well that they now replace the old NeoAccess system entirely, in our new database code.

When you first look at an existing record, it starts out clean. Then it turns dirty the moment you change anything. To make that work, each data entry field has to send a message, and change the record status to dirty. Unfortunately, the existing code is a hodge-podge. There is makeDirty, setDirty and MakeDirty (case matters), and there are bits stored in 3 different places. It’s partly because different programmers reinvented the wheel, partly because NeoAccess was ugly, and partly because PowerPlant didn’t separate view and controller functions.

When we “refactor” something like this, it’s a bit like knocking out support members one by one, to see if the building collapses. If it does, we hit undo and decide whether to keep it, or rebuild it. If not, we probably can remove that code, or merge it with something else.

This all started because we need a quick way to decide whether the Save button should be grayed-out, or not. However, the dirty status has important consequences in the multi-user version, where more than one person might change the same record. It’s worth spending a day or two to redesign the system, so it’s more understandable.

Dennis Kolva
Programming Director
TurtleSoft.com