As an application security analyst, one of my responsibilities is studying commonly made (and easily preventable) programming mistakes that result in potential security risks. In my experience, some of the most common flaws come from improper validation of data read from files. In most cases the programmer has had SOME foresight, and it is rare to see data from a file used with NO verification (although it does happen occasionally). The problem, though, is that the check usually only tries to ensure the overall length of the data, and given the wide variety of uses for data stored in files, that alone is often insufficient.

The most obvious issue with file data is making sure it doesn't overflow an input buffer, and most standard file input routines have some easy way of limiting the amount of data read. For example, the C standard library's fgets() takes an argument specifying the size of the destination buffer; it reads at most one byte less than that and always null-terminates. Making sure this input buffer is the appropriate size is easy enough. However, sometimes this buffer stores a long string that actually holds several values delimited by some character like a comma or semicolon, and the parsing of that data is usually where the issue becomes apparent.
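As a minimal sketch of that pattern (the file name users.dat is just a placeholder):

    #include <stdio.h>

    int main(void)
    {
        char line[256];                      /* fixed-size input buffer */
        FILE *fp = fopen("users.dat", "r");  /* hypothetical data file */
        if (fp == NULL)
            return 1;

        /* fgets() reads at most sizeof(line) - 1 bytes and always
           null-terminates, so the buffer itself cannot overflow. */
        while (fgets(line, sizeof(line), fp) != NULL) {
            /* ... parse the record ... */
        }

        fclose(fp);
        return 0;
    }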
There are any number of ways to tokenize such a string, but in my experience a series of calls to strtok() or a single call to sscanf() are the most common. The problem comes when the delimiters are assumed to be in particular positions. For example, say I have three values stored in a file representing an employee's username, the department they work in, and the day their account was created, for some type of remote access program. These values could be delimited by semicolons like this: myname;mydepartment;mystartdate
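A quick sketch of the strtok() approach on a record like the one above, tokenizing in place before anything gets copied:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char record[] = "myname;mydepartment;mystartdate";

        /* strtok() splits the record in place; each call returns a
           pointer to the next field, or NULL when none remain. */
        char *user = strtok(record, ";");
        char *dept = strtok(NULL, ";");
        char *date = strtok(NULL, ";");

        if (user && dept && date)
            printf("user=%s dept=%s date=%s\n", user, dept, date);

        return 0;
    }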
I would probably assume that the first value in that string, the "myname" value, would be of variable length and make sure a dynamically sized buffer stores it properly. However, I might have internally decided on a 20-character limit on department names, just to keep things terse and to keep verbose department names like "tangible internal resource location management" from messing up the carefully planned formatting in whatever GUI I might have designed. If this program has a data entry form for creating or editing users, it probably truncates the input or prompts the user to keep department names under the 20-character limit. The function that reads the stored data from the file might very well assume the data conforms to this limit and simply copy the string found after the first delimiter into a char array only large enough to hold 20 characters. Which leads us to the actual problem: if the file is corrupted or controlled by a malicious user, it could very easily contain malformed data after the first delimiter. If I were an attacker and gained control of the file, I could easily overflow the department name buffer with something like: a;12345678901234567890;01012010
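To make that concrete, here is a sketch of the flawed read path (with the hostile department field padded out beyond the string above so the overflow is unmistakable):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Attacker-controlled record: the department field ignores
           the 20-character limit the data entry form enforced. */
        char record[] = "a;1234567890123456789012345678;01012010";
        char dept[21];                    /* 20 characters plus the null */

        strtok(record, ";");              /* skip the username */
        char *field = strtok(NULL, ";");  /* department field */

        /* FLAWED: no length check before the copy, so any field
           longer than 20 characters writes past the end of dept. */
        if (field != NULL)
            strcpy(dept, field);

        printf("%s\n", dept);
        return 0;
    }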
Even if the total allowed length of this entire string is limited to a moderately small amount, making the username as short as possible still allows several bytes of data to be written past the end of the department name buffer. This isn't going to be exploitable for arbitrary code execution in every case, and even when it is, the effort required is certainly not trivial. At the very least, though, it could cause a segmentation fault or otherwise affect program stability, and depending on what the program does, that instability could open the door to other exploits. In any case, protecting against this type of issue is usually as easy as doing the same length checking when reading the data as when writing it to the file.
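The corrected read path is the same sketch with one added check, mirroring the limit the data entry form already enforces:

    #include <stdio.h>
    #include <string.h>

    #define DEPT_MAX 20  /* same limit the data entry form enforces */

    int main(void)
    {
        char record[] = "a;1234567890123456789012345678;01012010";
        char dept[DEPT_MAX + 1];

        strtok(record, ";");
        char *field = strtok(NULL, ";");

        /* Apply the same length check on the read path as on the
           write path: reject (or truncate) anything over the limit. */
        if (field == NULL || strlen(field) > DEPT_MAX) {
            fprintf(stderr, "malformed record: bad department field\n");
            return 1;
        }
        strcpy(dept, field);  /* safe: length verified above */

        printf("%s\n", dept);
        return 0;
    }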
Also worth mentioning: another common mistake is a numerical type mismatch when using atoi() to convert a string to an integer after reading it from a file. For example, the mystartdate value from the example above might be read in, passed to atoi(), and the result stored in an unsigned int without first checking whether it is negative; a negative value silently wraps around to a huge unsigned number.
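A sketch of a safer conversion using strtol(), which, unlike atoi(), reports parse failures and lets you reject negative or out-of-range input before the assignment (the field value here is just an illustrative stand-in):

    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *field = "-1";  /* stand-in for a hostile date field */
        char *end;

        errno = 0;
        long value = strtol(field, &end, 10);

        /* Reject everything atoi() would silently mangle: parse
           failures, trailing garbage, out-of-range values, and
           negative numbers that would wrap around in an unsigned. */
        if (end == field || *end != '\0' || errno == ERANGE ||
            value < 0 || (unsigned long)value > UINT_MAX) {
            fprintf(stderr, "invalid date field\n");
            return 1;
        }

        unsigned int date = (unsigned int)value;
        printf("%u\n", date);
        return 0;
    }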
Whatever the intended use of your input may be, even if you employ best practices to prevent data tampering, verifying individual pieces of data at both the reading and writing stages is a good defense-in-depth measure that can be taken with minimal effort.