Creating Random Access Text in C#

[Figure: Contents of the two files]

Back in April I looked at disk folders as a possible alternative to NoSQL or using a relational DB. My conclusion wasn’t encouraging—I was concerned about poor performance, especially on Linux.

The use case I examined was for a server with 100,000 to 1 million users. I wanted to store and retrieve text files for any user. Those files could vary in length from a few bytes to a few KB. Back in the dark ages, before the Web and before databases were in common use, I did a fair bit of this type of code. It was written in Turbo Pascal using binary files. Life was simpler then: there was no Unicode to contend with, just 8-bit ASCII chars.


But the world has moved on. In this article, I’ll be creating a class that stores proper C# strings in a data file using an index file for fast string access and retrieval.

The principle is straightforward: there's an index file and a data file. The index file is made up of n index records, where n is the number of strings to be stored. Each index record consists of two fields and is held in an instance of type IndexRecord. I've used UInt32 fields, which limits the files to 4GB.

public UInt32 offset { get; set; }   // where the string starts in the data file
public UInt32 length { get; set; }   // stored so a changed string can be compared for size

Now you might wonder why length is included, since BinaryWriter already stores a length prefix with each string it writes. The reason is that it's needed for comparison when changing a string: if the new string is the same length or shorter, the old offset is reused; if it's longer, the offset is changed to the end of the data file and the text is written there.
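To make that concrete, here's a sketch of the overwrite-or-append decision. It assumes the class-level FileStream (fs) and index manager (indices) introduced below; Change is an illustrative name, and MarkDirty is a hypothetical helper that flags the index for saving.

public void Change(int index, string text)
{
    var record = indices.GetIndex(index);
    var newLength = (UInt32)Encoding.UTF8.GetByteCount(text);
    if (newLength > record.length)
    {
        fs.Seek(0, SeekOrigin.End);               // longer: append at the end of the data file
        record.offset = (UInt32)fs.Position;
    }
    else
    {
        fs.Seek(record.offset, SeekOrigin.Begin); // same size or shorter: reuse the old offset
    }
    using (var bw = new BinaryWriter(fs, Encoding.UTF8, true))
    {
        bw.Write(text); // a length prefix followed by the UTF-8 bytes
    }
    record.length = newLength;
    indices.MarkDirty(); // hypothetical helper so Save() knows to rewrite the index
}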

Files and Streams

Making this work requires a FileStream, which exposes a file as a stream, along with a BinaryWriter and BinaryReader created on the fly to do the reading or writing for each string access.

This code, for instance, reads the entire index file into a List<IndexRecord>. The comparison Position != Length serves as an end-of-file check.

using (var fs = File.OpenRead(indexFilename))
{
    using (var br = new BinaryReader(fs))
    {
        // Position != Length acts as the end-of-file check
        while (br.BaseStream.Position != br.BaseStream.Length)
        {
            var indexRec = new IndexRecord
            {
                offset = br.ReadUInt32(),
                length = br.ReadUInt32(),
            };
            indexes.Add(indexRec);
        }
    }
}

The Indices class manages a List of IndexRecords and has a Boolean IsDirty property. This is set to true whenever an index record is added or changed. The Save method only rewrites the complete index to the index file if IsDirty is true.
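For illustration, here's a minimal sketch of what such an Indices class might look like. The constructor taking the index filename, the Count property and the MarkDirty helper are my own assumptions; the posted source may differ in its details.

using System;
using System.Collections.Generic;
using System.IO;

public class Indices
{
    private readonly string indexFilename;
    private readonly List<IndexRecord> indexes = new List<IndexRecord>();

    public bool IsDirty { get; private set; }

    public int Count { get { return indexes.Count; } }

    public Indices(string indexFilename)
    {
        this.indexFilename = indexFilename;
    }

    public IndexRecord GetIndex(int index)
    {
        return (index >= 0 && index < indexes.Count) ? indexes[index] : null;
    }

    public void Add(IndexRecord record)
    {
        indexes.Add(record);
        IsDirty = true;
    }

    public void MarkDirty()
    {
        IsDirty = true; // called when an existing record is changed
    }

    public void Save()
    {
        if (!IsDirty) return; // only rewrite the index file if something changed
        using (var fs = File.Create(indexFilename))
        using (var bw = new BinaryWriter(fs))
        {
            foreach (var rec in indexes)
            {
                bw.Write(rec.offset); // eight bytes per record: two UInt32s
                bw.Write(rec.length);
            }
        }
        IsDirty = false;
    }
}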

The TextRecords Class

This is the class that you instantiate. It encapsulates the index list, providing methods to Add, Change or GetString. Here the FileStream instance is managed differently from the example above: it's a class-level variable. Once opened, it's kept open by passing true for the optional leaveOpen parameter of the BinaryWriter and BinaryReader constructors. The default encoding is UTF8, but to set leaveOpen the encoding has to be stated explicitly.

using (var bw = new BinaryWriter(fs, Encoding.UTF8, true)) // true = leaveOpen
{
    bw.Write(text);
}
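Building on that pattern, an Add method might look something like this sketch; the method name, the int return value and the indices.Count property are assumptions on my part.

public int Add(string text)
{
    fs.Seek(0, SeekOrigin.End); // new strings always go at the end of the data file
    var record = new IndexRecord
    {
        offset = (UInt32)fs.Position,
        length = (UInt32)Encoding.UTF8.GetByteCount(text),
    };
    using (var bw = new BinaryWriter(fs, Encoding.UTF8, true))
    {
        bw.Write(text);
    }
    indices.Add(record);
    return indices.Count - 1; // the index of the newly added string
}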

Although it would seem that creating a BinaryWriter instance per call would slow things down, it runs very quickly, writing 100,000 short strings (about 500KB total) in just over a second on an i7 950 PC that's getting on for four years old.

The downside of keeping a FileStream open for the life of the class is that the class must implement the IDisposable interface and ensure that FileStream.Dispose() gets called. This is also a good place to call indices.Save() to write out the index file if it has changed. The code follows the recommended Dispose() pattern, which prevents the finalizer from calling the cleanup logic a second time. The finalizer is a rough equivalent of a C++ destructor and is called from the garbage collector; since when it runs is non-deterministic, explicitly disposing is better.

public class TextRecords : IDisposable
{
    // ... fields and methods ...

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this); // prevent the finalizer from running the cleanup twice
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposing)
        {
            indices.Save(); // write out the index file if it has changed
            try
            {
                fs.Dispose();
            }
            catch
            {
                // ignore errors while closing the stream
            }
        }
    }
}

The interplay of FileStream and BinaryReader/BinaryWriter can be a little confusing. FileStream has a Seek() method and a Position property: Seek() moves the file cursor to a specified point, and Position reports where the file cursor currently sits. With a new file, the cursor starts at 0 and moves along as the BinaryWriter writes. To read a string back, the offset is taken from the specified index record, then the BinaryReader reads the string from disk. GetString(), below, does exactly that.

public string GetString(int index)
{
    var record = indices.GetIndex(index);
    if (record == null)
    {
        return ""; // no such record
    }
    using (var br = new BinaryReader(fs, Encoding.UTF8, true))
    {
        fs.Seek(record.offset, SeekOrigin.Begin); // jump to where the string was written
        return br.ReadString();
    }
}

The FileStream Seek() method can seek relative to the start of the file, the end of the file, or the current cursor position. So moving the file cursor to the end of the data file is done with this:

fs.Seek(0, SeekOrigin.End);

Whereas the strings are located using the index record’s offset, relative to the start of the data file like this:

fs.Seek(record.offset, SeekOrigin.Begin);
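Putting the pieces together, using the class might look like this hypothetical snippet; the TextRecords constructor signature, the file names and the Change method name are assumptions rather than details confirmed by the posted source.

using (var records = new TextRecords("strings.dat", "strings.idx"))
{
    var id = records.Add("Hello, world");
    records.Change(id, "Hello again, world"); // longer, so it gets appended
    Console.WriteLine(records.GetString(id));
} // Dispose() saves the index file and closes the FileStream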

It’s Getting Larger!

Every time a string is changed and the new string is larger than the old, it gets added to the end of the data file and the index is updated to point at it. The space the old string occupied remains, so over time the data file grows larger and larger. This doesn't even require new strings to be added: if you swapped two strings of different lengths, the longer one would be written at the end of the data file.

To recover this lost space, the TextRecords.Compress() method has been added. This copies the data file, string by string, into a temporary file. It then saves out the indices (now pointing into the temporary data file), disposes of the FileStream object, renames the old data file to a backup, reloads the indices and reopens the FileStream object.
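In outline, that copy-and-swap might look like the sketch below. It reuses the assumed fs, indices and MarkDirty names from earlier, adds a hypothetical dataFilename field, and uses File.Replace for the backup rename; the posted source may well organize this differently.

public void Compress()
{
    var tempFilename = dataFilename + ".tmp"; // hypothetical naming
    using (var temp = File.Create(tempFilename))
    using (var bw = new BinaryWriter(temp, Encoding.UTF8, true))
    {
        for (int i = 0; i < indices.Count; i++)
        {
            var record = indices.GetIndex(i);
            fs.Seek(record.offset, SeekOrigin.Begin);
            string text;
            using (var br = new BinaryReader(fs, Encoding.UTF8, true))
            {
                text = br.ReadString();
            }
            record.offset = (UInt32)temp.Position; // point the index at the copied string
            bw.Write(text);
        }
    }
    indices.MarkDirty();
    indices.Save(); // the indices now describe the compacted file
    fs.Dispose();   // release the old data file
    File.Replace(tempFilename, dataFilename, dataFilename + ".bak"); // keep the old file as a backup
    fs = File.Open(dataFilename, FileMode.Open, FileAccess.ReadWrite);
}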

The C# 5.0 source code for this has been posted on SourceForge. It consists of the project files, including two source files.


Comments

  1. BY IRNÉ says:

    Looks like you re-invented a flat indexed database. Similar to the old mbox format. Just a few niggly points though (from my own experience using these types of things):

    (1) with many users that 4GB size limit’s going to be a migraine … see if you can’t use longs instead for indexing. Or even add a 2nd key as file# so you can add numerous 4GB files as the blob storage.

    (2) That compact idea sounds good, though it tends to need to be used more often than not (especially if point 1 above isn’t followed). And this is very time consuming, making your performance drastically slower since no clients may be served while the compact is running.

    (3) Related to point 2 you end up with free-space fragmentation. Especially since your lengths are variable and they need contiguous space you’re basically inventing a disc allocation system (similar to FAT). Even if you then state any text will only be in groups of X bytes with a field stating the next group (if any) you end up with the exact same issues FAT had with exorbitant fragmentation. Remember that this is on top of the possible fragmentation from the OS’s disc allocation systems also.

    (4) To avoid the OS’s fragmentation you might want to allocate one complete file (i.e. 4GB at once) and then keep track of empty space yourself. Growing and appending to files is a sure-fire way to make the OS fragment your 4GB file into 100s of little portions scattered throughout the disc.

    I’d definitely rather look at implementing it through a decent DB using text blobs instead. Something like SQLite should be adequate for not too many clients, else MySQL/Postgre/Firebird I’ve found to be very decent in performance even with many many concurrent users.

  2. BY Richard Morgan says:

    A simple solution is to use one of the oldest systems (base date is in 1969) called Pick or Universe or Unidata. This environment, which runs on several varieties of Unix and/or Linux, has the solution already built in. Files in Pick (everything that follows applies to all 3 varieties) do not need indexing to directly address the data. Pick files are hashed, which means that the system can directly access any record by its ID without needing any kind of index. Records in Pick files are variable length, as are data fields within the record.

    If you assign an ID that is the most heavily used key to the data record, it will pull up a data record with a single disk access. For basic name access, many systems use a key that is 3 letters of the last name, 4 letters of the first name, and a numeric tie-breaker.

    Pick systems also have indexes, which are handy if you want to pull up records by alternate keys. You can also search whole files with a command line similar to SQL. Programming in Pick is in a very nice extended variety of Basic, which treats all data as variable-length strings.

    Universe has a neat feature that gets around the file size limits of Unix. You can create a file type that is actually a list of files. Universe treats the list of files as one big file, but you can also access the individual files directly.
