
Archive for September, 2013

A small adventure in RTF

I need to repair the RTF comments in some 48,000 records in a database. These were imported from another database some years ago, and the original data is no longer available. The import process damaged at least some of them, generally by copying into a container that was too small, which truncated the RTF. On my form, I have placed a TMemo (memPlain) and a TRichEdit (memRich). I fill them with this code:

procedure TfrmMain.FillMemoes;
var
  str: TMemoryStream;
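  s: string;
  raw: AnsiString;
begin
  str := TMemoryStream.Create;
  try
    s := dsLookup.DataSet.FieldByName( 'vComment' ).AsString;
    Label2.Caption := s;
    memPlain.Lines.Text := s;
    // RTF is byte-oriented. Convert first so that Length counts bytes even
    // where string is UTF-16 (Delphi 2009 and later); otherwise only half
    // of the buffer would be written.
    raw := AnsiString( s );
    str.WriteBuffer( PAnsiChar( raw )^, Length( raw ) );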
    str.Position := 0;
    memRich.Lines.LoadFromStream( str );
  finally
    str.Free;
  end;
  if not chkDisableRepairs.Checked then
    MakeRepairs;
end;

The MakeRepairs function is simple:

function TfrmMain.MakeRepairs: Boolean;
begin
  Result := False;
  if CheckPlainAndFix then
    Exit( True );
  if CleanRTFAndFix then
    Exit( True );
end;

A little function to make things a bit less cluttered:

function TfrmMain.HasRTFDelims(const text: string): Boolean;
begin
  Result := ( PosEx( '{', text ) > 0 ) or
            ( PosEx( '}', text ) > 0 ) or
            ( PosEx( '\', text ) > 0 );
end;

At this point, I must say that I had gotten ahead of myself. I wound up abandoning CheckPlainAndFix, and instead added some visual tools to let me learn about the actual problems in the data. What I learned changed my direction entirely.

There are 48,799 records in the database. Of these, it turned out that RTF was damaged in 59 of them. Annoying, but a much smaller incidence of problems than my client thought existed. My suggestion was that I could simply manually correct these few, and save the updates. It made little or no sense to code a solution, both because so few records were involved, and because among that small number, there were several different pathologies observed, at least one of which would have required a good deal of experimentation to resolve.

My client countered with the decision that we would make no repairs. These data are a few years old. It may be that the users will never open them, and as the reporting from the app is done for only the current year, they will have no impact there.

Lessons (re)learned:

  1. Next time I am told that we “have a big problem”, I shall go no further until I have measured the actual magnitude of the problem.
  2. Coding any sort of fix, however minor, for any sort of encoded stream prior to completing step 1 is just silly.

Those points are pretty fundamental, and I had certainly learned them years ago. But in this case, my client told me a) that the damage was widespread and b) that he had tried to apply some repairs in SQL, but with little success. That said to me that he had done some analysis, and that it was a substantial problem in need of a clean solution. However, as I tripped over issues in debugging my code, I began moving toward trying to quantify the problem and, if necessary, to sort the pathologies into categories by how difficult a repair might be to code. With a total of only 59 damaged records, very little coding would have been justified.

In the final analysis, the only routines with value in my little project were:

  1. The small routine which recognized RTF delimiters in the visible text of the TRichEdit.
  2. The code I added to count the number of damaged records.
  3. The visual items added which let me see the list of damaged records and click on each ID to reload the TRichEdit, allowing very rapid determination of pathologies.

Therefore, I have not developed any sort of RTF code repair tool. The need may surface someday, or I may decide to pursue it on my own, for the experience, but my client has no need of it.
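
As an illustration of the counting step only, here is a minimal console sketch. It is not the code from this project, which relied on spotting RTF delimiters in the visible text of the TRichEdit. It assumes a simpler heuristic instead: truncated RTF usually ends with unclosed groups, so a record is counted as damaged when its braces do not balance. The function name RtfBracesBalanced and the sample strings are hypothetical, and a real run would loop over the dataset rather than a constant array.

```pascal
program CountDamagedRtf;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Assumed heuristic: RTF is considered intact when every '{' has a
// matching '}'. Escaped braces (\{ and \}) are skipped, not counted.
function RtfBracesBalanced(const Raw: string): Boolean;
var
  i, depth: Integer;
begin
  Result := False;
  depth := 0;
  i := 1;
  while i <= Length(Raw) do
  begin
    case Raw[i] of
      '\': Inc(i);        // skip the character that follows the backslash
      '{': Inc(depth);
      '}':
        begin
          Dec(depth);
          if depth < 0 then
            Exit;         // closing brace without an opener
        end;
    end;
    Inc(i);
  end;
  Result := depth = 0;
end;

const
  // Made-up sample records: intact, truncated, intact with an escaped brace.
  Samples: array[0..2] of string = (
    '{\rtf1\ansi Hello}',
    '{\rtf1\ansi{\fonttbl{\f0 Arial;}}Trunc',
    '{\rtf1\ansi A literal \{ brace}'
  );

var
  i, damaged: Integer;
begin
  damaged := 0;
  for i := Low(Samples) to High(Samples) do
    if not RtfBracesBalanced(Samples[i]) then
      Inc(damaged);
  WriteLn('Damaged records: ', damaged, ' of ', Length(Samples));
end.
```

A check like this only catches truncation that leaves groups open. It says nothing about the other pathologies, which is why looking at the records themselves was still necessary.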

On the other hand, one of the discoveries I did make was that a very small number of users had copied and pasted from Word to the app which originally managed the data. As might have been anticipated, Word exports in RTF (just as it does in HTML) a rather large number of elements which it would be nice to remove. Dozens of RGB color specifiers, for example. Now that may well be a project for me to undertake at some point. Even in the records affected in this way, the colors had not been used, so there is no reason whatever to retain them. It is just MS-bloat.
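
A rough sketch of what such a cleanup might start from, under stated assumptions: the color specifiers live in a single {\colortbl ...} group, that group contains no nested braces, and nothing in the text refers to the colors. The function name StripColorTable and the sample string are hypothetical, and none of this has been run against the real data.

```pascal
program StripColorTableDemo;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Removes the first {\colortbl ...} group, if one is present.
// Assumption: the group has no nested braces, so the first '}' after
// the marker closes it. If no closing brace is found, the text is
// returned unchanged rather than damaged further.
function StripColorTable(const Rtf: string): string;
const
  Marker = '{\colortbl';
var
  startPos, endPos: Integer;
begin
  Result := Rtf;
  startPos := Pos(Marker, Rtf);
  if startPos = 0 then
    Exit;                               // no color table at all
  endPos := startPos + Length(Marker);
  while (endPos <= Length(Rtf)) and (Rtf[endPos] <> '}') do
    Inc(endPos);
  if endPos > Length(Rtf) then
    Exit;                               // group never closes: leave as-is
  Delete(Result, startPos, endPos - startPos + 1);
end;

begin
  WriteLn(StripColorTable('{\rtf1{\colortbl;\red255\green0\blue0;}Hi}'));
end.
```

If any \cf or \highlight control words remain in the body, they would then point at a table that no longer exists, so a real tool would want to check for them before deleting anything.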
