
Thursday, August 15, 2013

Soundex Explained


NOTE: This is an update to an article I wrote many years ago. It includes some new information and several links to newer web pages.

Many genealogy records are indexed by a high-tech algorithm called the Soundex Code. Well, it was “high tech” in 1918 when Robert Russell invented it. In a nutshell, Soundex Codes provide a means of identifying words, especially names, by the way they sound. They were used extensively by the U.S. Work Projects Administration (WPA) crews working in the 1930s to organize Federal Census data from 1880 to 1920. Soundex has also been used for many state and local census records and is very popular in genealogy software and databases. If you are a genealogist, it won't be long until you encounter Soundex.


Motor vehicle bureaus in the District of Columbia, Maryland, Michigan, Minnesota, and Missouri employ Soundex for generating the initial characters of the identification numbers on driver's licenses. The Canadian Centre for Justice Statistics uses Soundex to encode names in its crime surveys and maintain the anonymity of individuals about whom data is collected.

In the days when nearly all of the data for the Census of Population was collected by enumerators who walked from door to door, it was discovered that many of these people spelled surnames phonetically. Thus, one might spell Smith as "Smith" while another might spell it as "Smyth" and still another "Smythe." The census records were therefore indexed by the sound of each name rather than by its spelling, and Soundex was the code system used to organize this index.

If you search many records of interest to genealogists, sooner or later you will need to use Soundex Codes. Why? Well, you can often find a person’s entry by his or her Soundex Code, even when the names have been misspelled. This becomes important when you realize that many census takers did not speak the language of the people being enumerated. In fact, in the first 150 years of U.S. census records, the majority of Americans were illiterate and did not know how to write their own last names. The spelling of many family names also has changed over the years, but often the Soundex Code remains the same.

Spelling of names varies widely in early records, especially when language difficulties have intervened. For instance, I could not find my French-speaking great-grandparents listed in the 1910 U.S. Census. I searched and searched, but never found any entries for Joseph and Sophie Theriault. Then I decided to do a Soundex search. The Soundex Code for Theriault is T643. When searching by Soundex Code, I found several entries for T643 in Ashland, Maine, including one for the family of Joseph and Sophia Tahrihult: improperly spelled, but with the same Soundex Code.

The census taker had a Scottish name, and he was listed on another census page in the same town as being born in Scotland. I am guessing that he did not speak French. I bet he had some difficulty when speaking with my great-grandparents, neither of whom spoke English, and neither of whom could read or write. The census taker undoubtedly spoke with a thick Scottish accent. No wonder Theriault became Tahrihult!

The Soundex Code is not difficult to learn although I still use a small reference card when I go to the archives to look at records. Every Soundex Code consists of a letter and three numbers, such as W-252. The letter is always the first letter of the surname, and the hyphen is optional. The numbers are assigned to the remaining letters of the surname according to the Soundex guide shown below. If necessary, zeroes are added at the end to produce a four-character code. Additional letters are disregarded.

Here is the Soundex Coding Guide:

Each number represents letters:
1 = B, F, P and V
2 = C, G, J, K, Q, S, X and Z
3 = D and T
4 = L
5 = M and N
6 = R

Disregard the letters A, E, I, O, U, H, W, and Y.

Here are some of the simpler examples:

Washington is coded W252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded).

Lee is coded L000 (L, there is no Soundex Code for E, so the numbers 000 are added).
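
For readers who like to see the rules as code, here is a minimal Python sketch of the basic rules only: keep the first letter, look up the remaining letters in the coding guide, skip the disregarded letters, and pad with zeroes. The names GUIDE, DIGIT_FOR, and basic_soundex are my own illustrations, not part of any official tool, and this sketch deliberately ignores the more complex rules, so it will give wrong codes for names those rules affect.

```python
# Soundex coding guide: digit -> letters that share it.
GUIDE = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the guide into a letter -> digit lookup table.
DIGIT_FOR = {letter: digit for digit, letters in GUIDE.items() for letter in letters}


def basic_soundex(surname: str) -> str:
    """Apply only the basic rules (hypothetical helper; complex rules not handled)."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("surname must contain at least one letter")
    # A, E, I, O, U, H, W, and Y have no entry in the table, so they are skipped.
    digits = [DIGIT_FOR[ch] for ch in letters[1:] if ch in DIGIT_FOR]
    # Keep at most three digits and pad with zeroes when the name runs short.
    return letters[0] + "".join(digits)[:3].ljust(3, "0")


print(basic_soundex("Washington"))  # W252
print(basic_soundex("Lee"))         # L000
```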

Now let’s move on to some of the more complex rules:

Any double letters in a name are treated as one letter. For example:

Gutierrez is coded G-362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

If the surname has different letters side-by-side that have the same number in the Soundex coding guide, they are treated as one letter. Examples:

Pfister is coded as P-236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded as J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added).

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored, 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

Names with Prefixes

If a surname has a prefix, such as Van, Con, De, Di, La, or Le, the code should ignore these prefixes. However, coders sometimes miss this rule, so they might assign the Soundex code either with or without the prefix. Because the surname might be listed under either code, a thorough search of the Soundex index should include both forms. Note, however, that Mc and Mac are not considered prefixes, according to the National Archives and Records Administration. Once again, however, not everyone knows this particular rule, so you might want to search both with and without the Mc or Mac coded.

VanDeusen might be coded two ways:
With the prefix included, V-532 (V, 5 for N, 3 for D, 2 for S)

or

With the prefix excluded, D-250 (D, 2 for the S, 5 for the N, 0 added).
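
Since a thorough search should try both forms, a small helper can produce both codes at once. The Python sketch below is only an illustration under some assumptions of mine: the function names are hypothetical, the compact encoder applies the coding rules described in this article, and a prefix is recognized only when it is followed by a capital letter (as in "VanDeusen" or "De Luca"), so that ordinary mixed-case names such as Lee are not mistaken for "Le" plus a surname. A name typed all in lowercase or all in capitals would need different handling.

```python
import re

# Letter -> digit table built from the Soundex coding guide.
DIGIT_FOR = {
    letter: digit
    for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                           "4": "L", "5": "MN", "6": "R"}.items()
    for letter in letters
}


def soundex(surname: str) -> str:
    """Compact encoder following the rules in this article (illustrative)."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("surname must contain at least one letter")
    digits = []
    previous = DIGIT_FOR.get(letters[0])
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        digit = DIGIT_FOR.get(ch)
        if digit is not None and digit != previous:
            digits.append(digit)
        previous = digit             # a vowel resets this to None
    return letters[0] + "".join(digits)[:3].ljust(3, "0")


# A listed prefix, optional spaces, then a capital letter starting the surname.
PREFIX = re.compile(r"(?i:van|con|de|di|la|le)\s*(?=[A-Z])")


def codes_to_search(surname: str) -> list:
    """Return the code with the prefix and, if one is found, without it."""
    codes = [soundex(surname)]
    match = PREFIX.match(surname)
    if match:
        without_prefix = soundex(surname[match.end():])
        if without_prefix not in codes:
            codes.append(without_prefix)
    return codes


print(codes_to_search("VanDeusen"))  # ['V532', 'D250']
print(codes_to_search("Lee"))        # ['L000']
```

Only one prefix is stripped, which matches the two codings shown for VanDeusen above.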

Consonant Separators

If a vowel (A, E, I, O, U) separates two consonants that have the same Soundex Code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

The “H & W Rule”

If "H" or "W" separates two consonants that have the same Soundex Code, the consonant to the right of the "H" or "W" is not coded. Example:

Ashcraft is coded A-261 (A, 2 for the S, C ignored, 6 for the R, 1 for the F). It is not coded A-226.
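
Putting the double-letter, side-by-side, consonant-separator, and H & W rules together, the whole procedure can be sketched in a few lines of Python. This is my own illustration rather than any official implementation; the names DIGIT_FOR and soundex are hypothetical, and treating Y the same way as a vowel separator is an assumption on my part, since the rule as stated above only lists A, E, I, O, and U.

```python
# Letter -> digit table built from the Soundex coding guide.
DIGIT_FOR = {
    letter: digit
    for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                           "4": "L", "5": "MN", "6": "R"}.items()
    for letter in letters
}


def soundex(surname: str, hyphen: bool = False) -> str:
    """Encode a surname using the rules described in this article (a sketch)."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("surname must contain at least one letter")

    digits = []
    # The first letter is kept as a letter, but its digit still counts when
    # deciding whether the next consonant is a repeat (the F in Pfister).
    previous = DIGIT_FOR.get(letters[0])
    for ch in letters[1:]:
        if ch in "HW":
            # H & W rule: skip the letter but remember the previous digit,
            # so the C in Ashcraft is treated as a repeat of the S.
            continue
        digit = DIGIT_FOR.get(ch)
        if digit is None:
            # Vowel (Y is assumed to behave the same way): it separates
            # consonants, so the next consonant is coded even if it repeats.
            previous = None
            continue
        if digit != previous:
            digits.append(digit)
        previous = digit

    # Three digits only: extra letters are dropped, short codes get zeroes.
    code = "".join(digits)[:3].ljust(3, "0")
    return letters[0] + ("-" if hyphen else "") + code


for name in ["Gutierrez", "Pfister", "Jackson", "Tymczak", "Ashcraft"]:
    print(name, soundex(name, hyphen=True))
# Gutierrez G-362
# Pfister P-236
# Jackson J-250
# Tymczak T-522
# Ashcraft A-261
```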

The Soundex Indexing System web page on the National Archives site has been updated to include this previously "lost" rule. Not all documents use this extra rule, however. Use the National Archives' Soundex page as your definitive source. The genealogical community owes a special thanks to Tony Burroughs, who researched and rediscovered the original Soundex instructions used by the Census Bureau.


American Indian and Asian Names

A phonetically-spelled American Indian or Asian name was sometimes coded as if it were one continuous name. If a distinguishable surname was given, the name may have been coded in the normal manner. For example, Dances with Wolves might have been coded as Dances (D-522) or as Wolves (W-412), or the name Shinka-Wa-Sa may have been coded as Shinka (S-520) or Sa (S-000).

While the rules sound a bit complex, they do become easier with a bit of practice. For those of us who are too lazy to go through the coding exercise, the computer age has brought many new tools. Most modern genealogy programs will tell you the Soundex Code of any name that you enter. In addition, a number of online Soundex Machines are available, including those at http://www.eogn.com/soundex and http://resources.rootsweb.ancestry.com/cgi-bin/soundexconverter. On any of these sites, you can type in a last name, and the site will display the correct Soundex Code. Yet Another Soundex Converter (YASC) at http://www.bradandkathy.com/genealogy/yasc.html will even convert a long list of names to their Soundex equivalents; you do not have to enter them one at a time.

NOTE: You can find many more Soundex converters online, but many of them do not follow the "H & W" Rule. To test them, enter a name of Ashcraft. It should produce a Soundex code of A-261. If the software produces some other code, don’t trust it.
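
If you have a Soundex routine in a program or script of your own, the same Ashcraft test is easy to automate. The Python sketch below is only an illustration: follows_hw_rule is a hypothetical helper of mine, and the two lambda functions are stand-ins for whatever converter you actually want to check.

```python
def follows_hw_rule(converter) -> bool:
    """Return True if the given converter codes Ashcraft as A-261."""
    code = converter("Ashcraft")
    # The hyphen is optional, so ignore it (and letter case) when comparing.
    return code.replace("-", "").upper() == "A261"


# Hypothetical stand-ins for two converters under test.
print(follows_hw_rule(lambda name: "A-261"))  # True
print(follows_hw_rule(lambda name: "A226"))   # False
```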

The National Archives and Records Administration publishes the definitive web page explaining the coding of Soundex at http://www.archives.gov/genealogy/census/soundex.html

While Soundex is a great tool and in widespread use, it certainly is not perfect. For example, it fails for names that sound the same but have different first letters. For instance, Knowles is coded as K542 while both Noles and Nolles are N420. Likewise, Cantor is C536 while the similar sound of Kantor is K536.

Soundex also has a number of shortcomings when dealing with Eastern European Jewish names. Two Jewish genealogists, Randy Daitch and Gary Mokotoff, developed a more sophisticated system, more suitable for Jewish genealogy. The Daitch-Mokotoff Soundex is becoming the de facto standard for on-line lookups on Jewish-related web sites. You can read more about the Daitch-Mokotoff Soundex in an article written by Gary Mokotoff at
http://www.avotaynu.com/soundex.html

Numerous other improved Soundex methods have been developed in recent years and are widely used in computer databases. The newer methods are considerably more accurate, and they typically use more than one letter and three numbers. However, they have never seen much use in genealogy databases.

Now, have fun with census records!

 

The entire article on the subject can be found at: http://blog.eogn.com/eastmans_online_genealogy/2010/08/soundex-explained.html

 

Poster's comments:

There are many links on this subject, including online Soundex calculators.

The point of the story is well taken. I have a grandmother with a maiden name of Creswell, but the tombstones at the Creswell Family Cemetery also show the spellings Cresewell and Chriswell.
