Part 1: Data Acquisition and Conditioning
I'm a family history buff. I manage a tree on ancestry.com, synched with the desktop app Family Tree Maker, or FTM. This gives everyone in my family access to the tree online, while I use FTM for ease of maintenance and for publishing charts and reports. But of course, I'm also drawn to Excel to customize the data presentation. I built a report in FTM to export data to a .csv file to get me started in Excel.
The concept in mind is a timeline, showing lifespan bars across year columns. FTM has a color-code system to identify ancestral lines, so I'll carry that over. I'll also highlight columns with notable public events and leaders, for context. Finally, I'll add slicers for easy filtering.
Whenever I export or download data from one app to another, I am very conscious of the risk I'm taking; is this a one-time trip? will I need to maintain the data in two places? can I make the export/import process routine? As much as possible, I keep the downloaded data in the same format and contiguous range it came in, so that it can be updated. Of course, over time the number of rows will change, but as long as the columns match, I should be able to repeat the process efficiently. Typically, I will just copy and paste from the csv download to the processing table.
My first task then is to build a report in FTM with the column set that will satisfy my needs now and in the future. I pick nine fields: name, sex, birth date, birthplace, death date, death place, father, mother, spouse. With this report open in Excel, I format it as a table and begin inspecting the data.
Building the Import Processing Table
By looking at the data I begin to get a sense of what columns I'll need to add to get consistent output for graphing, and as the wheels turn I think of another output table as well to display generations for a select individual. The import table then is for processing the raw data into usable format. This I break into two stages, cleaning and enhancing.
Cleaning refers to normalizing of the imported data, which entails modifying both names and dates; names will be made unique, and dates will be converted from a variety of formats into a consistent one. Normalized dates will then be used to filter out names presumed deceased or who deceased as minors. A parameter table will be used to control cutoff ages.
Enhancement adds value to the data by categorization in both name and date dimensions. Names will be identified by the "trunk line" of lineage, and dates will be grouped by "era", as the century of death, or "living" if under a maximum age.
Cleaning the Data
Handling Dates Before 1900
To begin the cleaning, the first thing that stands out is that dates are in a variety of formats. My dates range from year 1210 to present. FTM is very forgiving about dates, Excel is not. FTM accepts year only, year and month, or year, month and day as a date. Excel dates must have all three - given only the year where a date is expected, it will interpret the number as "days since Jan 1, 1900". Any date before that cannot be interpreted as a date. This seldom-encountered limitation is immediately apparent when working with historical data. (For a comprehensive explanation and solution for handling such dates, see How to Work with Dates Before 1900 in Excel - ExcelUser.com.)
I decided that, for the purposes of my graphical timeline, I will only use years and not bother with actual dates, but I still have to normalize the various dates as imported to get accurate year. Some of them are already 4-digit year; some are "unknown"; full dates >= 1/1/1900 are Excel date serial numbers, and full dates < 1/1/1900 are text. This requires a conditional formula to return a numeric year in every case.
Text Dates
The dates in my data that are interpreted as text (e.g. Feb 27, 1658) always end with 4-digit year. Once these are identified by ISTEXT, I'll use VALUE(RIGHT(date,4)) to convert to numeric year.
Excel Dates and Years
Any date that is not text must be either a 4-digit year (e.g. 1629, 1784, 1932,...) or a full date >=1/1/1900 (e.g. 11-May-54). The tricky point here is to distinguish between these two. For example, if the date is simply 1954, I can see it's a year, not a serial number, but the formula must catch the difference. If I call YEAR(1954) I'll get 1905. Fortunately, the import has formatted the dates correctly: years as "General" and dates as "dd-mmm-yy". So I can use CELL("format",date) to get the format code. If the code is "G", take the date-as-year directly; if not, convert by YEAR(date).
Blank or "unknown" Dates
At first I thought to ignore these, but later added a condition to give them a "dummy year" value. The reason for this is explained below. The method is to assign a constant, "unDate", set = 3000, and add ROW(). This dummy number is hidden from view to avoid confusion, by a conditional number format to display "un". It is also later filtered out from the timeline. Its only purpose is to build unique names.
Birth year formula:
Building Unique Names
As is common in family trees, in my data are many duplicate names, with up to eight individuals sharing the same name. In order to be able to refer specifically to each individual name, I devised to suffix duplicates with a number. To come up with a unique number, I use date order. I can safely assume that there are no two identical names born the same year.
To get from "NameOriginal" to "NameUnique" is then a three-step process, which I lay out in three columns for transparency.
In step 1, [ct] returns how many of this row's name are in the name column.
In step 2, [id] builds an array string by FILTER of the [Birth] years for this row's name, converted to text by ARRAYTOTEXT. This holds a string of comma-separated years. Each year takes six characters including 4-digit year, comma and space. The "informal" array string will thus look something like 1658, 1718, 1735, etc. The FIND position of the current row's year in that array will be 1, 7, 13, etc. Thus, 1+(find-1)/6 gives ordinal 1, 2, 3, etc.
Finally, step 3 uses the previous two steps to suffix the name with [@id] only if there are multiples of the same name.
This image shows how [ct], [id] and [NameUnique] are connected. Also shown are [Status] and [Filter] which work to exclude those deceased as minors or presumed deceased.
Enhancing the Data
Two columns serve to group rows and will feed slicers. The groupings are maintained in configuration tables and retrieved by XLOOKUP.
Lineage trunk lines are a feature of FTM color-coding, see below. By selecting an individual, you can color-code all of that person's ancestors, and descendants of those ancestors, in four colors, each representing one of the home person's grandparents. This is very helpful to quickly see which side of the family each name is on, and will be essential in the graphical timeline, as seen in the image at top.
Unfortunately, the FTM colors, or the trunk line ids they represent, are not exportable. I spent some time trying to find a way to replicate the logic in Excel, to no avail. It's theoretically possible, practicality is the question. The game is afoot, but for now, I bit the bullet to manually copy the colors as numbers. FTM displays them in the name index, which matches my Excel sheet exactly, so the process was reasonably painless. I've submitted an enhancement request to the FTM publisher to make the line exportable, however I will eventually need to confront the challenge in Excel anyway, to enable selection of the home individual without going through the whole download again. Sometimes you've just got to move on.
[Line] looks up the [@grp] value manually copied from FTM to represent the color code, and returns the line name, representing the birth surname of the home person's four grandparents. For grandparents and above, [grp] is 1, 2, 3, 4. For example, 1 = father's father (and his ancestors), 2 = father's mother, 3 = mother's father, 4 = mother's mother, etc. For the grandparents' descendants it will be a combination, such as 12 for aunts and uncles and cousins on the father's side, 34 on the mother's side, or 1234 for the home person and their siblings. The names of these combinations are made up as the first letters of the surnames included, prefixed with"x".
[Era] looks up death year, if found, or birth year if not, and returns the century, or if status is not deceased, "living".
All this preparation work of course does not come at once with perfect forethought, but over iterations of trial and error.
Coming Soon
In the next episode (spoiler alert!), I'll present the timeline structure and formatting, including (but not limited to) dynamic columns (years or decades, optional start year), slicers, Conditional Formatting of course, and the integration of public history with family history. This will fulfill the promise of the teaser image at top.
Part 3 will cover another output table that displays eight generations of ancestors (that's 510 rows, in case you're wondering) for a selected individual.
Stay tuned!
Comments