Relational Databases, The Good Parts: Part 1

Every successful technology has good ideas. Raw merit alone, however, does not propel a technology from being a good idea to becoming widely adopted. Careful strategy and haphazard social forces both play a part in deciding which technology will emerge as the leader and how long it will stay on top.


For every good idea, there is a bad idea. In truth, ideas can seem great to some people, or great right now, and turn out to be horrible in the broader scope. Great ideas are often formed by cobbling together many good ideas. To an extent this approach can work well, but eventually the growing complexity hits a critical mass. People are different and have different tolerances for complexity. Some find certain forms of complexity appealing and challenging. Others find the same complexity bothersome and confusing while valuing different complexities. One man's toil is another's delight. People learn what they are taught and what works for them, and they learn to take the good with the bad.

In JavaScript: The Good Parts, Douglas Crockford explores and exposes the best parts of a mostly good idea: JavaScript. Like so many other good ideas, finding the good in JavaScript requires the right perspective.

The aim of this and the other posts in this series is to explore the good ideas of Relational Databases and expose how to most elegantly harness the technology regardless of which of the many "flawed" implementations you might choose to pursue.

I say "flawed" because none of the current commonly used relational database technologies strictly adheres to the Relational Model. Many deviate intentionally and for purely practical reasons. Most stakeholders wanted features that go against the grain of the pure theory. This is not necessarily bad. Commercial and Open Source projects implement features that give a competitive edge, even when such an edge is short-sighted. Pure theory can be a bad idea when it doesn't best serve the needs.

As with many things, people have to make decisions for themselves. The right choice is subjective. This series simply aims to lay out one clear path through the myriad of options for implementing a relational database. The goal is to cut through the maze and provide a clear method.

The Relational Model In a Real World

One of the challenges of using any technology well is finding elegant ways to interface it with reality. The Relational Model at a pure level is abstract and mathematical. Many early implementations were interactive in nature, providing a direct interface to the relational model on its own terms. This direct interaction between user and technology is one of the best features of relational databases, but it also led to some very problematic developments.

Well designed databases using the Relational Model, from a technical perspective, are said to be normalized. There are very specific logical standards to the process of designing normalized databases.

From a human perspective, normalization tends to increase the complexity of a database. Normalizing a database typically involves breaking it down into more relations. Working directly with normalized databases requires a greater level of expertise with the technology and more abstract thinking. This conflict between making databases easier to interact with and implementing them in a technically correct way is at the center of much of the ugliness of modern database technology.

Relational databases should be interactive, but they should also be normalized. A de-normalized database is little more than a spreadsheet, a useful technology in-and-of itself.

The elegant solutions to the normalization/usability dilemma this series of posts will explore are solutions that focus on abstraction through other technologies to provide both ease of use and ease of development.

Building Blocks, The Relational Model

The core technology this series will focus on is the Relational Database, which logically is based on the Relational Model.

As succinctly as possible, and at the risk of oversimplifying the concepts, the relational model can be defined as:
  • A set of uniquely named database variables, each being:
    • a set of uniquely named relation variables, each being:
      • a set of n uniquely named attributes, each being:
        • exactly one type of data
      • a set i of tuples, each being:
        • a set of n data variables. In each tuple there is exactly one variable referencing each named attribute. The value of that variable must be of the data type for the referenced attribute.
      • a set of r uniquely named keys, each being:
        • a set of attribute constraints. Each set is either a uniqueness constraint (local) or a constraint defined by tuple values associated with attributes in another relation (foreign constraint).

[Figure: diagram of the entities in the relational model as described above.]

Something to note with the definitions above is that a system using the relational model may have only a single database rather than a set. Many common uses of relational databases focus on a single database, but it is important to be aware that implementations spanning multiple databases exist.

Additionally, the databases, relations, tuples, and data in a database are said to be variables. There may be some variables in a particular relational database that should not or cannot be changed, but for the purpose of this overview of the relational model all of these things are variable, meaning they can change over time.

Of particular importance in this definition is the notion of types. Each discrete piece of data in a tuple has a type, but tuples also have a type. The type of a data variable is dictated by the type in the referenced attribute. The type of a tuple value is dictated by the set of attributes for the relation. Relations and databases too can be said to be typed. The attributes that dictate the types of tuples also dictate a type for the relation. The set of relation types in turn dictates a type for the database. This system of typing will be explored in more detail in following sections.
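To make the typing described above a little more concrete, here is a minimal Python sketch of a single relation. The Relation class, its heading and insert methods, and the "person" example are all hypothetical names invented for illustration; they are not part of the Relational Model or of any database product.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """Hypothetical in-memory relation: named, typed attributes plus a set of tuples."""
    name: str
    attributes: dict  # attribute name -> Python type standing in for the data type
    tuples: set = field(default_factory=set)

    def heading(self):
        # The relation's type: its set of (attribute name, data type) pairs.
        return frozenset(self.attributes.items())

    def insert(self, **values):
        # Each tuple holds exactly one value per named attribute.
        if set(values) != set(self.attributes):
            raise ValueError("tuple must supply exactly the relation's attributes")
        # Each value must be of the type dictated by its attribute.
        for attr, value in values.items():
            if not isinstance(value, self.attributes[attr]):
                raise TypeError(f"{attr} expects {self.attributes[attr].__name__}")
        # A set keeps the tuples unordered and free of duplicates.
        self.tuples.add(frozenset(values.items()))


people = Relation("person", {"name": str, "age": int})
people.insert(name="Ada", age=36)

try:
    people.insert(name="Grace", age="unknown")  # wrong type for "age"
    type_error_raised = False
except TypeError:
    type_error_raised = True
```

In this sketch the tuple type and the relation type are both derived from the same attribute set, which is the point the paragraph above makes.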

Finally, the notion of uniqueness in naming should be explained as to say that no two databases may have the same name, and within each database no two relations may have the same name, and so on. Each entity in the relational model can be said to have a fully qualified name that includes the entities that contain it.

Thus two relations with the same name but in different databases are allowed, because the fully qualified name of each would include different database names. Similarly, two attributes in different relations may have the same name, and an attribute may have the same name as a relation. Because keys and attributes are both contained by relations, using the same name for a key and an attribute in the same relation is strongly discouraged. Some database implementations may allow it, but it is a bad idea.

The terminology used above follows the terms defined by the Relational Model. Relations are commonly referred to in database terminology as "tables". Attributes are commonly referred to as "columns" and tuples as "rows". The table-column-row model of database design is certainly useful when thinking about relations conceptually, but can be misleading if you take the analogy too far.

Normalized Relational Databases

There are, by and large, two main considerations that go into designing relational databases: normalization and practical use. A highly normalized database has two key benefits: reduced redundancy and less potential for inconsistency. The downside to a highly normalized database is that it can be more complex to access useful information directly. It is not uncommon to intentionally create databases that are to some extent de-normalized, making it more practical for people to interactively access the information stored in the database.

This series comes from the perspective that "simple" is a subjective quality. The normalized relational databases that arise from the approaches described will be more "complex" to the naked human eye than a practical de-normalized database, but more powerful and elegant to the human eye when focused through appropriate technology.

The methods described will not only create highly normalized databases, but also normalize naming and design conventions in a way that does not limit the raw power of the relational model, but rather tames it. Using well established technology outside the scope of the relational model, each successive post will reveal a more elegant perspective on relational database technology.

Value Types

The relational model prescribes a specific structure of information: databases containing relations as described above. The relational model provides a way to store and represent information in the form of facts, truths.

Without getting too far off on a tangent, it is important to note that relational databases can model the same kinds of information possible in many other models. These models include prose text, XML, binary files, and just about any information representation imaginable. These different models each have different traits that make them better suited to different applications.

Many primers on relational databases describe database design in terms of Entity-Relationship modeling and diagrams. These techniques can be useful, but for the purpose of this discussion a more basic type based exploration will be employed. As the name suggests, Entity-Relationship modeling brings its own model, which is not always held to be the same thing as the Relational Model.

Each variable in a relation's tuple references a value. The value must always be of the type specified by the corresponding attribute in the same relation.

To keep things as simple and powerful as possible, it will suffice to define each of these types as a set of possible values. The value referenced by a variable of a type must be a value in that type's set of possible values.

Again to keep things simple, we will define a value to be any possible value: a number, a boolean value, a string, a complex data type such as an array, or even a relation, tuple, or database. There are a great many implementations of the relational model, each with its own specific limitations as to what a value may be.

Key Types

The values associated with each variable of a tuple are constrained to the set of values for that variable's data type. This type comes from an attribute of the relation, and each tuple has exactly one variable per attribute. It is important to consider the use of the word 'set' in the previous definition. Sets are unordered, and each member of a set must be unique in that set. For the named entities in the relational model (databases, relations, attributes, and keys), the name alone satisfies this uniqueness requirement. There may be two databases that are functionally identical, because their names will always ensure uniqueness. There may also be two attributes in the same relation that are associated with the same data type, because each will have a unique name.

Tuples are the only entity in the relational model that does not have an inherent unique identifier. The variables within each tuple are uniquely associated with an attribute of the relation. The tuple itself has only one way to be uniquely identified: itself. For this reason, each tuple in a relation must be unique. No two tuples may be comprised of variables associated with the exact same values.

As mentioned in the definition above, keys come in two varieties: local and foreign. Local keys designate one or more proper subsets of the attributes of the relation which must also have unique values for every tuple. Each local key restricts the possible tuple values and creates a functional dependence of the attributes not in the key on the attributes in the key.

Note that the definition of a relation creates an implicit "key" on the set of all attributes, since each tuple must be unique. If a relation has an explicitly defined key on one attribute, no more than one tuple in that relation may ever have each of the possible values for that attribute. Other attributes for which there is no key may have duplicate occurrences of the same value, both collectively and each individually.
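The implicit "key" on the set of all attributes is one place where SQL tables and relations part ways. The sketch below uses SQLite through Python's sqlite3 module purely as a convenient stand-in for "some implementation"; the person and person_set tables are hypothetical. A plain SQL table with no declared key accepts the same row twice, while a Python set, like a relation, does not.

```python
import sqlite3

# Relational model: the body of a relation is a set, so a repeated tuple collapses.
relation = set()
relation.add(("Ada", 36))
relation.add(("Ada", 36))  # same values, therefore the same tuple
relation_size = len(relation)

# SQL table with no declared key: duplicate rows are accepted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
conn.execute("INSERT INTO person VALUES ('Ada', 36)")
conn.execute("INSERT INTO person VALUES ('Ada', 36)")
table_size = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]

# Declaring a key over every column makes the implicit key explicit.
conn.execute("CREATE TABLE person_set (name TEXT, age INTEGER, UNIQUE (name, age))")
conn.execute("INSERT INTO person_set VALUES ('Ada', 36)")
try:
    conn.execute("INSERT INTO person_set VALUES ('Ada', 36)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```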

A key may include more than one attribute. When it does, it does not constrain the values of each attribute individually, but rather the pair, trio, quartet, etc. of attributes. That is to say, if there is a local key on a single boolean attribute, the relation may have at most two tuples: one for key "0" and one for key "1". If there is a local key on a pair of boolean attributes, the relation may have up to four tuples with keys "00", "01", "10", and "11". This is not the same net effect as placing each attribute in a key by itself. In that case there could be at most two of the previous four pairs, and a third would always violate the uniqueness constraint on one of the two attributes. If "00" was the first tuple defined, the only other possible tuple would be "11", since "01" would violate the uniqueness of the first attribute and "10" would violate the uniqueness of the second.
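The difference between one key over a pair of attributes and two separate single-attribute keys can be tried out directly. This is a small sketch using SQLite via Python's sqlite3 module, with UNIQUE constraints standing in for local keys; the table t, its columns a and b, and the count_accepted helper are hypothetical.

```python
import sqlite3
from itertools import product


def count_accepted(ddl):
    """Create table t from ddl, try all four boolean pairs, return how many were accepted."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    accepted = 0
    for a, b in product((0, 1), repeat=2):  # (0,0), (0,1), (1,0), (1,1) in that order
        try:
            conn.execute("INSERT INTO t (a, b) VALUES (?, ?)", (a, b))
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # the row violated a uniqueness constraint
    conn.close()
    return accepted


# One key over the pair: every combination is distinct, so all four fit.
pair_key = count_accepted("CREATE TABLE t (a INTEGER, b INTEGER, UNIQUE (a, b))")

# One key per attribute: (0,0) fits, (0,1) and (1,0) each repeat a value, (1,1) fits.
separate_keys = count_accepted("CREATE TABLE t (a INTEGER UNIQUE, b INTEGER UNIQUE)")
```

Under these assumptions pair_key comes out as 4 and separate_keys as 2, matching the "00" and "11" case described above.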

Attributes may appear in more than one key. Although keys are named, and therefore not required to be functionally unique, there is no reason to have keys that are not functionally unique.

Foreign keys are defined as:
  • a set of n attribute pairs where:
    • Each pair must have exactly
      • one attribute in the "local" relation and,
      • one attribute in some other relation, a "foreign relation".
      • Both attributes in each pair must be of the same type.
    • The foreign relation for every pair must be the same relation.
    • The set of foreign attributes from all pairs is exactly the set of attributes of at least one local key in the foreign relation.
    • No attribute, local or foreign, may appear in more than one pair.
The Relational Model does not explicitly prohibit the foreign relation being the same as the local relation, but this is one case where good design counterindicates such a practice.

Just as local keys enforce an additional uniqueness constraint on values associated with the attribute (on top of constraints from that attribute's type), foreign keys also enforce an additional constraint. Instead of requiring that values be unique, the foreign key requires that values associated with the attribute be in the set of values of the foreign relation. As an example, if the type of the attribute is boolean as in the earlier example and the foreign relation has only a "1" value, the foreign key would require that the local relation only have a "1" value. If that attribute in the local relation was also a local key, the relation could have at most one tuple. If the attribute was not also a local key in the local relation, the relation could have multiple tuples, so long as each tuple was unique and no tuple had a value for that attribute other than "1". Foreign keys limit the value in the local attribute to the set of values associated with the foreign attribute. This is always a subset (though not always a proper subset) of the attribute's type.
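The boolean example above can be reproduced with a short sketch. It assumes SQLite through Python's sqlite3 module, where foreign key enforcement has to be switched on per connection; the tables foreign_rel and local_rel and their columns are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# The foreign relation holds only the value 1 in its local key.
conn.execute("CREATE TABLE foreign_rel (flag INTEGER UNIQUE)")
conn.execute("INSERT INTO foreign_rel VALUES (1)")

# The local attribute "flag" is a foreign key but not a local key.
conn.execute(
    "CREATE TABLE local_rel (label TEXT, flag INTEGER REFERENCES foreign_rel (flag))"
)
conn.execute("INSERT INTO local_rel VALUES ('first', 1)")
conn.execute("INSERT INTO local_rel VALUES ('second', 1)")  # repeated 1 is fine

try:
    conn.execute("INSERT INTO local_rel VALUES ('third', 0)")  # 0 is not in foreign_rel
    zero_rejected = False
except sqlite3.IntegrityError:
    zero_rejected = True

local_count = conn.execute("SELECT COUNT(*) FROM local_rel").fetchone()[0]
```

Although 0 is a perfectly good value of the attribute's type, the foreign key narrows the allowed values to those present in the foreign relation.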

The Relational Model and many database technologies draw a distinction between different types of what this series will call "local keys": primary and candidate. These can be semantically important when working directly with specific database technologies, but the prescribed design practices will make the distinction unnecessary aside from implementation specific quirks.

Just as relations have a type as defined by the set of attributes, keys too have a type based on their own set of attributes. The type of a foreign key includes the name of both the local and foreign relations. Key types in-and-of-themselves are not an important consideration in the design practices that will be discussed, but they give rise to another type that is central: Correlation Types.

Correlation Types, Direct

In relational databases, everything is connected to some extent. The most basic correlation is between all the information that is explicitly stated. All of the information represented is said to be "true". Information not represented is assumed false. Information in each database, at the very least, is related by its presence in that database. Each database is a different context of some sort. Presence in one database but not another says something about that information, though exactly what it says hinges largely and subjectively on the designer's intent. Two different databases may represent the exact same types of information, entirely different types with no overlap, or some compromise between these two extremes.

The strongest form of correlation in a database is the direct correlation, a one-to-one mapping. Direct correlations come in two varieties: those formed between functionally independent values in a tuple, and those formed between functionally dependent values in a tuple.

In the relational model, the attributes of a relation variable define one-to-one (reciprocally required) relationships between two value types. This implicit relation is the only acceptable method of creating a one-to-one relationship directly between two values. It is also conveniently the only method required.

Mathematically speaking, each tuple of i attributes defines ((i-1)^2 + (i-1)) / 2 direct correlations, which simplifies to i(i-1)/2, the number of distinct pairs of attributes. In relations with no explicit keys, all of the values are said to have an independent correlation with each other. This means that any combination of values in the tuple that otherwise conforms to the associated attribute type constraints is allowed.
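As a quick sanity check on that count, the following Python sketch compares the formula as written with a brute-force enumeration of attribute pairs. The function name direct_correlations and the attribute names are made up for the example.

```python
from itertools import combinations


def direct_correlations(i):
    """Number of direct correlations in a tuple of i attributes, per the formula above."""
    return ((i - 1) ** 2 + (i - 1)) // 2


attributes = ["a", "b", "c", "d"]
# Every unordered pair of attributes is one direct correlation.
pairs = list(combinations(attributes, 2))
pair_count = len(pairs)
formula_count = direct_correlations(len(attributes))
```

For four attributes both counts are 6, and the formula agrees with the more familiar i(i-1)/2 for any i.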

Relations with at least one key however create a different kind of correlation between the set of values participating in the key, and the set of values not participating in the key. This distinction hinges on the notion of Functional Dependence, but to keep the definition simple for the purpose of this series an assumption will be made:

To simplify design decisions, it is helpful to consider that in a highly normalized database it is rarely the case that any key with more than one attribute is needed. It follows that, as much as possible, the design methodology this series describes will avoid keys that use multiple attributes. With this assumption in mind, rather than saying one set of values is functionally dependent on another set of values, it is possible to say one or more individual values are functionally dependent on one or more single values.

A functional dependency between two values means that the value of one in some way constrains possible values within the same relation for others. In a relation, if value B is functionally dependent on A then the following statements are true:
  • A is a local key, thus there are no duplicate values for A.
  • Because there are no duplicate values for A, there is exactly one value for B for each value for A.
The reverse cannot be assumed from B being functionally dependent on A. B is not necessarily a local key, there may be duplicate values for B, and there may be more than one value for A for each value for B. When B is functionally dependent on A, the correlation from B to A is said to be a dependent correlation. Unless A is also explicitly functionally dependent on B, such as the case where they are both local keys, A retains an independent correlation to B.
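The asymmetry described above can be checked mechanically against sample data. This is a minimal sketch in plain Python; the helper functionally_depends and the sample rows are hypothetical, and the check only inspects the tuples it is given rather than proving anything about the design.

```python
def functionally_depends(rows, determinant, dependent):
    """Return True if every determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


rows = [
    {"A": 1, "B": "red"},
    {"A": 2, "B": "red"},   # B may repeat because B is not a local key
    {"A": 3, "B": "blue"},
]

b_depends_on_a = functionally_depends(rows, "A", "B")
a_depends_on_b = functionally_depends(rows, "B", "A")
```

Here B depends on A because each value of A appears once, while A does not depend on B because "red" is paired with both 1 and 2.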

This notion of correlation is specifically directional, meaning that while the correlation may be reciprocally identical for both participants in some cases, there are other cases where the nature of the relationship for each participant is different. For direct relationships between values in a tuple two values may be:
  • mutually dependent, or
  • one may be dependent on another that reciprocates the relationship independently, or
  • they may be mutually independent values.
As one might imagine, indirect correlations are also possible and come in a variety of forms, but they all hinge on the use of a foreign key. Indirect correlations are also directional. All direct correlations create one-to-one relationships, whereas indirect correlations allow many other types of relationships between the values they connect.

Correlation Types, Indirect

When two values are in the same tuple a direct correlation occurs. Two values in different relations that are related to each other through one or more foreign keys have an indirect correlation.

The interplay between the attribute that creates a foreign key, and the local key in another relation it points to creates a different kind of constraint than functional dependency. Take the following entities into consideration:
  • Attribute A with value a in Relation X, which is a local key
  • Attribute B with value b in Relation X
  • Attribute C with value c in Relation Y, which is a local key
  • Attribute D with value d in Relation Y
The direct correlation (ignoring the local key for now) between A and B ensures that for every a there is exactly one b. This is from the independent correlation that occurs by cohabitation in the same tuple.

The local key on A adds further constraint from dependent correlation such that for every a there is only one value for b. There may be multiple values for a for each b however.

The same two statements above also apply to C, c, D, and d, as the relationships mirror those of A, a, B, and b respectively.

So far, nothing can be said about any relationship between A and C, B and C, A and D, or B and D.

There is nothing mutually exclusive about an attribute's ability to be a local key and a foreign key. Attributes may be either, neither, or both. The basis for different indirect correlation types and the functional effect they have on relationships between values focuses on the following two conditions:
  • Whether the foreign key is also a local key or not, and
  • The foreign key's significance as a local key (when applicable).
The first condition should be clear by now: either an attribute is or is not a local key. The effect of being a local key on an attribute is also clear: it means no two tuples in the relation may share the same value for that attribute. The second condition presents more of a challenge.

In the formal Relational Model, many authors discuss two forms of local keys: Primary keys and candidate keys. Technically speaking, the distinction is arbitrary. Implementation specific quirks aside, there is no functional difference in the relational model between primary and candidate keys. This is precisely the reason these two terms are not used elsewhere in this series. If a pure theory approach is taken, it suffices to say that all "local keys" are candidate keys and if one is assigned as a primary key arbitrarily there is no technical significance to that selection.

This goes back to the earlier point about the difference between pure theory and human friendly design choices. There is sometimes disagreement about whether normalization is good or bad, important or undesirable, and why. Reasons for normalization include both robustness and economy of storage. The compromise this design process will exemplify considers the following from most to least important:
  • Normalization towards making the database more robust is important.
  • For the most part, the relational database and any complexities introduced by adhering to sound theory will be abstracted, so high normalization is good.
  • Some human will eventually or occasionally have to directly interact with the relational database.
  • Normalization as a means to drive down storage requirements is a lower priority than the above.
This is not to say that storage requirements are not important. The normalization used will help keep the storage requirements low. The design process does however necessitate that:
  • for every relation there is a primary key which is a single attribute, and
  • the primary key will be an artificial key, unless there is a significant and deliberate reason to identify a natural key (a value that is inherently part of the data being stored).
The reason for this design decision goes beyond considerations discussed so far and will be covered in future posts. Suffice to say that the artificial primary key will introduce additional storage overhead, but it makes the abstraction easier, it makes direct interaction with the highly normalized database easier, and it does not compromise the robustness of the database.

As mentioned, by default the assumption is that for every relation there is a primary key comprised of a single attribute. For most relations the safe assumption is to assign an artificial primary key, and many relational database technologies make this easy. An artificial key is a value that is part of each tuple but does not functionally change the information the tuple represents.

It has already been mentioned that in the relational model tuples alone lack a unique name. Databases, relations, attributes, and keys are all named uniquely. The notion of a local key is vaguely analogous to the unique names available to other entities in the relational model. Any local key for a relation can be used to select exactly one tuple uniquely. The notion and perceived importance of a "primary" key is probably deep seated in the human tendency to name and label things.

Sometimes the information represented by a tuple inherently includes a "unique identifier". The identifier is part of the information. When this identifier is truly unique, meaning no other distinct tuple in the relation could ever logically use the same identifier, all the other information in the tuple is functionally dependent on it. It stands to reason then that it would be desirable to make that identifier a local key, and as it turns out this is exactly the right design decision.

The problem, however, is that while this part of the information may always be unique, the decision to make it an identifier is a bit trickier. Uniqueness is a semantic property that can safely be relegated to local keys. Identity on the other hand presents a challenge: could the value with the identity change? The semantic property of identity can take on different meanings depending on the intent of the designer. Exactly what is meant by "semantic" will be covered in the next post, but for now it is important to simply state that from a design perspective it is undesirable for the identity associated with a tuple to ever change. For this reason the assumption is that the identifier attribute chosen (the primary key) for each relation is one that is arbitrarily assigned by the database and is never changed.
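One possible shape for this decision is sketched below, assuming SQLite via Python's sqlite3 module. The person table and its id, email, and name columns are hypothetical. The artificial identifier is assigned by the database, while the naturally unique value is protected by an ordinary local key and remains free to change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY,"      # artificial identifier, assigned by the database
    " email TEXT NOT NULL UNIQUE,"  # natural value kept unique by a local key
    " name TEXT NOT NULL)"
)

cur = conn.execute(
    "INSERT INTO person (email, name) VALUES (?, ?)", ("ada@example.com", "Ada")
)
ada_id = cur.lastrowid  # the identifier the database chose

# The natural value changes; the identity of the tuple does not.
conn.execute("UPDATE person SET email = ? WHERE id = ?", ("ada@example.org", ada_id))
row = conn.execute("SELECT id, email FROM person WHERE name = 'Ada'").fetchone()

# Uniqueness of the natural value is still enforced by its own key.
try:
    conn.execute(
        "INSERT INTO person (email, name) VALUES (?, ?)", ("ada@example.org", "Impostor")
    )
    duplicate_email_rejected = False
except sqlite3.IntegrityError:
    duplicate_email_rejected = True
```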

Designers may explicitly decide that a primary natural key is appropriate for a relation. There are certainly cases where this choice makes sense. It is however a choice that should be both deliberate and carefully made. Beyond this point this series will only refer to primary keys as identifiers when they are artificial keys and natural identifiers otherwise.

To return to the reason for this explanation of the significance of a local key to its relation, identifiers and local keys that are not identifiers have slightly different meanings. Both convey the exact same logical constraints on values; what distinguishes the two is the designer's intention behind the correlation they create. The three basic indirect correlation types, and variants of each, are the focus of the remainder of this post.

Super Correlation

The first correlation type is the Super correlation. Take as an example slightly modified versions of the previous relations X and Y:
  • Attribute A with value a in Relation X, which is an identifier for X
  • Attribute B with value b in Relation X
  • Attribute E with value e in Relation Y, which is an identifier for Y and foreign key to A
  • Attribute F with value f in Relation Y
Note, the definition specifies that E is also a foreign key to A. This means that each value e must be one of the values a. It can then be said that X is a super of Y, or that X is extended by Y. This correlation is so defined because for every tuple in X there may be zero or exactly one related tuple in Y. In this way,
  • Y relates additional information to some tuples of X, but not necessarily all tuples of X
  • there may be no more or fewer than one related tuple in X for each one in Y
  • there may be no more than one related tuple in Y for each one in X
  • thus, the super correlation's cardinality is one-to-zero-or-one.
For the tuples in X, those that have an identifier existing in both a and e are of a supertype X?Y, while those that do not are of the subtype X.
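A super correlation can be sketched in SQLite (through Python's sqlite3 module) by making the identifier of Y a foreign key to the identifier of X. The table and column names mirror the X, Y, A, B, E, F lettering above; the rejected helper and the sample values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("CREATE TABLE x (a INTEGER PRIMARY KEY, b TEXT)")
# e is both the identifier of y and a foreign key to x.a
conn.execute("CREATE TABLE y (e INTEGER PRIMARY KEY REFERENCES x (a), f TEXT)")

conn.executemany("INSERT INTO x VALUES (?, ?)", [(1, "first"), (2, "second")])
conn.execute("INSERT INTO y VALUES (1, 'extra detail for 1')")


def rejected(sql):
    """Return True if the statement violates a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


second_extension = rejected("INSERT INTO y VALUES (1, 'another')")    # e must be unique
orphan_extension = rejected("INSERT INTO y VALUES (3, 'no such x')")  # e must exist in x.a

extended = conn.execute(
    "SELECT x.a, y.f FROM x LEFT JOIN y ON y.e = x.a ORDER BY x.a"
).fetchall()
```

The left join shows the one-to-zero-or-one shape: tuple 1 in x carries its extension from y, while tuple 2 has none.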

The next post will clarify how value types and correlation types are similar and how they are different. For now it will suffice to mention that this series will focus on a type oriented approach to design and that every type mentioned is a subtype, supertype, or both of some other type.

Interestingly, the super correlation itself has no subtypes (only supertypes).

Enum Correlation

A similar looking, but differently functioning, correlation type is the Enum correlation. Given the following definition for X and Y:
  • Attribute A with value a in Relation X, which is an identifier for X
  • Attribute B with value b in Relation X, which is a foreign key to E
  • Attribute E with value e in Relation Y, which is an identifier for Y
  • Attribute F with value f in Relation Y
In this case X has a foreign key B to another relation's (Y) identifier. Note that in this case the foreign key is not an identifier itself, as it was in the case of the super correlation. Here X has either its own local key identifier or possibly an identifier that is unrelated to Y. It should not be assumed that because A is not declared as a foreign key above it is not a foreign key to any relation, only that it is not a foreign key that correlates to any part of Y. With this correlation type it can be said that X aggregates Y. The result is that,
  • every tuple in X must correlate to a tuple in Y, and
  • multiple tuples in X may correlate to the same tuple in Y
  • tuples in Y exist independently of any correlation from X, meaning there may be tuples in Y that do not correlate to any in X
  • thus the enum correlation's cardinality is zero-or-many-to-one.
The effect is that tuples in Y represent a list of possible values related to a tuple in X. This relationship type removes the need for any enumerated value type. When possible and practical, relationship types are preferable to value types.
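As an illustration, the enum correlation might look like this in SQLite via Python's sqlite3 module. The tables follow the X and Y lettering above and the option labels are invented. The NOT NULL on b is an assumption added so that SQL's NULL cannot sidestep the rule that every tuple in X correlates to a tuple in Y.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("CREATE TABLE y (e INTEGER PRIMARY KEY, f TEXT)")
# b is a foreign key to y.e but is not a local key of x
conn.execute(
    "CREATE TABLE x (a INTEGER PRIMARY KEY, b INTEGER NOT NULL REFERENCES y (e))"
)

conn.executemany(
    "INSERT INTO y VALUES (?, ?)", [(1, "open"), (2, "closed"), (3, "archived")]
)
conn.executemany("INSERT INTO x VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

try:
    conn.execute("INSERT INTO x VALUES (4, 9)")  # 9 is not one of the options in y
    unknown_option_rejected = False
except sqlite3.IntegrityError:
    unknown_option_rejected = True

# How many tuples in x use each option in y, including options nobody uses.
usage = conn.execute(
    "SELECT y.f, COUNT(x.a) FROM y LEFT JOIN x ON x.b = y.e GROUP BY y.e ORDER BY y.e"
).fetchall()
```

The usage counts show the zero-or-many-to-one shape: "open" is shared by two tuples in x and "archived" by none.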

There is one specialized version, a subtype, of the enum correlation: Unique Enum. In this correlation type the foreign key (in X) is also a local key, but not an identifier. The result is that while tuples in Y represent options for a value in tuples in X, the cardinality is different: zero-or-one-to-one.

This cardinality is the same (just in the reverse direction) as that of the super correlation, but the conveyed design intention is different. It means that Y represents options for values in tuples in X, but each option represented in Y may only correlate to exactly one tuple in X, and not all options in Y have to be represented in X. In the unique enum, instead of saying that X aggregates Y it can instead be said that X owns Y.
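The unique enum differs from the plain enum only by a local key on the foreign key attribute. A minimal sketch, again assuming SQLite through Python's sqlite3 module with hypothetical table contents, adds UNIQUE to b:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("CREATE TABLE y (e INTEGER PRIMARY KEY, f TEXT)")
conn.execute(
    "CREATE TABLE x ("
    " a INTEGER PRIMARY KEY,"                      # identifier for x
    " b INTEGER NOT NULL UNIQUE REFERENCES y (e))"  # foreign key that is also a local key
)

conn.executemany("INSERT INTO y VALUES (?, ?)", [(1, "desk-1"), (2, "desk-2")])
conn.execute("INSERT INTO x VALUES (1, 1)")

try:
    conn.execute("INSERT INTO x VALUES (2, 1)")  # option 1 is already owned
    shared_option_rejected = False
except sqlite3.IntegrityError:
    shared_option_rejected = True

# Options in y that no tuple in x owns are still allowed to exist.
unowned = conn.execute("SELECT f FROM y WHERE e NOT IN (SELECT b FROM x)").fetchall()
```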

Jointype Correlation

The third common correlation type involves three relations rather than just two. It also has two subtypes. These relations are defined as:
  • Attribute A with value a in Relation X, which is an identifier for X
  • Attribute B with value b in Relation X
  • Attribute C with value c in Relation Y, which is an identifier for Y
  • Attribute D with value d in Relation Y
  • Attribute E with value e in Relation Z, which is an identifier for Z
  • Attribute F with value f in Relation Z, which is a foreign key to A
  • Attribute G with value g in Relation Z, which is a foreign key to C
In this form of correlation Z has two foreign keys, one to A and one to C. In the base Jointype correlation, references to A and C are local keys for Z. Note that a direct enum correlation is formed between Z and Y and between Z and X.

The foreign key association through Z creates an indirect many-or-zero-to-many-or-zero correlation between X and Y. This jointype is formed by two enum relations, but is defined by the indirect correlation between X and Y, not by either of those two relations' direct correlation to Z. The properties of this correlation type are:
  • each tuple in X exists independently of tuples in Y and Z
  • likewise, each tuple in Y exists independently of tuples in X and Z
  • there is no restriction on how tuples in Z may refer to tuples (so long as they exist) in X, Y, or any combination of the two, so long as for every correlation to some tuple in X there is exactly one to some tuple in Y and vice versa.
The definition for jointype correlations is not limited to three tables. Z could correlate to any number of other relations greater than one.
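The base jointype correlation might look like the following sketch, assuming SQLite through Python's sqlite3 module. The student (X), course (Y), and enrollment (Z) tables are hypothetical stand-ins, and the comments map their columns onto attributes A through G from the definition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE student (                 -- X
        student_id INTEGER PRIMARY KEY,    -- A, identifier for X
        name       TEXT NOT NULL           -- B
    );
    CREATE TABLE course (                  -- Y
        course_id INTEGER PRIMARY KEY,     -- C, identifier for Y
        title     TEXT NOT NULL            -- D
    );
    CREATE TABLE enrollment (              -- Z
        enrollment_id INTEGER PRIMARY KEY, -- E, identifier for Z
        student_id INTEGER NOT NULL REFERENCES student(student_id),  -- F
        course_id  INTEGER NOT NULL REFERENCES course(course_id)     -- G
    );
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO course  VALUES (10, 'Logic'), (20, 'Sets');
    INSERT INTO enrollment VALUES (100, 1, 10), (101, 1, 20), (102, 2, 10);
""")

def courses_for(student_id):
    """Titles of the tuples in Y reached from one tuple in X by way of Z."""
    rows = conn.execute(
        "SELECT c.title FROM enrollment AS e "
        "JOIN course AS c ON c.course_id = e.course_id "
        "WHERE e.student_id = ? ORDER BY c.title", (student_id,))
    return [title for (title,) in rows]

def students_in(course_id):
    """Names of the tuples in X reached from one tuple in Y by way of Z."""
    rows = conn.execute(
        "SELECT s.name FROM enrollment AS e "
        "JOIN student AS s ON s.student_id = e.student_id "
        "WHERE e.course_id = ? ORDER BY s.name", (course_id,))
    return [name for (name,) in rows]
```

For example, courses_for(1) returns both courses, students_in(10) returns two students, and courses_for(3) returns an empty list, since a tuple in X can exist without any tuple in Z.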

The two subtypes of the jointype relation involve one or both of the enumtype correlations from Z being converted into a super correlation.

In an Enum JoinType correlation, one of the foreign keys in Z associated with an identifier in X (or Y, but for the purpose of this definition it will suffice to pick only one) is also a local key. This creates a super correlation between X and Z. The correlation between Z and any other relations that are still enum correlations remains unaffected.

The effect of this change between X and Y is that now in Z there is a functional dependency of g on f, and thus there may only be up to one correlated tuple in Y for each tuple in X. Tuples in X and Y may exist independently of any correlation from Z or from each other. The cardinality of the relationship between X and Y becomes zero-or-one-to-zero-or-many.
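A small sketch of the Enum JoinType, assuming SQLite through Python's sqlite3 module. The employee (X), department (Y), and assignment (Z) tables are hypothetical; making F unique in Z is what limits each tuple in X to at most one tuple in Y.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE employee   (employee_id   INTEGER PRIMARY KEY);  -- X
    CREATE TABLE department (department_id INTEGER PRIMARY KEY);  -- Y
    CREATE TABLE assignment (                                     -- Z
        assignment_id INTEGER PRIMARY KEY,
        employee_id   INTEGER NOT NULL UNIQUE    -- F: foreign key and local key
                      REFERENCES employee(employee_id),
        department_id INTEGER NOT NULL           -- G: plain foreign key
                      REFERENCES department(department_id)
    );
    INSERT INTO employee   VALUES (1), (2), (3);
    INSERT INTO department VALUES (10), (20);
    INSERT INTO assignment VALUES (100, 1, 10), (101, 2, 10);
""")

def try_assign(assignment_id, employee_id, department_id):
    """Return True if Z accepts the tuple, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO assignment VALUES (?, ?, ?)",
                     (assignment_id, employee_id, department_id))
        return True
    except sqlite3.IntegrityError:
        return False

second_department_for_1 = try_assign(102, 1, 20)  # employee 1 already has a department
first_department_for_3 = try_assign(103, 3, 20)   # employee 3 had none until now
members_of_10 = conn.execute(
    "SELECT COUNT(*) FROM assignment WHERE department_id = 10").fetchone()[0]
```

Department 10 keeps two employees, which is the zero-or-many side, while the attempt to give employee 1 a second department is refused.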

While it is important to note that the local key in Z on X (or Y) does not have to be the identifier for Z, in most cases it should be. The main consideration is that, as stated above, the identifier for a tuple should never change. If the correlation between X and Y is defined as a trait of the information that X represents, then using F as the identifier for Z is worth considering (or likewise G, if the information Y represents defines the correlation between the two). The distinction is subtle, and best left alone if there is any doubt.

In a Unique JoinType correlation, both of the foreign keys in Z associated with identifiers in X and Y are also local keys in Z. This creates a super correlation between X and Z and between Y and Z.

The effect of this change between X and Y is that now in Z there is a mutual functional dependency between g and f, and thus there may only be up to one correlated tuple in Y for each tuple in X and vice versa. Tuples in X and Y may exist independently of any correlation from Z or from each other. The cardinality of the relationship between X and Y becomes zero-or-one-to-zero-or-one.
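The Unique JoinType can be sketched the same way, assuming SQLite through Python's sqlite3 module. The driver (X), car (Y), and pairing (Z) tables are hypothetical; both foreign keys in Z carry a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE driver (driver_id INTEGER PRIMARY KEY);  -- X
    CREATE TABLE car    (car_id    INTEGER PRIMARY KEY);  -- Y
    CREATE TABLE pairing (                                -- Z
        pairing_id INTEGER PRIMARY KEY,
        driver_id  INTEGER NOT NULL UNIQUE REFERENCES driver(driver_id),  -- F
        car_id     INTEGER NOT NULL UNIQUE REFERENCES car(car_id)         -- G
    );
    INSERT INTO driver VALUES (1), (2), (3);
    INSERT INTO car    VALUES (10), (20);
    INSERT INTO pairing VALUES (100, 1, 10);
""")

def try_pair(pairing_id, driver_id, car_id):
    """Return True if Z accepts the tuple, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO pairing VALUES (?, ?, ?)",
                     (pairing_id, driver_id, car_id))
        return True
    except sqlite3.IntegrityError:
        return False

second_car_for_driver_1 = try_pair(101, 1, 20)  # driver 1 is already paired
second_driver_for_car_10 = try_pair(102, 2, 10) # car 10 is already paired
fresh_pair = try_pair(103, 2, 20)               # neither side was paired yet
```

Driver 3 never appears in Z at all, which is the "zero" case on the X side.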

Closing

While there are other, more complex correlation types (those with finite cardinalities involving numbers greater than one), they are not really important to introduce at this point. The most important correlation types have been covered. In the next post, the topics of Typing and Semantic Meaning will be covered. While this post discussed the Relational Model specifically, what will follow is focused on topics that are more ancillary to the Relational Model. The third post will return to a rapprochement of design considerations such as human needs centered around semantic meaning and robust implementation on sound theory.

Guitar Quote, not "Violin"

The "Violin" quote that I have been searching for turns out to be a Guitar quote, and it was from Hoekman, not Cockburn:

"Playing the guitar badly is very hard. Playing guitar well is very easy." (Designing the Moment, Hoekman, 2008, p224)

Sadly, the "veteran master" that Hoekman attributes this to is unnamed. 

Exploring Vision, Language, and Visual Language

This is clearly not a continuation of the UP4edu post... I am still working on that, bogged down in tangential projects. One such tangent is an exploration of language, medium, and technology choice. It started with a reference to The Alphabet vs. The Goddess in no fewer than two Agile books. I want to say one was in a Cockburn book and the other was Wicked Problems Righteous Solutions, but sadly I neglected to note either in my Tomboy notes.

This resonated with some of the left-brain-right-brain reading I had been doing in another peripheral scan. There are clear connections between these topics and: learning styles, accessibility, inversion of control, and information mapping. I am hoping to do a satisfactory, though admittedly never complete, sweep of this whole tangent soon so I can get back to UP4edu. A light and sufficient exploration if you will.

I propose that the predecessor to rich media was conceived in the moment the first illuminated document was drafted. This likely predated the Bible. Any association of pure text and pure imagery will do. There are three distinct combinations of "pure text" and "pure imagery": iconographs, illuminated or illustrated text, and assemblage. Ranked in that order, any combinations of these three would likely fall under the guise of the highest ranked component. Iconographs are technically a form of pure text and barely count, so primarily I am concerned with the latter two.

Iconographs are very close to my heart, but their role in rich media is only cursorily different from that of elements of pure text or pure imagery. Their unique status of being both doesn't lend anything profound, as all pure text has artistic elements and all pure images have communicative value.

Love and Joy About Letters (Shahn) is a delightful exploration of the interplay between illustration and assemblage. There is in fact a spectrum between these two facets of textual images. A spectrum exists between each and iconographs, but it is a much less profound spectrum.

Images and text both invite interpretation, but in different ways. Images overpower text, and yet humans gravitate to the safety of words. Images are volatile. Words seem safe, and they can certainly be crafted with great precision to be clear in intent. In the end, rich media brings us back to the realization that these two very different forms of communication have weaknesses.

Text is merely mundane iconography for an intangible, mutable, and universal unit. Text can be translated. The medium is weak and arbitrary, but there is an underlying reference to pure and raw information.

Images can only be experienced through the visual sense. Translations of images into text or any other experienceable sense are paltry in comparison. Here the experience is pure and raw, and the medium is of the essence.

Shahn's journey is religious and worldly, a man's exploration of the things that have happened and are happening to him as seen through the lens of one that believes in the divine. There is innate integration of spirituality and the world, just as there is innate integration of image and word.

Shlain, in The Alphabet Versus the Goddess, asserts quite strongly the effect of language and language medium on our sense of religion. This hints at a pervasive inversion of control, that is to say a foundational technology being an influencing force in the direction of evolution of nearly all societies. Both spoken language and written language would have a hand in the process. Art too would be implicated. The cyclical effect visual art and literature have played is rarely questioned in modern dialogs on the matter. The driving undercurrent of the written and verbal forms is however a topic that is much more commonly held below the socially conscious surface.

Admittedly, I am only halfway through The Alphabet Versus the Goddess. It seems herculean in effort to work through, much like Cosmos (Sagan).

Illuminated texts present a problem for Shlain's portrait of the early Jewish divine. To invoke a modern meme, clearly at some point someone was "doing it wrong" and they have compounded that error a thousandfold since. If the Ten Commandments really do counterindicate images in the way Shlain suggests, illuminated texts, and the attempt to visualize the divine, are entirely misguided.

One need only look to inspirational pieces such as The Saint John's Bible (Sink, The Art of The Saint John's Bible) to see this dilemma. It mixes divine word with an extra layer of interpretation. Only time will tell if this is a true modern rapprochement, or a reincarnation of Babel.

Gender is much ado in the first half of Shlain's book. In a more modern context it no longer seems relevant what gender God may be, if it is even appropriate to assign gender to the divine. The latter has been my assertion for the entirety of my adult life. Such would be the ideal, assuming the foundations of our language are not somehow subversively continuing to deter movement towards a society that embraces diversity and equality.

It seems comical to many (Introducing the Book) to consider books as a technology, much less the letters of our alphabet. Color television is more obviously a technology to some than color paints.

The elements of visual communication (cobbled together from personal experience; Visual Language, Bonnici; and A Visual Language, Cohen & Anderson) include color, line, direction, shape, boundary, contrast, texture, proportion, relative placement, and complexity, to name just a few broad categories. All of them affect the cycle of message, expression, and interpretation. The mindful application of these aspects is a technology.

Single images, simple or complex, much like single letters, words, and sentences convey ideas and impressions. It may be a complex idea, but rarely will a single image convey narrative consistently. A single image may induce narrative through recollection or imagination, but this relies on contextual interplay in the viewer. Sense of direction and relative placement can create narrative, but more often the end result in cognition is more of a compound image than a single image.

Sequential images (Sequential Images, Wigan) set the stage for narrative communication through image. A composite image, as differentiated from a "single" image, may have the capacity to express narrative if the order of sequence is clear. Ambiguous order can create narrative, but in an unpredictable way. Inevitable order creates more cohesion. Optional images, and interactivity as a driving force in image sequence and content, create entirely different forms of narrative.

Words too can create narrative, seemingly much more effortlessly for the person conveying information. In contrast, images are typically more effortless for the person interpreting information. This is perhaps one reason combinations of the two are increasingly popular: they strike a balance between cost to produce and cost to consume. Broadening the audience is another likely reason. Lexical differentiation is possible and powerful, but visual differentiation is more immediately striking. A twofold attack creates powerful rhetorical reinforcement.

The consumption of images is an "all at once" practice. Unlike reading, where the primary mode is linear scanning, taking in an image starts broad. Color and boundary perception are strong suits for instant recognition, and they are a large factor in animation and in creating data visualizations. Shape too can be instantly recognizable, but it plays second string to color and is quickly trumped in visual processing priorities. Shape, in the ways it differs from boundary (the two are inextricably linked), seems to require more coordination between the two sides of the brain. Considerations along these lines are important when sequential images come into play. The speed at which these images can be processed for picking out spatial arrangements of color and shape simultaneously tends to favor color.

It would seem desirable to select language technology based on some ideal set of criteria. These might include: easy to learn, fast to read, and highly memorable. Other considerations are also important. The level of expression and creative choice could easily have an impact on the culture built upon the language. Adaptability, parity, and compact representation are all factors in how effectively and efficiently language can be used. Written language is clearly visual, but how well it uses the visual senses can be brought to question.

One could argue that the selection of language form is driven by human tendency, just as one might assume that technology adoption is driven intentionally. Assuming a natural selection of such things might seem reasonable, but the means of selection in this case are the very things that have advanced societies through the ages: violence, material greed, and stratification. Clearly, these are not the lofty ideals upon which I would like to see society, and subsequently technology, built.

If we don't understand the true reason one technology has won out over another, assuming a natural selection is folly. Moreover, all things being equal, natural selection does not favor the best overall or long-term solution; it favors the immediate best.

Human nature has great potential to both correct error against overwhelming odds and compound error despite clear consequence. One can only hope that if there is a surreptitious effect in our choice of language, it can be mitigated.

Language bias, an unnatural allegiance to language, can be observed in individuals. Possibly innocuous in most cases, it can in others be as dangerous as any form of discrimination. Admittedly, we are very much indebted to language for societal progress. When this translates into a conscious or subconscious valuation of a particular language, however, there is a problem.

I first became aware of the existence of this when discussing the speech challenges a friend's son was having. He was perfectly bright, but verbally behind what might be considered normal progress. I inquired if they had looked into exploring sign language as an auxiliary means of communication. The child had no diagnosable hearing challenges, but had a mild-to-severe speech impediment. My friend was optimistic about this possibility until her fiancé chimed in that he had heard that going this route would be problematic. Reliance on sign language could become a "crutch" that the child would too readily fall back onto rather than "pushing through" with progress on speech.

This alone was not too alarming, but subsequent conversations with other people on the popular subject of immigrants and language resonated. It further became clear what the issue revolved around when I read about Deaf culture and the political issues around hearing impairments and medical devices (such as cochlear implants) that supposedly enable people to live "more normal lives". Friends who I would have never in a million years thought of as discriminatory didn't seem to understand that someone might legitimately desire to not have the implants. They categorically saw parents refusing the highly invasive surgery on their infants as a form of neglect.

Clearly, this is a case where in the minds of some, proficiency with a specific language and being "like other people" with the same basic capabilities is equated with judgment of human value. There is a baseline definition of what is important to be considered human. Hearing is clearly on some people's list, even if its inclusion isn't conscious. It was especially interesting where some individuals did not hold consistent views: they felt forcing immigrants to know English was ok, but forcing deaf individuals to accept "hearing values" was not, or vice versa. It is similarly interesting that parents of Autistic children seem much more willing to embrace modification to language than hearing parents of deaf children.

While I will go so far as to say that this can be attributed to the adage: "Best is the enemy of the better", that might be interpreted by some as heresy. Parents that are given reason to believe there may be a "cure" are likely to fixate, reasonably if not unrealistically so, on the proposition without any real consideration for the genuine nature of such a cure. Some doctors and companies hype up the level to which cochlear implants can "improve" the life of a child born with an inability or severely reduced capacity to hear sound.

Given the results I have been exposed to, admittedly indirectly, there is no reason to believe that cochlear implants are close to being a real cure for the majority of people with hearing loss. Deaf individuals I have talked to mostly downplay or reject the value of even the more modern devices. Everything I have read sends similarly mostly negative signals. It remains an excellent option for some, primarily individuals already accustomed to hearing.

Diversity issues aside, it seems medically questionable to force the procedure on any individual. Considering diversity, the notion that individuals should be made or expected to conform is deplorable. This polar view certainly does not take into consideration the full range of complexities in the decision making process any parent must undertake in selecting medical treatment for their child. For now it is all I can hope for that parents consider more than a shallow sense of "normalcy" as an appropriate deciding factor.

One place where adjusting the language rather than the individual seems to be more readily accepted is in application to children with challenges such as Autism (Visual Language in Autism, Shane & Weiss-Kapp). Gestures and images work well for most; text works for a few. Because Autism falls along a spectrum with many individual factors involved, there isn't a single "right" approach, only approaches that work, or work better.

The expressiveness of such languages is functionally limited. Symbol order follows formulas with exceedingly simple syntax. One concern with such watered down language is that it might "blunt" the development of speech. Clinical experience applying these adaptive languages indicates the opposite, that they help bridge difficult conceptual gaps resulting in better acclimation to other peripheral language skills rather than being a "crutch". The approach proposed in Visual Language in Autism embraces three ways to deal with barriers, each when it is most effective: Fit-it, Compensate, Bypass.

The concept of shibboleth, the use of language to distinguish between insiders and outsiders or imposters, like written language, predates the birth of Christ. It is a form of discrimination based on language, a way to judge. While shibboleth has legitimate defensive purposes, like any other technology it has a wealth of abuses. Language is naturally barriered: hearing, vision, light, concentration, a noise-free environment, and literacy are all examples of things without which one may run into communication challenges.

Certain combinations of technology lead to erosion of these barriers. Candles allowed early man to read in the dark; focal lenses allow some with visual challenges to see what would otherwise be blurry. Electronic text is perhaps the closest we have come to a communication panacea to date. Electronic text can be automatically translated between languages with varying levels of reliability that improve each year. It also can be read out loud, bridging the visual barrier altogether. Electronic text can be printed to physical pages and braille. It can stream across the screen, helping to compensate for hearing barriers.

It is in this one sense that perhaps luckily, alphabetical text seems to have been the right long term choice. Optical character recognition for converting printed text into electronic is by no means trivial, but it at least seems much simpler than the process for programming a computer to parse a more visually rich and directionally diverse written language.

Nonetheless, romantic ideas of visual languages are fun to explore for some. I have long been a fan of The Elephant's Memory and other similarly visually engaging approaches. Sense of shape and direction are very stagnant in our modern language. Even ideogram languages, such as Japanese and Chinese, have a fairly stagnant sense of visual space. Our words have become reliant on images to make up for the things they cannot express as elegantly, though infinitely more efficiently. This brings a third "E" criterion to my standard list: Effective, Efficient, and Elegant. The three are interrelated, but distinct.

Philosophy on IT Accessibility

I was recently asked about my "philosophy for providing a campus IT infrastructure that is accessible". Accessibility is a topic that I feel very passionately about. It is one of the great opportunities of our time, to improve our world for everyone by making it equitable for as many people as possible. My long-standing belief is that the principles applied to design with accessibility in mind do not benefit only a few. They have the effect of both normalizing usability and improving it.

Much to my chagrin, the timing of my response was rushed and reading back over it I was not entirely happy with it. I wanted to share a revised version with some effort to further the clarity and flow of my vision:

Statement of Commitment to Accessibility

Providing equal access to academic and administrative information is a challenge I encountered early in my career as an IT professional. The first accessibility solutions I implemented for the College of Engineering during the late 90's as a student employee were unwieldy to maintain. We were working with static content, a poor understanding of accessibility requirements, and conflicting needs. The pages had to “look good”, but this was difficult if not impossible to accomplish in the years before cross browser CSS support.

The solution at the time, given all of the forces in play, was to create separate plain text versions for the more graphically oriented pages. As one might expect, the alternate versions were haphazardly updated and linked in. The pages that required no alternate version were basically accessible, but the level of usability was rarely comparable for users reliant on assistive technology. The lesson I took away at the time was that accessibility was hard. While I appreciated the ethical imperative, it felt like a monumental effort for the limited benefit we were able to render.

In the years that followed, the technological and social landscape changed drastically. What I grew to realize through success and failure is that accessibility itself had never been the hard part. It was other, now questionably valuable, requirements that made accessible web site design difficult. Early approaches had the wrong foundation, both in application of technology and the prioritization of values.

Addressing accessibility proficiently while staying on the cutting edge of technology requires specialized knowledge and highly motivated exploration. It also requires a commitment to placing organizational values before affinity to specific technologies. In more broadly understood areas, accessible design can be distilled down to reproducible techniques which are more easily communicated. Our community has done well in adapting to the challenge of providing more equal access through effective outreach on the part of specialized, motivated, and engaging individuals. With the right level of engagement, those willing to learn a method that works are then enabled to apply it creatively. As experienced creative individuals become motivated, specialized exploration leads to innovation.

The biggest challenge I have observed with the active pursuit of innovation is that it is expensive and harbors risk. The practice of gauging the scale of innovation is a particularly useful concept advocated by several authors of Agile development process books. This is a practice interface designer and author Robert Hoekman (Designing the Obvious, 2007) calls “elevation”. The goal is to innovate in modest amounts and on a sound foundation to manage the level of risk while still creating competitive advantage.

In my transition to the NCSU Libraries I found a community that was achieving accessibility in a broader spectrum than my previous experience with accessible, primarily static web sites. I had basic exposure to various assistive technologies in the College of Engineering, but the point of service in the Library was much more direct, and diverse. The realization induced by this new proximity is the potential of assistive technologies ranging from broad application to individual application. There are a wide variety of accessibility needs and an appropriate, sometimes highly specialized, technology for each. Just as making early web projects accessible was a challenge, so too were pioneering assistive technologies for each unique need.

Accessibility techniques for the web have the luxury of enhancing the access for a broad audience. In contrast, there are many other assistive technologies each more valuable in day to day application to specific individuals. As web based technology becomes more sophisticated and the physical limitations to computing make it more omnipresent, there is a growing potential to provide high quality, equal access, at significantly lower cost.

The web is potentially the most cost effective medium for barrier-free access. Pages that are accessible serve a broad range of visitors: people with a myriad of browsers, display devices, means of interacting, and modes of perception. This high capacity for diversity is further enhanced by the rapid adoption of the web for social activity and communication. Other assistive technologies are still critical for thorough coverage, but the proliferation of web activity creates an opportunity to serve a wide variety of needs through a single medium. This enables higher quality while reducing the need for separate accommodation and thus reducing cost.

My philosophy for providing a campus IT infrastructure that is accessible hinges on providing services that increase equitable access while reducing the need to handle special needs as a costly exception. As much as possible, I try to treat accessibility as a special case of usability. Universal design considerations extend the potential for benefit from accessibility improvements to everyone. This is observable in real life. Well designed and convenient accessible ramps often receive high levels of traffic from individuals otherwise capable of using stairs. When the accessible ramps are long and awkwardly placed as an afterthought, this synergy is much less common.

There are cases where rapprochement between technology innovation and accessibility consideration is difficult, primarily because it involves fresh or particularly complex challenges. These are areas where the cost of innovation has to be carefully weighed. Every effort made to bridge usability and accessibility, the pursuit of more universal design, reduces the complexity and the cost. Picking the most ripe challenges and delivering value regularly builds momentum and community support towards this goal.

Back to the Temporal Records (in Relational Databases)

I really don't like to leave things stewing in my mind for too long, so I've been trying to resolve some of the issues raised in my previous blog post. The primary issue resides in trying to model data generally.

Some tables in the database for Journey will have time values included: either zero or more single time values and zero or more pairs. Using a database as a repository of facts, the default assumption is that if a value is currently stored it is true now but no assertion is made as to whether that value was true in the past, will be true in the future, or is true over any interval of time other than this very instant.

The issue arises that we need to store historical data, so we can differentiate what is true and what was true. We also need to be able to enter data about what will be true in the future. The presence of temporal data makes the question of "when" clear for data that it is associated with, but muddies the water for any data that lacks the temporal quality.

One thing that we do keep is an audit of changes. The history of values for all fields is kept (including who made the change). This history may be truncated as space necessitates, but it is one way to track previous values. One application for this is user directed "undo". This is not related to atomic transactions, the database technology handles those. The issue with the audit table is it makes no distinction between two very different types of changes to a given record: correction of an error vs. updating facts.

The book I mentioned in my previous post, Temporal Data and the Relational Model (Date, 2003), covers temporal data as it relates to updating facts and keeping track of past and future facts. The assumption here too is that correction of error is as simple as updating the record. The relational model does not intrinsically handle the case of tracking all previous "incorrect" values, though it can certainly support a problem domain where that is needed. The approach for "true" data over time in the book is thorough.

There are several temporal properties about data that we care about for Journey, these have been somewhat revised since the last post:

  • Current - The record represents a truth that is currently applicable. This condition is mutually exclusive with Past and Future. It is implied by Past Pending.
  • Past Pending - The record represents a truth that is current and there is a known ending time.
  • Past - The record represents something that was true at some point in the past, but is no longer true. Mutually exclusive with Current and Future.
  • Future - The record represents something that will be true at some point in the future, but is not yet true. Mutually exclusive with Past and Current.
  • Start Indefinite - The record has no start date.
  • End Indefinite - The record has no end date.
  • Ubiquitous - The record is always true.
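One possible reading of these properties as code is sketched below. The original truth tables are not reproduced here, so several things are assumptions: the sentinel constants standing in for an indefinite start or end, the half-open span (start inclusive, end exclusive), the reading of Ubiquitous as "both ends indefinite", and the function name itself.

```python
from datetime import datetime

# Assumed stand-ins for "indefinite": the earliest and latest representable times.
BEGINNING_OF_TIME = datetime.min
END_OF_TIME = datetime.max

def temporal_properties(start, end, now):
    """Return the names of the properties that hold for [start, end) at `now`."""
    props = set()
    if start == BEGINNING_OF_TIME:
        props.add("Start Indefinite")
    if end == END_OF_TIME:
        props.add("End Indefinite")

    if now < start:
        props.add("Future")          # not yet true
    elif now >= end:
        props.add("Past")            # no longer true
    else:
        props.add("Current")         # true at this instant
        if end != END_OF_TIME:
            props.add("Past Pending")  # current, with a known ending time
        if start == BEGINNING_OF_TIME and end == END_OF_TIME:
            props.add("Ubiquitous")    # assumed: always true
    return props
```

Past, Current, and Future come from a single if/elif/else chain, so they stay mutually exclusive, while Past Pending and Ubiquitous are only ever added alongside Current.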

To better answer the question of "how to differentiate between records with no temporal data and those with no start" I decided to redefine the use of "no temporal data" to be more in line with the default assumption in relational databases: currently true. More specifically, if no temporal data is associated with a record, it is assumed currently true, recognizing that it could change at any time. In contrast, there is a specific case where data associated with both a start and end date can be considered ubiquitously true, though it may change in the sense that it can of course be corrected.

The limitation of this method in expressing time is that it is not directly possible to hold to the tenet that Nulls should not be stored while also allowing the database to express that a record exists without being true at any particular time. This limitation is not a problem in Journey, and such cases can generally be modeled indirectly in other ways.

The truth tables above include options for Null (n) in both the start and end time value. This does not indicate a stored Null, but rather the absence of a column for that purpose. Null is treated differently than "Unknown" (u) in that Unknown specifically means "always" in the past or future, i.e. until the beginning of time for start fields and until the end of time for end fields.

Unillustrated in the first post, the table on the right shows values returned when comparing two times. Null (n), Unknown (u), Past (p), Current (c), Future (f).

The chart below illustrates the results comparing a single point in time (presumably the current time) to a range. The results column "u" in the chart below is "Ubiquitous", not "Unknown".


The semantics have changed slightly. Start Indefinite (previously Past Indefinite), and End Indefinite (previously Future Indefinite) always indicate a null or unknown value in their respective columns. Like Past Pending, Ubiquitous is a special case of Current.

The implication of this design is that not specifying a start date results in storing the fact as having always been true at least to the present; similarly, not specifying an end date implies that it will continue to be true in perpetuity. If a record is to be limited in time, either the start or end date, if not both, must be set when the record is created to avoid creating misinformation. This can be best avoided by not allowing a default value when the intention is to be restrictive temporally. Conversely, using the default value for both start and end date results in a record that is explicitly current and semantically identical to a record from a table that keeps no temporal data.
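A minimal sketch of that rule, in Python purely for illustration. The function name is_current is hypothetical, and using None as an in-memory stand-in for "the column was left at its default" is an assumption of this sketch, not how Journey stores anything:

```python
from datetime import date
from typing import Optional

def is_current(start: Optional[date], end: Optional[date], today: date) -> bool:
    """Return True when the record is considered true on the given day.

    None is a stand-in for a start or end left at its default (assumption):
    no start means "always true up to now", no end means "true in perpetuity".
    """
    started = start is None or start <= today
    not_ended = end is None or today <= end
    return started and not_ended
```

With both values left at their default, the record is always current, which matches the behavior of a table that keeps no temporal data at all.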

This helps clarify some of the issues that were left hanging from my last post. I feel a lot more comfortable with this redefinition of the application of time (or lack thereof) to database relations.


Temporal Records in Relational Databases

(I am still working on the second part of the UP4edu article, some reading has necessitated the extended pause.) 

One of my better (or worse in some people's opinion) properties as a programmer is my fairly idealistic view. I've become more familiar with the relational model over the past few years and am very attracted to the purist approach: high normalization. I also feel strongly about avoiding the abuse of Null values, especially where relational columns are concerned. These are of course ideals, and not always practical given time constraints and complexity. I do however feel that there is a special niche for relational database technology. It can't really be treated just like a trivial storage medium, and it also isn't a simple extension of programming activities. I feel that the best database approach is one that is clean (normalized) and simple in terms of architecture (harnesses patterns, but may have many many tables). I also feel that abstraction of database specifics from procedural code is valuable. This ensures that solutions are properly relational in the database, but properly modeled from a procedural point of view.

A book I read last year, Temporal Data and the Relational Model (Date, 2003), was very thorough, but didn't really solve any of the inherent issues in modeling data over time. It proposed solutions, but nothing profound. The primarily useful concepts this book covered were in issues of ambiguity.

Journey is the code name for an application I am building for work. The application doesn't have a real product name yet. Issues of time are central to a lot of the things managed in this application. I have struggled a lot to model time appropriately and I think I have finally settled on a model that will fit the needs.

First and foremost, Journey aims to be database technology agnostic. It does have a certain set of minimum relational feature requirements, but the goal is to avoid anything on the database end that is non-standard and not widely supported. In the long term, Journey is agnostic to the format time is stored in. There are some inherent dependencies in the early code base between Journey and a MySQL implementation, but these will be abstracted. Journey supports both date (Year, Month, Day) and date-time (date + Hour, Minute, Second) formats.

In the case of each, the entire unit of time is considered a block of time starting with the instant that block starts and ending just before the next sequential unit begins (start inclusive, tail exclusive). This avoids one of the primary concerns with ambiguity in time values. Sequential time ranges should not overlap.
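A minimal sketch of the start-inclusive, tail-exclusive convention, in Python purely for illustration. The helper names day_block and contains are hypothetical and are not part of Journey:

```python
from datetime import date, datetime, timedelta
from typing import Tuple

Block = Tuple[datetime, datetime]

def day_block(d: date) -> Block:
    """Return the block of time covered by a calendar date as (start, end),
    where start is included and end (the next midnight) is excluded."""
    start = datetime(d.year, d.month, d.day)
    return start, start + timedelta(days=1)

def contains(block: Block, instant: datetime) -> bool:
    """True when the instant falls inside the block: start inclusive, end exclusive."""
    start, end = block
    return start <= instant < end

monday = day_block(date(2009, 3, 2))
tuesday = day_block(date(2009, 3, 3))
midnight = datetime(2009, 3, 3, 0, 0, 0)
```

Here the midnight instant belongs to tuesday only, so the two sequential blocks touch without overlapping.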

Records in a Journey database table that have only one time field are considered instantaneous over the range of time (a day or a second) the column represents. For columns that represent the start of something, the start is the very beginning of that range. For columns that represent the end of something, the end is at the very last instant of that range. Columns of this type may also indicate something happening at any non-specific instant between the start and end of the range. If a table tracks when files were deleted for example, and only tracks dates (not full time in seconds) then the instant is sometime on that date, but not necessarily the first or last instant. This is subtly different from the use case where a table tracks when files will be deleted. For this reason, all time columns in Journey are considered either informative, normative starting, or normative ending. Informative means "any time in this range", normative starting means "at the first instant of this range", and normative ending means "at the last instant of this range". If a prescriptive delete date is normative starting on a date, it should happen at the very beginning of that day. If a prescriptive delete date is normative ending on a date, it should happen at the very end of that day.
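The three kinds of time column could be sketched roughly as follows. This is Python for illustration only; the names ColumnKind and resolve are hypothetical, and treating the "last instant" of a day as 23:59:59 assumes one-second resolution:

```python
from datetime import date, datetime, timedelta
from enum import Enum
from typing import Tuple

class ColumnKind(Enum):
    INFORMATIVE = "any time in this range"
    NORMATIVE_STARTING = "at the first instant of this range"
    NORMATIVE_ENDING = "at the last instant of this range"

def resolve(d: date, kind: ColumnKind) -> Tuple[datetime, datetime]:
    """Return the (earliest, latest) instant a date-only column value can mean."""
    first = datetime(d.year, d.month, d.day)
    # Assumption: one-second resolution, so the last instant is 23:59:59.
    last = first + timedelta(days=1) - timedelta(seconds=1)
    if kind is ColumnKind.NORMATIVE_STARTING:
        return first, first
    if kind is ColumnKind.NORMATIVE_ENDING:
        return last, last
    return first, last
```

An informative column leaves the whole day open, while the two normative kinds pin the value to a single instant at one edge of the day.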

In addition to tables that contain a single time column, tables may contain a normative starting and normative ending pair. Tables in general may have any number of informative, normative starting and normative ending time columns.  However, the only allowed functional pair is a normative starting and normative ending. For modeling purposes, Journey considers this pair to be a single column, though in many relational implementations they may be two. It matters little (in theory) if the underlying storage is an actual time value pair, or a start time and a number of units the range spans.
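To illustrate why the two storage forms are interchangeable in theory, here is a rough Python sketch that converts between them. The function names are hypothetical, and the sketch assumes whole-day units with the end date being the last day included in the range:

```python
from datetime import date, timedelta
from typing import Tuple

def span_to_pair(start: date, units: int) -> Tuple[date, date]:
    """Convert (start, number of days) into a (start, end) pair of dates."""
    if units < 1:
        raise ValueError("a range must span at least one unit")
    return start, start + timedelta(days=units - 1)

def pair_to_span(start: date, end: date) -> Tuple[date, int]:
    """Convert a (start, end) pair of dates into (start, number of days)."""
    if end < start:
        raise ValueError("the end must not come before the start")
    return start, (end - start).days + 1
```

One difference worth noting: the span form cannot express an end before its start at all, while the pair form has to check for it.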

At any given time, a value in a time column will have one of four variable values in relation to any other time value: Null, Past, Current, Future. As I have stated before, I dislike abusing Null values. To be clear, Null is a valid non-value. Some idealists, however, suggest that Null values should not be stored in a database, and will go to great lengths to prove why it is unnecessary to do so. I consider myself to be in a class of people that agree with this, but have to work with people that do not see the big deal.

Journey does not inherently allow the storing of Null. Insofar as it is concerned with the installation of the databases it uses, Journey specifies when possible that no column may be Null provided the database technology supports a reasonable method for default values. Interactions between specific business logic requirements and database technology limitations may necessitate the use of Null in the absence of a reasonable default value, but this is considered an intermittent last resort. I reject the notion that default values should be allowed to violate column constraints.

That having been said, it is perfectly valid for operations to return a Null value. The operation of comparing two dates returns Null in the event either of the two operands is Null or in some way invalid (assuming the operation does not throw an exception). In MySQL, the use of 0 dates and 0 times (0000-00-00) is an example of an invalid date that is valid in column constraints. All time operations involving  a 0000-00-00 will return Null.
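The comparison of two time values, including the Null result, might be sketched like this in Python. The function name compare is hypothetical. Python's date type cannot represent 0000-00-00, so this sketch assumes an invalid or missing operand has already been mapped to None:

```python
from datetime import date
from typing import Optional

def compare(first: Optional[date], second: Optional[date]) -> Optional[str]:
    """Describe the first value relative to the second.

    Returns "past", "current", or "future", or None (standing in for Null)
    when either operand is missing or invalid.
    """
    if first is None or second is None:
        return None
    if first < second:
        return "past"
    if first > second:
        return "future"
    return "current"
```

The Null here is only ever a return value of the operation; nothing in the sketch requires a Null to be stored.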

Journey primarily cares about three qualitative property values when comparing two time values: is the first before (past), during (current), or after (future) the second. There is an interesting consequence to the way Journey handles time ranges. My preference was to have Journey store only the starting time unit and an integer value of one or greater to indicate the number of units the range encompasses. This turned out to be imprudent with the MySQL implementation and the more database technology agnostic features have been side-tabled in the interest of progress until more development time is available. In an ideal world, it would be irrelevant how the database stores the data. In the meanwhile, it is stored as two time values.

The only issue with storing a range as two time values is the lack of constraint of one column's value on another. The second date in the range should never come before the first. The "invalid" values outlined below are a result of such a possibility and are not really part of any value that would be stored. This possibility is listed for completeness only.

Comparing a time range to the current time results in a few more interesting quality values than the four mentioned above for comparing two time values. These possibilities form the basis for queries involving time ranges in Journey. The chart on the right below illustrates the logical reduction of these return values, and their explanation is on the left. The possibilities are not all mutually exclusive, so a method that returns all relevant values would be multiplexed. Alternatively, a set of operators could return true or false for each possible quality.

When start and end time are Null or simply each individually invalid, the result is "Unknown" (u). This means that the record has no known associated start or end date.

If there is an end time, and it is in the Past, the result is "Past". This means that the start time is implicitly in the past, even when invalid. If the start time is explicitly Current or Future then the result would be "Invalid", though this value should not be allowed to store.

If there is any non-null end time and a null start time, the result is "Past Indefinite". This means that the start time is unknown, but the end time is known. Similarly, if there is a non-null start time and a null end time, the result is "Future Indefinite", which indicates there is a known start time, but no known end time.

If there is a current or future end time (and a start time not after the end time which would be "Invalid"), then the result is "Past Pending". This means that there is a known end time, but it has not happened yet. The reverse of this would be "Future Pending", but as it turns out there is no valid set of inputs where "Future Pending" is different than "Future" and invalid inputs are irrelevant.

If either the start time or end time is current, or the start time is null or past and the end time is future (again assuming the start time is not after the end time), then the result is "Current". This indicates the current time is within the time range.

Since "Future Pending" and "Future" are logically the same, the former is folded into the latter. If the start time is future and the time range is not invalid, the result is "Future".
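The rules above can be read as a "multiplexed" classifier that returns every quality that applies. What follows is a rough Python sketch of a literal reading of the prose only, not of the chart, and not Journey's actual code. The names relate and classify are hypothetical, and None stands in for a null or invalid time:

```python
from datetime import date
from typing import Optional, Set

def relate(value: Optional[date], now: date) -> Optional[str]:
    """Compare one time value to now: "past", "current", "future", or None."""
    if value is None:
        return None
    if value < now:
        return "past"
    if value > now:
        return "future"
    return "current"

def classify(start: Optional[date], end: Optional[date], now: date) -> Set[str]:
    """Return every quality that applies to the range [start, end] at now."""
    if start is None and end is None:
        return {"Unknown"}
    if start is not None and end is not None and start > end:
        return {"Invalid"}  # the end must never come before the start
    s, e = relate(start, now), relate(end, now)
    qualities = set()
    if start is None:
        qualities.add("Past Indefinite")    # known end, unknown start
    if end is None:
        qualities.add("Future Indefinite")  # known start, unknown end
    if e == "past":
        qualities.add("Past")
    if e in ("current", "future"):
        qualities.add("Past Pending")       # known end that has not passed yet
    if s == "current" or e == "current" or (s in (None, "past") and e == "future"):
        qualities.add("Current")
    if s == "future":
        qualities.add("Future")
    return qualities
```

A set of true/false operators, the alternative mentioned above, would amount to testing membership in the returned set.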

An interesting problem is that it is unclear how to model events that are temporally ubiquitous, since defining no start time and no end time is ambiguous. It could mean all times, or no time. If the former is the case, then "Unknown" would be synonymous with "Current". How this case will be modeled in Journey is still under consideration. Ideally, records that have an integrated time range by virtue of table definition including out-going foreign relations would only exist if some portion of their temporal nature is known, so either start or end time if not both should be required valid (i.e. no 0000-00-00). For records that receive a foreign relation from another table to attribute temporal data, the presence of a related time range with the Unknown property would indicate all times while the absence of any related time range would indicate no times. The problem with this is it means Null and Unknown would be distinct, and that Unknown should only result when one (if not both) of the operands is invalid, but not Null.


UP4edu, Part 1

Another Agile methodology I really like is the Unified Process family, especially Enterprise Unified Process and Open Unified Process. It is especially flexible, and while I really do like the broad application base of Agile methodologies, there is one significant problem I encounter time and time again with UP in general.

All of the modern Agile literature I have read is written around the assumption of software production in a business model not completely compatible with the way I work. UP is particularly, though unnecessarily, grounded in the values of big business. My job is IT support and software development for an educational institution, and the factors driving development are significantly different in subtle ways.

The types of projects I work on conflict with Unified Process approaches in several ways. Inception can't always be a drawn out process. Despite the claims that OpenUP is appropriate for "small teams" it is unwieldy compared to Scrum. Part of the issue lies in the particular way UP nests iterations. OpenUP has a certain prescription with certain minimums that lack flexibility.

A major qualm I have with Agile processes in general is the obsessive fixation on iteration. Must one do everything repeatedly when a single time is truly sufficient? This is akin to XP's prescription of always only programming in pairs. The practice of Pair Programming is often useful, but not ideal. It is a good idea to double-check one's work, and for important documents making multiple drafts is prudent. Some tasks however are casual and mundane enough that no second thought is required. There is a fine line between the notion of "trivial work" and falling into the Waterfall trap. One main flaw in the Waterfall Method is that it relies on creating a chain of contingencies. The point of Iteration is to encapsulate the activities of each such that one Iteration does not hinge on another in ways that are hard to orchestrate, or inflexible to deviation.

Defining a Project, Iteration One

One important, though often under emphasized, concept from Unified Process is the project configuration. Beyond the way UP defines "configuration" there is a more broad application for the same idea. UP4edu (1.0) is a first attempt to generalize in a meaningful way the UP process to Educational Institution Software Projects, without unnecessarily limiting application to institutions that are educational in nature or projects that develop software.

Thinking metaphorically about how my work environment is defined,  consider a set of Objects modeling a Solution Development problem domain:

A Project is a single Product over the course of one or more Deliverables (a deliverable Iteration of the Product). The project has one or more Customers. These customers are Agents that have Needs. Agents may be People or Systems. The success of the project is measured in how well the product meets the needs of the customers. The needs change over time, and to keep pace the product changes over time as well.

An iteration is a period of time where the product of a project is changed to better meet needs. Iterations may be deliverable, or not, variably.  Any iteration may be made deliverable if the needs that it satisfies warrant the cost of delivery. Similarly any iteration's delivery may be canceled in the face of changing needs. A project may be comprised of only deliverable iterations, or many undelivered iterations leading to a single deliverable, or any combinations of undelivered and deliverable Iterations. Every iteration results in a different Version of the product, deliverable or otherwise.

A new project addresses an initial set of needs, each new iteration addresses new or changing needs. The project has a Vision to make the overall goal across iterations clear. This vision may evolve, but is generally fairly static and long term.
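As a rough sketch, the objects described so far could be written down as plain Python data classes. The class and field names follow the capitalized terms above, but the exact shape (and the deliverables helper) is my own assumption for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Need:
    description: str

@dataclass
class Agent:
    """A customer or other party; agents may be people or systems."""
    name: str
    is_person: bool
    needs: List[Need] = field(default_factory=list)

@dataclass
class Iteration:
    """A period of change that always yields a new version of the product."""
    version: str
    deliverable: bool = False

@dataclass
class Project:
    product: str
    vision: str
    customers: List[Agent]
    iterations: List[Iteration] = field(default_factory=list)

    def deliverables(self) -> List[Iteration]:
        """Return only the iterations that are marked deliverable."""
        return [it for it in self.iterations if it.deliverable]
```

Because deliverable is just a flag on an iteration, marking or canceling a delivery does not change the iteration's version.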

A project is managed by exactly one Project Lead, a Manager. It is also managed by exactly one Iteration Lead, a Developer. Projects have one or more developers with time to devote to making the needed changes to the product each Iteration. In the most trivial case, the project lead and the iteration lead may be the same person, and they may be the only developer on the project.

The project manager's responsibility is to mitigate Internal and External Risk that may affect the iteration's successful completion (not the level of success in implementing effective solutions). The project manager also formulates and evolves the vision when needed through interactions with customers. This process happens across iterations, and is viewed at a level of detail called the Product Life Cycle. The customer is primarily aware of activities at this scope.

The iteration manager's responsibility is to ensure that the current iteration addresses as many of the highest potential needs possible while staying on schedule. Efforts on a project in an iteration are limited by time constraints on the developers. The resulting new product at the end of the iteration will implement new Solutions. Depending on the project Configuration, some solutions may be optional to complete the iteration, or some solutions may have cheaper implementations that can be selected to keep schedule.

During each iteration, the iteration manager interacts with customers to adjust the needs that will be addressed by the Iteration in the most cost/return effective manner and to ensure that solutions implemented in the iteration yield high value. Human customers (people) are consulted to generate Primary Needs Knowledge (such as Use Cases) and from those requirements additional Secondary Needs Knowledge may be developed between the GUI and underlying Service Layer.

Within an iteration, developers incrementally create and improve solutions that satisfy customer needs. Needs knowledge gathered from customers indicate needs and Potential Solutions. The potential solutions are prioritized based on how critical the needs they address are and based on any interdependencies (which should be kept to a minimum within an iteration). Each day developers work on a single Increment of one or more solutions. Increments of a solution that work better in at least one respect and as well if not better in all respects may be Committed to the iteration's new product. A single solution may have more than one increment in an iteration, but each must either be successive, or carefully Merged if two happen in parallel. Only the final set of solutions is included in the finished product for the iteration.
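The commit rule for increments (better in at least one respect, and as good or better in all respects) can be sketched in Python as below. The function name may_commit and the idea of scoring each "respect" with a number, where higher is better, are assumptions made only for this illustration:

```python
from typing import Dict

def may_commit(current: Dict[str, float], candidate: Dict[str, float]) -> bool:
    """True when the candidate increment is no worse than the current solution
    in every respect and strictly better in at least one.

    Both dictionaries map a respect (hypothetical, e.g. "speed") to a score.
    """
    no_worse = all(candidate[key] >= current[key] for key in current)
    better = any(candidate[key] > current[key] for key in current)
    return no_worse and better
```

An increment that trades one respect for another fails this check and would need to be weighed by the iteration lead rather than committed automatically.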

External risk is generated by agents outside the scope of the Iteration, including customers. Internal risk is generated by agents inside the iteration scope, including the project lead, iteration lead, developers, and any agent members of the Project Configuration. Risks represent forces that drive changing needs. When risks are ignored or unaddressed they can reduce or completely negate the value of solutions when applied to the needs they are designed to satisfy.

Solutions are procedural functions that satisfy agent needs. Solutions have a Quality rating, which is variably dependent on the agent applying it to a need and the configuration of the need over time.

Configuration is a declarative aggregation of information (facts) and solutions. Agents, projects, products, and needs have configuration. While agents Apply solutions recursively or in sequence, configuration can affect agent solution application through Reflection. Communication between agents can also initiate or affect solution application.

Project configuration includes information and solutions (such as development tools and processes) that help the developers produce solutions for the product.

The product's configuration is composed of information and Internal (private) solutions that may be used by the External (public) solutions exposed to agents using the product. A product's configuration may be altered during delivery as part of the delivery cost to make that product's solutions better suited to the agents using the product (an Instantiation of that version).

Additionally there are several special cases worth noting:

Developers on an iteration may be affiliated with one or more Organizations which donate time of the developers in the expectation that the resulting product will better meet some of the organization's needs. Such organizations are agents which are also a collection of (one or more) people, and are considered Patrons for the project.

Patrons may have direct or indirect influence on the vision, but care must be taken that this influence is constructive from the customer viewpoint, and not destructive. When patrons are also customers, their patronage may dominate and skew the vision (an external risk). When patrons are not customers, their patronage may complicate or cause inconsistency in the vision (an external risk).

Working directly with customers is not always possible, and it is not always ideal. Proxy Customers (customer advocates) and Customer Models (personas) are two substitutes for working with real customers.

All people, while agents, have a particular component of their configuration commonly called Free Will. Free Will has two primary consequences that must be considered as risks. First, since free will is part of a person's configuration, it can affect how they apply solutions. People that are agents applying a developed solution cannot be expected to apply it as intended or instructed the way system agents can. Second, people that are customers cannot be expected to apply solutions included as part of the software development process as intended or instructed. This is also a concern for developers, but customers have less incentive to follow the chosen software development process.

Project configuration includes software tools (including hardware), non-software tools, development processes, standards, working environment, staff that support the developers in their efforts (dedicated or otherwise), and other factors enabling development.

Projects in general are a special case of products, a sort of meta-product.

In addition to needs knowledge, which indicates how the product's solutions should address needs, there is a need for insight into potential external risk from the customer. The iteration lead should also collect Solution Risk Knowledge (Abuse Cases), risks that the product should specifically address. These are not project risks, and as such are not handled by the project lead. These are product risks, and are rarely an exhaustive list. As customers use the product, additional solution risk may be discovered and can be prioritized for addressing in a future Iteration.

Solution risk includes bugs implicitly, but bugs are often poorly classified. Defects include problems with a solution that are both strictly internal to the product and represent a failure to mitigate solution risks that were identified to be addressed within the iteration. Risks can change, and risks can be misunderstood. Deviations include problems with a solution that are a result of a mismatch between the Current Risks and solution risk knowledge. A solution with no defects is attainable, but solutions almost always have deviations.

Defining a Project, Iteration Two

A second look at the problem domain of solution development reveals a more generic model. There is a danger in distilling a model too far: it loses semantic meaning. Nonetheless, I feel this deeper level of abstraction is useful when considering the two very hard to reconcile domains of Computer Science and Social Science. Computer Science is an applied Math. I feel many professionals try to bleed an Engineering discipline out of Computer Science, and they feel by squeezing it with enough Social Science such a discipline can be attained.

Rather than trying to bleed a turnip, I prefer to follow a different popular methodology: separation of concerns. Social Science can tell us a lot about the process of programming, but the actual result of programming will always be Math. There needs to be a clean division between these two disciplines if either is to be applied effectively and appropriately.

Modern approaches like Object Oriented programming abstract the problem domain of programming computer hardware to a level much easier to understand for humans. Object Oriented programming in particular has a very weak basis in Math, whereas approaches like the Relational Model also abstract solutions while keeping a firm mathematical basis. Both approaches have benefits and weaknesses. Abstraction in general can weaken or limit a potential solution, but the potential trade-off is a set of solutions that are easier to understand, develop, and maintain.

A mathematical view of computer science supports many concepts needed to model solution development.

At the basic level, programs are comprised of Values and Functions. Through the lens of Social Science, these are essentially Things and Events respectively. Without leaving the mathematical basis, values are static and Immutable.  They are simply facts that exist, and are identified by the value itself. A Function is a Dependence between values, and it can be represented in a number of ways. Functions too are also immutable, and identified by the dependence they represent. These two simple concepts create many computational possibilities, but they lack certain catalysts.  The first issue is that something must apply values to functions actively. Doing this, one may line up functions end-to-end like dominoes to create limitless interesting results from a single initial value, but the product remains a static line.

The notion of Identity is important. For values and functions as defined above, the identity is the value or function itself. In order to make something Mutable, something changeable is needed. Variables are symbols. They may be a symbol for any value or for any function, and may even be a symbol for another variable. Variables are identified by the symbol, but the symbol only defines how they are referenced, not what they are.

Sets and Tuples are another needed concept from Math; these create Structure. A set is an unordered group of any number of unique values. A tuple on the other hand is an ordered group of a specific number of values, and may contain the same value more than once. Each of these can be considered a form of association both between the identity of the set or tuple and its components, and among its components. From these mathematical constructs, the notions of Type, Atoms, and Corpus are possible. A type is simply a set of values and/or functions. An atom is a typed variable, meaning that it has a symbolic link with a value, but that value must be a value in the set of a specific type. A corpus is a set of atoms.

It is important to consider that values can be a set, tuple, or a Trivial value. Trivial values (also called Scalars and indicated as * in the diagram above) have exactly one distinct component. Sets and tuples have zero or more distinct components. Tuples (also called Arrays), specifically Pairs, are important because they allow for order and branching, which is needed to establish procedure. A pair is a tuple with exactly two values. Functions can be defined as a pair whose values are tuples of the same cardinality (size). The Nth value in the second tuple can represent the value dependent on the Nth value in the first tuple. This definition reveals that functions are merely a specific type of atom.
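These definitions are small enough to sketch directly. The Python below is illustrative only: the names Atom, apply, SMALL_INTS, and double are hypothetical, a type is modeled as a frozenset, and a function is modeled literally as a pair of equal-length tuples:

```python
from typing import Any, FrozenSet, Tuple

class Atom:
    """A typed variable: a symbol bound to a value drawn from a type (a set)."""

    def __init__(self, symbol: str, type_: FrozenSet[Any], value: Any) -> None:
        self.symbol = symbol
        self.type_ = type_
        self.value = None
        self.bind(value)

    def bind(self, value: Any) -> None:
        """Rebind the symbol; the new value must belong to the atom's type."""
        if value not in self.type_:
            raise ValueError(f"{value!r} is not in the type of {self.symbol}")
        self.value = value

# A function as a pair of tuples of the same size: (inputs, dependent values).
Function = Tuple[Tuple[Any, ...], Tuple[Any, ...]]

def apply(function: Function, value: Any) -> Any:
    """Look up the Nth dependent value for the Nth input value."""
    domain, image = function
    return image[domain.index(value)]

SMALL_INTS = frozenset({1, 2, 3})
double: Function = ((1, 2, 3), (2, 4, 6))
x = Atom("x", SMALL_INTS, 2)
```

The atom x can be rebound to any value in SMALL_INTS, while the pair double never changes: the variable is mutable, the values and the function are not.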

While other cardinalities of tuples are useful, it is often best to use the simplest components possible. Pairs make it possible to represent information as a Tree of values. Just as it is possible to actively chain functions linearly, a tree can be actively traversed in a particular order: the first value in the pair, then the second. If the first value in the pair is another tree, its first value is traversed, then the second and then the second value of the original tree.
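The traversal order described here (first value, then second, descending into any nested pair) is a short recursive sketch in Python. The name traverse is hypothetical, and pairs are modeled as two-element tuples:

```python
from typing import Any, Iterator

def traverse(tree: Any) -> Iterator[Any]:
    """Yield the trivial values of a tree of pairs, first value before second."""
    if isinstance(tree, tuple) and len(tree) == 2:
        first, second = tree
        yield from traverse(first)   # descend into the first value
        yield from traverse(second)  # then continue with the second
    else:
        yield tree                   # a trivial value is a leaf

tree = ((1, 2), (3, (4, 5)))
```

Traversing this tree visits 1, 2, 3, 4, 5 in that order, which is the implicit linear structure a tree of pairs carries.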

A Process is a tree which has an implicit linear structure, has a specific input type, and has an output atom. A process is like a function, but it has the flexibility to Branch and a Constraint on the input value and output value.

Looking back at the first definition for the process of solution development, everything defined can be modeled using these few components.

Forces are a medium for change. They have a Target corpus and are created (owned) by an Agent. The target, or any component of that corpus, may be an input of the process. Entities are a form of corpus that have an Interface allowing a limited degree of access to any components of the corpus that would otherwise be unavailable due to Closure (encapsulation). Agents are simply entities that may create forces in addition to providing an interface.

Patterns, Templates, Process, and Configuration

Configuration is a specific thing and has definition. Even though most of the configuration for a project is intangible and subject to change, from moment to moment it is a specific structure of values. Part of a project's configuration is the people working the project, and part is the technology they are using to get the work done.

UP4edu does not deal too much with decisions on either of these two layers of configuration. It is assumed that the layer dealing with people is relatively fixed. There are some things your project configuration can influence when hiring new team members and assigning roles, but it is assumed that project configuration is not the primary concern when undertaking these activities. Similarly, project configuration can influence technology choices as it directly impacts the process of developing solutions, but not as it impacts the product itself.

All aspects of the project, including people and technology related issues are reflected by the configuration. It is important to understand the difference between the reflection of these facts in the configuration, and the factors deciding them. Process configuration does define what people in various roles are supposed to do in the scope of their work on the project.

How teams arrive at a process varies, but in Agile development the process should be adaptive. This is one divergence point on various prescriptions of Agile process. Some XP and Scrum books assert the two processes are not negotiable and reconfigurable, others indicate a certain level of flexibility. A process that cannot be changed is simply a process, one that can be tweaked in specific ways is a form of Template. I am personally a fan of Patterns, and so as much as possible UP4edu defines any prescriptions in such terms.

Regardless of the level of adaptability, who makes decisions regarding adaptations is reflected by, but not decided by, configuration.

One reason patterns are so useful, and applicable to the model UP4edu proposes to use is their definition is framed in terms of forces and solutions. Patterns are flexible in their application. One pattern does not imply the use of another, though using patterns in tandem may have reproducible benefits. Rather, application of patterns is dictated by the nature of the problem, the forces that need to be resolved. Templates lack this flexibility.

UP4edu is not strictly a Pattern Language, but rather a model of the solution development process that includes a pattern language. In the second part of this post I'll define UP4edu, what it values, and how to apply it.


Neapolitan XP

Another idea for my process name could be "Neapolitan XP". To be honest, XP is my least favorite of the so-called Agile processes. Honestly, XP is the least agile because it is so necessarily rigid. In Extreme Programming Refactored: The Case Against XP, Stephens and Rosenberg launch a knock-down, drag-out, no-holds-barred attack on Kent. This book is easily as relentless as it is poorly written. Maybe Kent deserves it... but nonetheless, just like TV wrestling, it makes all parties look pretty bad.

What I would like is something as far from "Vanilla XP" as possible.  There has to be a balance between the rigid prescriptions of XP and a directionless free form dance with no cohesion to the process that results in just being "Agile".

The book does make some good points. We (the developer) can't put everything off on the "customer". We certainly can't do EVERYTHING for them, but they ARE a customer.

The book suggests that pair-programming and design documents are mutually exclusive needs. If you have one, you don't need the other. Unsurprisingly design documents are the more attractive proposition of the two.

Knowing when not to refactor is a problem I struggle with constantly. I still feel that somewhere between reasonable prefactoring and planned, constant refactoring there is a better way to work. The issue with refactoring production code is important to consider. All the testing in the world won't catch every obscure problem that could occur when a refactor goes wrong.

There is one concept mentioned in the book that I disagree with. It talks about "refactoring interfaces", specifically advising not to do it (p. 220). Interfaces are not a program; interfaces are redesigned, not refactored. The code that generates the interface can be refactored, but that should result in no observable change. The code that generates the interface can also be changed in order to change the interface, but that is not a refactor. Refactoring belongs to a different concern in the domain of applications than interfaces do. Similarly, data is not "refactored". Refactoring should have no net effect on the observable functional behavior of a program. It might make the program run faster, use less memory, make fewer DB calls, etc., but in general the user of the program should see no functional change. Refactoring is, however, applicable to databases. As originally envisioned by Codd, databases are accessed through a sort of "sub language" and so are in essence a program.
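To make the "no observable change" test concrete, here is a minimal sketch in Javascript. The function names and the order data are made up for illustration; nothing here comes from the book. The point is only that the "before" and "after" versions must return the same result for the same input, otherwise the change is a redesign rather than a refactor.

```javascript
// Before: one function that mixes looping and arithmetic.
function orderTotalBefore(items) {
  var total = 0;
  for (var i = 0; i < items.length; i++) {
    total = total + items[i].priceCents * items[i].qty;
  }
  return total;
}

// After: the same behavior, restructured into smaller named pieces.
function lineTotal(item) {
  return item.priceCents * item.qty;
}

function orderTotalAfter(items) {
  return items.reduce(function (sum, item) {
    return sum + lineTotal(item);
  }, 0);
}

// A refactor is only a refactor if callers cannot tell the difference.
function sameObservableBehavior(items) {
  return orderTotalBefore(items) === orderTotalAfter(items);
}
```

A check like `sameObservableBehavior` is a rough stand-in for a regression test: it compares outputs only, so it says nothing about speed or memory, which a refactor is allowed to change.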

On the subject of Use Cases, Cockburn's "6 bits of precision" is referenced:

  • Bit 1 - Name Goal
  • Bit 2 - Describe Main Scenario (this is the level of resolution in an XP "User Story")
  • Bit 3 - Failure Conditions
  • Bit 4 - Failure Actions (level of resolution in traditional Use Cases)
  • Bit 5 - Data in/out Description
  • Bit 6 - Recipient of the message (level of resolution in Catalysis)

There is a difference between an iterative spiral and a never-ending loop. Stephens and Rosenberg suggest that XP embraces the latter, but both the argument to embrace such a thing and the suggestion that anything as often discussed as XP would seriously do so are preposterous. This underlines the primary flaw in the book: anything taken to the extreme is easy to pick apart. Doing so does proponents on both sides of the argument a disservice. The attack on YAGNI is the crippling chink in this book's armor. Out of all of the XP tenets, YAGNI is the most sound and well-established principle in software development.

One thing I like about Scrum is that it makes a distinction between an iteration and a delivery; specifically, each delivery is one or more iterations. In XP there is no such distinction. The two are the same. A constant stream of value and constant integration have merit; releasing early and often does not.

Task cards are useful. They can take many forms, but they are definitely useful in team work. Jira's implementation of task cards delivers high value with low overhead.

The authors' denial that interim releases have value is laughable. Anyone who has tried and succeeded at this knows the value. It isn't universally valuable, and in teams still learning to be Agile it is not even cost effective. Equally laughable is the authors' rejection of the idea that the people involved in a process, not the smart-aleck authors who use that project as a case study for rhetoric in a book, are the party truly qualified to define success. If a team learns how to work together better through trying XP, even if they don't adopt "Vanilla XP" in future projects, then XP has been successful. It seems the authors ignore the primary focus of XP: people.

Sadly, it seems that Stephens and Rosenberg devoted more time to thinking up parody lyrics to pad each section than to actually writing the book. This might explain why it so often slips into an overly irreverent henpecking of XP. The arguments are too far-fetched to be credible. Contrary to the footnote on page 372, this book isn't worth purchasing, nor was it worth it in 2003 when it was first published.

Learning from other people's mistakes when formulating your own process is bound to be only slightly less effective than making your own. I, like the authors, had an initial "ZOMG, what is this pinko crap?" reaction to XP. The difference is that I wrote a blog article about it before I realized there were some scraps of value, rather than over 350 pages of diatribe.


Crystal Meh

So, the more I think about it, the more I want to play around with my own home-brew Agile process. One thing I want to think about is Agile for a team of as few as one, which many processes don't address specifically. Everyone seems to LOVE pair programming, and clearly you can't pair program with only one person. In our group, the team size can and does fluctuate between deliveries, and even between iterations. This is not due to staffing changes; it is due to work allocation. With small teams this is bound to have a significant impact.

I know it isn't all that productive to "make up" a process, but it seems like a fun activity. I can introspectively assess what I've read, what I currently do and don't do, what works well and what doesn't. The idea is to propose a framework process, like Crystal or Scrum, that prescribes optional "features" and describes what they help accomplish and how they interact.

While the whole "Crystal" line is the intellectual property of Alistair Cockburn, I will internally refer to this method as "Crystal Meh". According to urbandictionary.com this is an actual controlled substance, but let's face it, that is a really stupid name for a drug. On the other hand, it makes perfect sense for a laissez-faire method loosely based on Crystal and Scrum. If anything comes of this, I suppose I'll have to come up with a better name though.


Putting Agile To Work

All the reading in the world won't prepare you for real life. I keep telling myself that, but it never seems to sink in.

I have checked out book after book on Agile. While they all have great ideas, I find that applying what I have learned is a slow process. One reason for this is I already have over a decade of programming experience under my belt, mostly flying solo. Old habits die hard. I am getting better at breaking up work into units that can reasonably be finished and checked in the same day. I'm not quite as proficient with taking work to be done and dividing and estimating it, but that is something I am working on.

We use Scrum in my department. I'm not sure when the decision was made that Scrum was going to be the method of choice, but we have only recently started having the prescribed daily meeting. We've been having the monthly meeting for at least half a year. For the most part things seem to be settling into a smooth operation. Starting a major project with a completely new technology, a team still learning to work together, a rather poorly defined development environment, and a process that we have haphazardly adopted has been quite the experience. How our process fits into the whole Scrum recommendation hasn't really been ironed out yet.

We are a small group and we don't have any automated facility for testing yet (that I am aware of). We do have a code repository (SVN) and for the most part seem to be able to manage that in conjunction with the project management software we have in place, Jira. I would attribute our relative level of success in the transition so far to those technologies and the various flavors of Eclipse we use. Some of us use plain Eclipse, and others use Zend Studio. In either case we all use plug-ins for PHP, Javascript (Aptana), and SVN to make editing the code easier.

I will say that I am not crazy about the Aptana plug-in specifically; it has a lot of things we don't need or want. Unfortunately, it is the only Javascript plug-in we have found that properly handles the ExtJS framework code we use.

Going Agile has not been painless. In our case this has more to do with how we went about it than with the switch itself. We had a project that was struggling because it wasn't getting the resources that were promised, and as a lone developer I wasn't able to sacrifice my vision of the project to meet the deadlines that had been set. When additional people were brought onto the team, there was a decision to switch technologies, and the decision was also made to wipe the slate clean of all the underpinnings I had built. I won't say this was a bad decision for long term progress, but it did put the project even further behind initially.

The change in technology for the project's architecture has proven to be a difficult hurdle. For the most part we have learned what we need to make progress, but it has put the project severely behind. This could have killed the project, and there were times I was ready to completely give up. I am fairly surprised that the constant slide in the schedule didn't result in an abortion of the project or some other serious consequence. We haven't fully adopted Scrum for this project, in the sense that the Product Owner and work backlog have not been formalized. That is something that will likely be ironed out as we come close to releasing our 1.0, which is a working proof of concept.

Our development environment was already in line with what we needed to make Agile work. I didn't have experience with the tools, but everyone else did. This is another scenario in which I can envision the project nearly, if not completely, failing. Without these tools, the progress to this point would not have been possible collaboratively. There have been several times where team members had no choice but to work on the same file at the same time. In a smaller or more mature project this might have been avoidable. Similarly, if we had all just been adopting Eclipse/SVN/Jira, the administrative overhead would likely have crushed the project.

One thing working in our favor is that no one on the team has rejected the process. We do still have to work with each other a bit to ensure that the process is followed, such as checking in work on a regular basis and logging work in Jira. I feel fairly adept at this now, but I know that, to an extent, my participation in that activity is tied to my motivation and confidence. Before the project officially became a team effort, I had been using Jira, albeit amateurishly, on my solo effort, to what seemed to be huge success until I started to fall behind. I gradually lost motivation to keep Jira updated and, as the problem compounded, stopped using it altogether.

Gradually, progress is improving. Adoption is slow, but we are getting there. Our team faces some turnover and possibly additions in the near future, so it will be interesting to see how it all goes.

I personally am more attracted to the Crystal method. I think that somewhere between the features of Crystal Clear and Scrum we could find a better way to work, especially in ways where both also overlap with RUP. Most of our challenges that remain are not process issues, but rather technical and personal issues. We are too unaccustomed to MVC, and personally I am trying my best not to buck the pattern entirely. I really don't feel MVC is appropriate for our design.


Javascript: The Good Parts That Reveal An Inconvenient Truth

I have had a feeling for quite some time now that something was wrong. For over two decades now I have been programming. Granted the types of programs I was writing in 1989 were quite trivial, mostly Commodore Basic. My first exposure to object oriented (OO) programming was PIL, an interpreted scripting language for the Pirch IRC client. It was around that time I also started tinkering with Javascript.

It wasn't until after I started college classes that I really worked extensively with OO programming; in fact, by the time I started my undergraduate work it was the only style of programming taught to computer science majors, aside from a semester of assembly. My program was even transitioning to Java, which is an exclusively object oriented language, as opposed to mixed languages like C++. I had some exposure to Lisp through my AI class, and to Perl. While I loved Lisp, my real love has been PHP. PHP itself has become increasingly OO over the past two versions.

Some programmers take OO for granted. It is an established paradigm. What I have learned gradually over the past couple of years, but have only recently become aware of, is how Object Oriented design fits into the bigger picture. This is something I am only starting to come to terms with, but bit by bit it becomes clearer.

Douglas Crockford's Javascript: The Good Parts is another step towards this understanding. Javascript and object oriented Lisp have a lot in common. Though syntactically (and thus superficially) Javascript is more like C, and thus also to an extent Java (of which it is an even more superficial namesake), its concept of objects comes from Self and Scheme. Scheme is one of the two major Lisp variants.

Reading through Javascript: The Good Parts, I was reminded of another book I had read recently about the Common Lisp Object System (CLOS), which is an extension of Common Lisp, the other major variant of Lisp. The concepts raised in both books reminded me of Date's rants on object oriented programming versus the Relational Model.

The most striking conclusion that I can draw from all this information is that, while there are things I like about OO, it is much less of a science and much more of a belief system. The prototyping variety of OO seems to be the more technically sound and correct way. Class-instance based OO is easier to grasp and, assuming one can settle on one implementation, it imposes limitations that make sense, but the variations from language to language are arbitrary.

This makes modern OO, which is predominantly the class-inheritance variety, little more than a programmer religion. We make leaps of faith based on the models that seem "right". There is nothing inherently wrong with this; it is a humanistic approach. The important consideration is to be cognizant of the motivating factor. We embrace class-inheritance for the same reason we program in high level languages: it is a more convenient way to comprehend.

As much as I rejected Date's notions early on, there are things he has to say about OO that make sense. The effectiveness and flexibility of the prototype approach to OO certainly seem to support his arguments. This version eliminates many of the issues while making OO more expressive. It does seem to make things more complicated, but in the case of Javascript this has a lot to do with the C-like syntax.
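For anyone who has not seen the prototype approach in action, here is a minimal sketch. The `animal` and `dog` objects are invented for illustration and are not taken from Crockford's book; the only language feature relied on is `Object.create`, which makes a new object that delegates to an existing one.

```javascript
// A plain object serves as the prototype; no class is declared anywhere.
var animal = {
  name: "animal",
  describe: function () {
    return this.name + " says " + this.sound();
  },
  sound: function () {
    return "...";
  }
};

// A new object delegates to `animal` for anything it does not define itself.
var dog = Object.create(animal);
dog.name = "dog";
dog.sound = function () {
  return "woof";
};

// Behavior added to the prototype later is visible to existing objects.
animal.legs = function () {
  return 4;
};
```

With this in place, `dog.describe()` returns "dog says woof" even though `dog` never defined `describe`, and `dog.legs()` works even though `legs` was added after `dog` was created. There is no type hierarchy to declare up front, which is where the extra flexibility comes from.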

The features of Javascript that fly in the face of class-inheritance OO are certainly tempting. They bring elegance to OO in different ways than the type-oriented approach I am working on. Hopefully as I get closer to implementation some new doors will open up.


Three Rules for Programming Security Effectively

Wanted to post this while still fresh in my head.

Security is an often overlooked facet of a successful project. The simple fact of the matter is, not every programmer can be a security expert and they shouldn't have to be to do their job. I would go so far as to say that most programmers should not even have to think about security on a daily basis. The problem is, how do you simultaneously maintain these two lofty ideals: making highly secure software while keeping your programmers focused on their specific functional tasks?

1. The average programmer should not have to consider security issues on a daily basis. This should be delegated to a member of the team that focuses part or all of their time on security. They need to have this time allocated and blocked. They can interrupt for critical situations, but cannot be interrupted by other critical situations because security is a core concern equal in severity to availability.

2. Security is built into the framework by the dedicated security team member. This team member proactively seeks to improve security in the allocated time and reactively corrects any new issues at the framework level whenever possible.

3. Other programmers will find security issues and issues caused by security measures. When this happens they need to involve the security team member. Not all security fixes can be integrated into the framework, but most can. By using the framework to automatically handle most security issues, programmers can more confidently focus on their assigned functionality.
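As a rough illustration of rules 2 and 3, here is what "security built into the framework" might look like for one narrow case, output escaping. Both `escapeHtml` and `render` are hypothetical helpers written for this post, not part of any real framework. The assumption is that the security-focused team member owns these two functions and everyone else just calls `render`.

```javascript
// Hypothetical framework helper, owned by the security-focused team member.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Framework-level render function: every substituted value is escaped by
// default, so feature code does not have to remember to do it.
// Placeholders with no matching value are left untouched.
function render(template, values) {
  return template.replace(/\{(\w+)\}/g, function (match, key) {
    return Object.prototype.hasOwnProperty.call(values, key)
      ? escapeHtml(values[key])
      : match;
  });
}
```

A programmer working on a feature would write something like `render("<p>Hello, {name}</p>", { name: userInput })` and never touch the escaping logic. If escaping turns out to be wrong or incomplete, it gets fixed once, in the framework, by the person whose job it is.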


Refactoring, One last push.

So, originally I had intended to blog about the whole process of refactoring the Twitter RDF feed tool. I had also intended to do the whole project in a week, or two at the most. This brings up two very important Agile concepts:

1. Documentation and other deliverables, while sometimes important, significantly drive up the cost of development. In this case, the documentation isn't directly related to the product, but to the process of creating it. Real documentation would not have resulted in this much of a delay unless it was a severe case of design/analysis paralysis. While there is quite a bit more to document on the process, some outside pressure has necessitated a speedy release. In this case the value of documenting the process does not outweigh the "losses" incurred by the service being down in the meanwhile.

2. Continuous delivery is key. The process of "improving" a product should not result in a severe (or even significant) delay in being able to make the next release. That is to say, if I've been doing this right, I should be able to push out a working product very quickly even though I am nowhere near done making all the improvements I would like. The account that the software was originally developed to enable has been "noticed" by a large number of people on campus. If I don't get it working soon, many of them will likely unsubscribe from the service, and while they are not "paying customers", the point isn't to directly profit; it is strategic placement.

As such, I have decided to cut back on the refactoring blogging (because I've been too busy to do it lately in any case) and focus on getting a working, cronable, script in place ASAP. My new goal is to have a working version in place by Monday.

The product itself will be documented, insofar as it will have PHP Doc generated information. Once operational, I'll do one or two more blog posts, post-mortem, discussing the work that went into converting the use cases into reusable components.


The Insanity I Am Working On, Zend Framework based MVC alternative.

So it is no secret that I haven't had as much time to work on the Refactoring Code blog posts as I would have liked lately. For a while I was pulling serious overtime, hadn't started going to the gym like I said I was going to, and wasn't really getting enough "me" time in general. I cut back to a more reasonable amount of mixed Overtime/Personal Project time, started the gym routine, and also started going to a couple of local gaming groups. My blood pressure is _real_ happy about that.

Aside from the refactoring projects, I've also been working on a code framework that spawned from a structured authoring framework. I sat down to code my ideas for a Structured Authoring Framework, based on a data model (called FLORA) that I've been working on, only to find that I needed a code framework to make the coding process manageable. I like Zend Framework, but I am not a fan of MVC in general. It's not that MVC isn't good in theory; it is just that I have yet to find an implementation I can live with.

My main work project has similar framework needs. It is internally called "Journey" because it began with the idea of being able to track personnel data over the course of a person's employment with the Library. It ballooned out to include features like managing personal information and membership groups in our Active Directory, Intranet/Document management (though fortunately it looks like we will be attempting to use Confluence for that particular functionality), and in general being a backbone database for all non-ILS data related to the business of the library. Journey's scope became so extensive that it has been dubbed the "E-Matrix of Library Administrative Data" (E-Matrix is the internal name for our internally developed ILS integration layer).

To meet the needs of these two ambitious projects I decided to first lay down a foundation with a framework based on my experience with other large scale projects. It roughly follows the MVC pattern, but with some concepts that make it particularly well suited to hybrid Application/SOA style implementations. The idea is that one can build an application and underlying services in tandem; the application is primarily supported by the services and additional "glue" code. The services also provide ready-made resources for other applications. The framework also needs a high level of database abstraction so that the database can be highly relational (as my boss puts it, "overly normalized") without exposing the complexity of such implementations. The framework needs to:

  • provide type-based data handling with a rich library of filter/validation/decoration tools,
  • provide multiple views into the data provided by the service layer,
  • handle authentication/access concerns for access to the service layer, and
  • integrate well with the non-MVC portions of the Zend Framework in general to cut down on the code base.

My boss is nudging my main work development process in the direction of SOA, and in particular the client-based Ext JS technology (I'll concede to calling it a "Framework" so long as I can keep the term in quotes) supported by Zend Framework. This combination of methodologies pretty much cuts most MVC patterns off at the knees, because the View and most of the Controller are actually implemented client side in one language (JavaScript in this case) and the Model and a few traditionally Controller concerns are kept on the server side in a totally different language.

The end result is finally coming into focus. I've settled on a temporary name of QP for the project (short for Quick Project); it is internally named "Quepie" since "QP" is a bit too short to be unique.

The backbone of the framework is a searchable registry/tree, called the Datatree. The tree is divided into primary branches called "directions". These were originally just the input and output directions, but a configuration branch has since been added, and others are conceivable. The idea behind directions is that searching always happens in only one direction; input and output are effectively firewalled off from each other, so accidentally using user input (or any input for that matter) becomes less likely. Each direction is further divided into branches, which may contain other branches or may contain a set of data. Each branch that accepts data can have a filter assigned to it, and all data that goes onto that branch is first run through the filter. A search of the tree can be as general as direction + data name, or it can specify any one or more possible branches. The idea is to make it easy to search for something specific while still allowing for a more generalized search.
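Until the diagrams exist, a toy version may help. The sketch below is in Javascript rather than PHP and is not the actual QP code; every name in it (`Datatree`, `branch`, `setFilter`, `set`, `find`) is a stand-in I made up for illustration. It also simplifies two things: a node may hold both data and sub-branches, and a search takes at most one branch path rather than several.

```javascript
// Hypothetical sketch of a Datatree-like registry; not the actual QP code.
function makeNode() {
  return { branches: Object.create(null), data: Object.create(null), filter: null };
}

function Datatree(directions) {
  this.roots = Object.create(null);
  directions.forEach(function (direction) {
    this.roots[direction] = makeNode();
  }, this);
}

// Walk (and create, when asked) the branch at `path` under one direction.
Datatree.prototype.branch = function (direction, path, create) {
  var node = this.roots[direction];
  if (!node) { throw new Error("Unknown direction: " + direction); }
  for (var i = 0; i < path.length; i++) {
    if (!node.branches[path[i]]) {
      if (!create) { return null; }
      node.branches[path[i]] = makeNode();
    }
    node = node.branches[path[i]];
  }
  return node;
};

Datatree.prototype.setFilter = function (direction, path, filter) {
  this.branch(direction, path, true).filter = filter;
};

// Data placed on a branch is run through that branch's filter first.
Datatree.prototype.set = function (direction, path, name, value) {
  var node = this.branch(direction, path, true);
  node.data[name] = node.filter ? node.filter(value) : value;
};

// Search one direction only. With no path, the whole direction is searched;
// with a path, only that branch and its children are searched.
Datatree.prototype.find = function (direction, name, path) {
  var start = this.branch(direction, path || [], false);
  if (!start) { return undefined; }
  var stack = [start];
  while (stack.length) {
    var node = stack.pop();
    if (name in node.data) { return node.data[name]; }
    Object.keys(node.branches).forEach(function (key) {
      stack.push(node.branches[key]);
    });
  }
  return undefined;
};
```

The part that matters is that `find` always starts from a single direction's root. A value stored under "input" can never turn up in a search of "output", which is the firewall described above, and anything stored on a filtered branch has already been through the filter by the time it can be found.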

As mentioned, the framework is very data type oriented, so there is also a component that handles the registration of different data types, and the appropriate filters, validators, and decorators (for various forms of output) for each. It is appropriately called the Datahandler. The Datatree uses the Datahandler to manage the filters for each branch.
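Again as a rough sketch only, here is how a Datahandler-style type registry could be shaped. It is written in Javascript rather than PHP, it is not the real QP code, and the method names (`register`, `get`, `accept`, `decorate`) and the "email" type are assumptions made up for the example.

```javascript
// Hypothetical sketch of a Datahandler-like type registry; not the QP code.
function Datahandler() {
  this.types = Object.create(null);
}

// Register a type with its filter, validator, and named decorators.
Datahandler.prototype.register = function (typeName, definition) {
  this.types[typeName] = {
    filter: definition.filter || function (v) { return v; },
    validate: definition.validate || function () { return true; },
    decorators: definition.decorators || {}
  };
};

Datahandler.prototype.get = function (typeName) {
  var type = this.types[typeName];
  if (!type) { throw new Error("Unknown data type: " + typeName); }
  return type;
};

// Filter first, then validate the filtered value.
Datahandler.prototype.accept = function (typeName, raw) {
  var type = this.get(typeName);
  var value = type.filter(raw);
  if (!type.validate(value)) {
    throw new Error("Invalid " + typeName + ": " + raw);
  }
  return value;
};

// Decorate a value for one form of output, e.g. "html". Formats with no
// registered decorator fall back to a plain string.
Datahandler.prototype.decorate = function (typeName, format, value) {
  var decorators = this.get(typeName).decorators;
  var known = Object.prototype.hasOwnProperty.call(decorators, format);
  return known ? decorators[format](value) : String(value);
};

// Example registration for a made-up "email" type.
var handler = new Datahandler();
handler.register("email", {
  filter: function (v) { return String(v).trim().toLowerCase(); },
  validate: function (v) { return /^[^@\s]+@[^@\s]+$/.test(v); },
  decorators: {
    html: function (v) { return '<a href="mailto:' + v + '">' + v + "</a>"; }
  }
});
```

In this shape, a branch filter in the Datatree could simply be a function that calls `accept` for the branch's data type, so the knowledge of what a valid email looks like lives in exactly one place.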

More later. (I'll also try to add some pretty diagrams).


Barakracy

I normally don't make political posts, but I'm really disappointed about something.

I was sitting around the game table, playing Power Grid with some friends, and we were talking about various things. The election year politics obviously came up... it's just that time. Power Grid has a "bureaucracy" phase. Everyone at the table is voting Obama. Incidental acquaintances aside, I don't think I'm friends with anyone who is voting McCain, at least no one willing to admit it.

The term "Barakracy" came into my head. I figured it was a long shot, but thought perhaps I was the first person to think of it, or at least I could be the first to blog about it. Unfortunately, not only have I been beaten to the punch line, but by people who are too ignorant to understand the difference between healthy bureaucracy and the ridiculous mess we have now.

I wasn't thinking of Barakracy as a negative term. The kind of change Obama could help drive into our political system would be a vast improvement over what we have now. Yes, Democrats are about more government involvement, but many people confuse the emphasis to mean simply more government. The tricks played by the Republican Party since the first Bush administration (traceable back at least as far as Reagan) have made it very clear that the hot-cold approach to government involvement doesn't work, if for no other reason than that it gives politicians even more opportunity to further their own financial advantage.

It's convenient to place the blame for the current mess we have in this country on Democrats, who are stereotyped as promoting "bigger government". Such a claim, however, is not accurate. The Republicans have an equal if not greater share in the blame because, when it comes right down to it, the party has a fundamental conflict of interest between the party ideals and how they manage to effect change for the people they are supposed to be representing.

I welcome Barakracy. Bureaucracy is necessary, the alternatives are anarchy or monarchy. Our representative government is built on a prerequisite foundation of bureaucracy.

By definition, democracy is bureaucracy. When implemented in the right proportions, bureaucracy is effective. Barakracy is a change for the better that we can believe in. I feel it is a term we should tout with pride.
