
Is the term ‘absolute truth’ meaningless?

Thorbjørn Mann, July 2020

Some thoughts about ‘absolute truths’, systems thinking and humanity’s challenges: an exploration of knowledge needed for a discourse that I suggest is critically significant for systems thinking related to questions about what to do about humanity’s big challenges. I apologize for the roundabout but necessary explanation.

‘What better be done’: absolute truth? 

There are recurring posts in Systems Thinking groups that insist on decisions being made by focusing on the ‘right things’, or on what ‘better (best) be done’, implying that what ‘better be done’ is a matter of ‘absolute, objective truth’. Thus, any suggestions about the issue at hand are derailed — dismissed — by calling them mere subjective opinions and by repeating the stern admonition to follow the absolute truth of ‘doing what better be done’, as if all other suggestions were not already efforts to do so.

Questions about what those truths may be are sidestepped, or answered by the claim that they are so absolute, objective and self-evidently true that they don’t need explanation or supporting evidence. Heretical questions about this are countered with questions such as “are you questioning that there are absolute truths?” Apart from the issue of whether this may be a tactic by the proponent of an answer (the one declared to be an absolute truth) to get that answer accepted, is it an effort to sidestep the question of what should be done altogether, stalling it in the motherhood issue of absolute truth? At any rate, it raises questions.

Does this call for a closer examination of the notion of ‘absolute truths’, and of how one can get to know them? What is an ‘absolute truth’ (as compared to a not-so-absolute one)?

Needed distinctions

There may be some distinctions that need a reminder (being old distinctions) and some clarification, beginning with the following:

‘IS’: states of affairs in ‘reality’ versus statements about them

There exist situations, states of affairs ‘s’, constituting what we call ‘reality’. Existing, they ‘are’, whether we know them or not (mostly, we don’t). And if we know and recognize such a state, we call it ‘true’. But isn’t that less a ‘property’ of a state ‘s’ than a label attached to a statement about ‘s’? About ‘s’ itself, is it not sufficient to simply say ‘it is’? So what do we mean by the expression ‘absolute truth’? As a statement about ‘s’, it would seem to imply that there are states of affairs that ‘are’ ‘absolutely true’ and others that aren’t. Would it not then be necessary to offer an explanation of this difference? If there isn’t one, does the ‘absolute’ part become meaningless and unnecessary?

So the practical use of ‘true’ or ‘false’ really refers to statements, claims about reality, not reality itself. When we are describing a specific situation ‘s’  or even claiming that it exists, we are making a claim, a statement.  When such a statement matches the actual state of affairs with regard to s, we feel entitled to say that the statement is ‘true’. Again: ‘truth’ is not a property of states of affairs but a judgment statement about ‘content’ statements or claims. 

Consider the claim that a statement ‘matches’ the actual state of affairs. Do we really know ‘reality’, and how would we know? Discussions and attempted demonstrations of this tend to use simple examples — for instance: “How many triangles are depicted in this diagram?” The simple ‘answers’ are ‘obviously true’ (even though people occasionally disagree even about those) — but only upon examination based on differently understood definitions of the concepts involved. The definitions are not always stated explicitly, which is a problem: it leads to the troublesome situation where one of the disagreeing parties can honestly refer to answers based on ‘their’ definition as ‘true’ and to other answers as ‘false’ (and consequently question the sanity or goodwill of anybody claiming otherwise). So are all those answers ‘absolutely true’, but each only given the appropriate related definitions and understanding?

The understanding of ‘triangle’ in the diagram example may be that of “three points not on the same straight line in a plane, connected by visible straight lines.” There may be a fixed, ‘true’ number of such triangles in the diagram. But if the definition of ‘triangle’ is just “three points not on the same straight line”, and it is left open whether the diagram intends to show a plane or a space, the answers become quite different and even uncountable: infinitely many, given the infinitely many points on the plane or in the space depicted by the diagram that lie in triangular position relative to each other.

The term ‘depicted’ also requires explanation: does it refer only to triangles ‘identified’ by lines connecting three selected points, lines drawn in a color different from the color of the plane (or space) of the diagram? If drawn in the same color, are they n o t ‘depicted’? Do the edges and corners of the diagram itself ‘count’ as ‘depicting’ the sides and apex of a triangle, or not? So even in this simple, ‘noncontroversial’ example, there are many very plausible answers, and the decision to call one or some of them ‘absolute truth’ begins to look somewhat arbitrary.

Probability

The labels ‘true’ or ‘false’ apply to existing or past states of affairs. Do they also apply to claims about the future (that is, to forecasts, predictions)? The predicted states of affairs are, by definition, not ‘true’ yet. The best we can do is to say that such a statement is more or less ‘probable’: a matter of degree we express by a number from 0 (totally unsure) to 1 (virtually certain), or by a ‘percentage’ between zero and 100.

Actually, we are usually not totally certain about the truth even of our claims about actual ‘current’ or ‘always’ states of affairs. We often make such claims only to find out later that we were wrong, or only approximately right, about a given situation. Even more so for more complex claims, such as whether a causes b and whether it will do so in the future. But it is fair to say that when we make such claims, we aim and hope to be as close to the actual situation or effect as possible. Can we just say that we should acknowledge the degree of certainty — or ‘plausibility’ — of our statements? Or acknowledge that a speaker may be totally certain about their claim, but listeners are entitled to have and express less certainty — e.g. by assigning a different certainty, probability or — I suggest — ‘plausibility’ to the claim? Leaving a crumb of plausibility for the ‘black swan’?

‘OUGHT’ claims and their assessment: ‘Plausibility’ rather than ‘truth’

For some other kinds of claims, the labels ‘true’ or ‘false’ are plainly not appropriate, not even ‘probable’. Those are the ‘ought’-claims we use when discussing problem situations (understood as discrepancies between what somebody considers to be the case or probable, and what that person feels ‘ought’ to be the case). The state of affairs we ‘ought’ to seek (or the means we feel we ought to apply to achieve the desired state) is — equally by definition — not ‘true’ yet. So should we use a different term? I have suggested that the label ‘plausible’ may serve for all these claims, expressed as a number between -n (totally implausible, or the opposite being virtually certain) and +n (virtually certain), for example with n = 1, and with the midpoint zero denoting ‘don’t know’, ‘can’t tell’. Reminder: these labels express just our states of knowledge or opinion, not the states of affairs to which they refer: we make decisions on the basis of our limited knowledge and opinions, not on reality itself (which we know only approximately or may be unsure about).
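A minimal sketch of this scale convention, assuming n = 1; the verbal labels in the code are illustrative only, not a fixed vocabulary:

# Plausibility scale: -n (totally implausible) to +n (virtually certain),
# with 0 meaning 'don't know / can't tell'. Here n = 1 is assumed.

def plausibility_label(pl, n=1.0):
    """Translate a plausibility value on the -n..+n scale into words."""
    if not -n <= pl <= n:
        raise ValueError("plausibility must lie between -n and +n")
    if pl == 0:
        return "don't know / can't tell"
    if pl > 0:
        return "virtually certain" if pl == n else "more plausible than not"
    return "totally implausible (the opposite virtually certain)" if pl == -n else "more implausible than not"

print(plausibility_label(0.7))    # more plausible than not
print(plausibility_label(-1.0))   # totally implausible (the opposite virtually certain)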

How can we gain plausibility of claims? 

The question then is: how do we get to know whether any of these claims are ‘true’, probable, or plausible, and to what degree? Matching? Or — since we can rarely attain complete certainty (knowing that there can be ‘black swans’ to shatter that certainty) — how can we increase the degree of plausibility we feel we can attach to a given claim? What are the means by which we gain plausibility about claims? Possibilities are:

1)  For ‘fact’-claims: 

1a) Personal observation, experiments, measurements, demonstration, ‘tests’. 

1b) Inference from other fact-claims and observations, using ‘logically valid’ reasoning schemes;  

1c) From ‘authorities’: other persons we trust to have properly done (1a) or (1b), and can or have explained this;

1d) Declaring them ‘self-evident’ and thus not needing further explanation.

2)  For ‘ought’-claims:

2a) The items equivalent to (1a) obviously don’t apply. So: personal preference, desire, need, accepted common goals or ‘laws’;

2b) Inference? The problem here is that inferences with ‘ought’-claims — what I call ‘planning arguments’ — are inherently not (deductively) ‘valid’ from a formal logic point of view, because the label ‘true’ does not apply. However, for some of the factual premises in these arguments, the reasons under (1) will apply and are appropriate.

2c) From authorities: either because they have done (2a) or (2b), or because they have the social status to ‘order’ or command ought-claims?

2d) ‘Self-evidence’?  For example: ‘moral norms’? Laws? 

Is ‘self-evident’ equal to ‘absolute’?

We could add claims about ‘meaning’, definition etc. as a third category. For all of them, is the claim of ‘absolute truth’ equivalent to ‘self-evident’? It is the only one for which explanation, justification or evidence is not offered; indeed, such support is claimed to be impossible or unneeded. What this means is: if there are differences of opinion about a claim, can the proponent of such a claim expect to persuade others to come to accept it as theirs? What if both parties should honestly claim or believe that theirs is the absolute truth? Claiming ‘absolute truth’ or ‘right’ or ‘self-evidence’ is not a good persuasion argument, but history tells us that, if repeated sufficiently often (brainwashing), it is surprisingly effective. If justification (e.g. by demonstration) is attempted, it turns into one of the other kinds.

So, for all these claims and their ‘justification’ support, different people can have different opinions (different plausibility degrees). This is all too frequently observed, and is the source of all disagreements, quarrels, fights, wars. The latter item (war) suggests that there is a missing means for acquiring knowledge in the list above: the application of coercion, force, violence, or, in the extreme, the annihilation of persons of different opinions. The omission is based on the feeling that it is somehow ‘immoral’ (no matter how frequently it is actually applied in human societies, from the upbringing of children to ‘law enforcement’ and warfare).

The need to shift attention to ‘decision criteria’ and modes acknowledging irreconcilable differences of opinion

There is, for all the goodwill urged by religious, philosophical and political leaders, the problem that even with ample efforts at explanation, exhortation, reasons, arguments and definitions, situations may occur where agreement on the claims involved cannot be achieved — yet the emergencies, problems and challenges demand that ‘something must be done’.

What this means, in my opinion, is that the noble quest for ‘truth’, probability, even plausibility as the better guide for community and social decisions — ‘solution’ criteria, making decisions on the basis of the merit (value, plausibility) of contributions to the discourse about what we ought to do (criteria that we ideally would all agree on!) — must be shifted to a different question: what criteria can we use to guide our decisions in the face of significant differences of opinion about the information supplied in the discourse? The criteria for evaluating the quality and plausibility of proposed solutions should be part of, but are not the same as, the criteria for good decisions. It is interesting to note that the most common decision mode — voting — in effect dismisses all the merit concerns of the ‘losing’ minority; arguably, it should be considered a crude crutch for the claim of ‘democratic’ ideals: equality, justice, fairness to all. Note also that the very crisis cry “Something must be done!” is often used as an exhortation tool to somehow generate ‘unity’ of opinions.

Issues for Systems Thinking

I suggest that this is an important set of issues for systems thinking. Systems Thinking has been claimed to offer ‘the best currently available foundation for tackling humanity’s challenges’. But has it focused its work predominantly on the ‘IS’ questions of the planning and policy-making discourse, rather than on the ‘ought’ issues? On better understanding of the (existing) systems in which we will have to intervene? On better prediction of different plan proposals’ future performance (simulation)? Sure, those tasks are immensely important and the work on these questions admirable. But are they the whole task?

As far as I can see, the other (‘ought’) part of planning and policy-making work is at best still in an embryonic state — both a) the development of better evaluation (measures of the merit of planning discourse contributions, leading to ‘solution merit’ criteria), and b) the development of better criteria for planning decisions in the face of acknowledged disagreement about the merit of the information contributed to the discourse. Systems thinking appears to many (perhaps unfairly so) as suggesting that decisions should be based on the assessment of ‘fact’ data alone, ignoring the proper assessment of ‘ought’ claims and how they must be combined with the ‘fact’-claims to support better decisions.

The development of a better planning discourse platform

Of course, the ‘discourse’ itself about these issues is currently in a state that does not appear to lead to results for either of the above criteria: the design of the discourse for crafting meaningful decisions about humanity’s challenges is itself an urgent challenge. If I had not convinced myself, in the course of thinking about these issues, that ‘absolute truth’ is a somewhat inappropriate or even meaningless term, I would declare this an ‘absolutely true and important’ main task we face.

–o– 

EVALUATION IN THE PLANNING DISCOURSE: WEIGHTING

Thorbjørn Mann, April 2020

WEIGHTING: ‘WEIGHING THE PROS AND CONS’

Concepts and Rationale


      Much of the discussion and the examples in the preceding sections may seem to have taken the assumption of weighting for granted: weighting of aspects in formal evaluation procedures, or of deontic (‘ought’-) claims in arguments. The entire effort of designing better platforms and procedures for public planning discourse is focused in part on exploring how the common phrase of “carefully weighing the pros and cons” in making decisions about plans could be supported by specific explanations of what it means and, more importantly, how it would be done in detail. Within the perspectives of formal evaluation or assessment of planning arguments (see previous posts on formal evaluation procedures and the evaluation of planning arguments), the question of ‘why’ appears not to require much justification: it seems almost self-evident that some of the various pro and con arguments carry more ‘weight’ in influencing the decision than others. Even if there is only one ‘pro’ and one ‘con’, shouldn’t the decision depend on which argument is the more ‘weighty’ one?
      The allegorical figure of Justice carries a balance for weighing the evidence of opposing legal arguments. (Curiously, the blindfolded lady is supposed to base her decision on the heavier weight, not on the social status or power or wealth of the arguing parties; but then, is she not even supposed to see the tilt?) Of the many aspects in formal evaluation procedures, there may be some that really don’t ‘matter’ much to any of the parties affected by the problem or the proposed solution that must be decided upon. Decision-makers making decisions on behalf of others can (should?) be asked to explain their basis of judgment. Wouldn’t their answer be considered incomplete without some mention of which aspects carry more weight than others in their decision?
While it does not seem that many such questions are asked (perhaps because the questioners are used to not getting very satisfactory answers?), there is no lack of advice for evaluators about how they might express this weighting process. For example, how to assign a meaningful set of weights to different aspects and sub-aspects in an evaluation aspect ‘tree’. But the process is often considered cumbersome enough to tempt participants to skip this added complication of making such assignments, and instead to raise questions of ‘what difference does it make?’, whether it is really necessary, or how meaningful the different techniques for doing this really are.
       Finally, there are significant approaches to design and planning that propose to do entirely without recourse to explicit ‘pro and con’ weighting. Among these are the familiar traditions of voting, and decision rules of ‘taking the sense’ of the discussion by a facilitator in pursuit of consensus (or the appearance of consensus or consent) after more or less organized and thorough discussion, during which the weight, relevance or significance of the different discussion entries is assumed to have been sufficiently well articulated. Another is the method of sequential elimination of solution alternatives (for example by voting ‘out’, not ‘in’) until there is only one alternative left. A fundamentally different method is that of generating the plan or solution from elements, or according to accepted rules, that have been declared valid by authority, theory, or tradition, and that are assumed to ‘guarantee’ that the outcome will also be good, valid, beautiful etc.
       Since the issue is somewhat confused by being discussed in various different terms (‘weights of relative importance’, ‘priorities’, ‘relevance’, ‘principles’, ‘preferences’, ‘significance’, ‘urgency’), and there are as yet unresolved questions within each of the major approaches, some exploration of the issue seems in order: to revive what looks at this point like a needed, unfinished discussion.


                  Figure 1 — Weighting in planning evaluation: overview

Different ways of dealing with the ‘weighting’ issue

Principle
      A first, simple form of expressing opinions about importance is the use of principles in the considerations about a plan. A principle (understood as not only the ‘first’ and foremost consideration but a kind of ‘sine qua non’ or ‘non-negotiable’ condition) can be used to decide whether or not a proposed plan meets the condition of the principle, and to eliminate it from further consideration if it doesn’t. Principles can be lofty philosophical or moral tenets, or simple pragmatic rules such as ‘must meet applicable governmental laws and regulations to get the permit’ — regardless of whether a proposed plan might be further refined or modified to meet those regulations, or an exemption be negotiated based on unusual considerations. If there are several alternative proposals to be evaluated, this usually requires several ’rounds’ of successive elimination identifying ‘admissible’, ‘semi-finalist’ and ‘finalist’ contenders up to the determination of the winning entry, by means of one of the ‘decision criteria’ such as simple majority voting — which here would be not voting ‘in’ for adoption or further consideration, but voting ‘out’.

Weight ‘grouping’
       A more refined approach that considers evaluation aspects of different degrees of importance is that of assigning those aspects to a few importance groups, such as ‘highly important’, ‘important’, ‘less important’, ‘optional’ and ‘unimportant’, perhaps giving the aspects in these groups ‘weights’ such as ‘4’, ‘3’, ‘2’, ‘1’ and ‘0’, respectively, to be multiplied with a ‘quality’ or ‘degree of performance’ judgment score before being added up. The problem with this approach can be seen by considering the extreme possibility of somebody assigning all aspects to the highest category of ‘highly important’, in effect making all aspects ‘equally important’: for n aspects, each one contributes 1/n of the weight to the overall judgment.
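A minimal sketch of this grouping scheme, using the group labels and weights mentioned above; the aspect scores and the -3..+3 judgment scale are assumptions for illustration:

# Weight 'grouping': each aspect is assigned to an importance group, the group
# weight (4..0) is multiplied with the aspect's quality score, and the products
# are added up (here normalized by the sum of the weights used).

GROUP_WEIGHTS = {"highly important": 4, "important": 3, "less important": 2,
                 "optional": 1, "unimportant": 0}

def grouped_score(aspects):
    """aspects: list of (quality_score, group_label) pairs."""
    weights = [GROUP_WEIGHTS[group] for _, group in aspects]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one aspect must have a non-zero weight")
    return sum(w * score for w, (score, _) in zip(weights, aspects)) / total

# Hypothetical plan judged on three aspects (scores on a -3..+3 scale):
plan = [(+2, "highly important"), (-1, "important"), (+3, "less important")]
print(round(grouped_score(plan), 2))        # 1.22

# If every aspect is put in the same group, each one simply contributes 1/n of
# the overall judgment, i.e. the result is just the plain average of the scores:
same_group = [(+2, "highly important"), (-1, "highly important"), (+3, "highly important")]
print(round(grouped_score(same_group), 2))  # 1.33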

Ranking and preference
      The approach of arranging or ‘ranking’ things in the order of preference (on an ordinal scale) can be applied to the set of alternatives to be evaluated as well as to the aspects used in the evaluation. Decision-making by preference ranking — e.g. for the election of candidates for public office — has been studied extensively, e.g. by Arrow [1], finding insurmountable problems for decision-making by different parties, due mainly to ‘paradoxical’ transitivity issues. Simple ranking also does not recognize measurable performance (on a ratio or difference scale) where this is applicable, making a coherent ‘quality’ evaluation approach based only on preference ranking impossible.
      An interesting variation of this approach, for deciding whether a single proposal should be accepted or rejected, is attributed to Benjamin Franklin. It consists of listing the pro and con arguments in separate columns on a sheet of paper, then looking for pairs of pros and cons that seem to be equally important, and striking those two arguments out. The process is continued until only one argument, or one unequal pair, is left; if this argument, or the weightier one of the two, is a ‘pro’, the decision will be in favor of the proposal; if it is a ‘con’, the proposal should be rejected. It is not clear how this process can be applied to group decision-making without recourse to other methods of dealing with different outcomes reached by different parties, such as voting.
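A crude sketch of this cancellation idea for a single evaluator; the argument labels and weights are invented, and the greedy pairing rule below is a simplification of Franklin's informal procedure (he also allowed, e.g., one pro to cancel two lesser cons, which is not modeled here):

def franklin_decision(pros, cons, tol=0.1):
    """pros, cons: lists of (label, weight). Returns 'adopt', 'reject' or 'undecided'."""
    pros = sorted(pros, key=lambda x: x[1], reverse=True)
    cons = sorted(cons, key=lambda x: x[1], reverse=True)
    kept_pros, kept_cons = [], []
    while pros and cons:
        p, c = pros.pop(0), cons.pop(0)
        if abs(p[1] - c[1]) <= tol:        # roughly equally important: strike both out
            continue
        (kept_pros if p[1] > c[1] else kept_cons).append(max(p, c, key=lambda x: x[1]))
    kept_pros += pros                      # whatever was never paired survives
    kept_cons += cons
    pro_weight = sum(w for _, w in kept_pros)
    con_weight = sum(w for _, w in kept_cons)
    if pro_weight > con_weight:
        return "adopt"
    if con_weight > pro_weight:
        return "reject"
    return "undecided"

# Hypothetical arguments about a plan:
pros = [("saves travel time", 0.8), ("creates jobs", 0.5)]
cons = [("high construction cost", 0.8), ("noise impact", 0.2)]
print(franklin_decision(pros, cons))   # 'adopt': 'creates jobs' survives the pairing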
Interestingly, preference or importance comparison is often suggested as a preliminary step towards developing a more thoroughly considered set of weightings in the next level:

Weights of relative importance
       As indicated above, the technique of assigning ‘weights of relative importance’ to the aspects on each ‘branch’ of an evaluation aspect tree has been part of formal evaluation techniques such as the Musso-Rittel procedure [2] for buildings (discussed in previous posts), as well as of proposals for the systematic evaluation of pro/con arguments [5]. These weights of relative importance — expressed on a scale of zero to 1 (or zero to 100), subject to the condition that all weights on the respective level must add up to 1 (or 100, respectively) — indicate the evaluator’s judgment about ‘how much’ (by what percentage or fraction) each single aspect judgment should determine the overall judgment. In this view, the ‘principle’ approach above can be seen as simply assigning the full weight of 1.0 or 100% to the one aspect in the discussion that the evaluator considers a principle, overriding all other aspects.
      To some, the resulting set of weights may seem somewhat arbitrary. The task of having to adjust the weights to meet the condition of adding up to 1 or 100 can be seen as a nudge to get evaluators to consider these judgments more carefully, not just assign meaningless weights arbitrarily: to make one aspect more important (by assigning it a higher weight), that added weight must be ‘taken away’ from other aspects.
         Arbitrariness can also be reduced by using the Ackoff technique [3] of generating a set of weights that can be seen as ‘approximately’ representing a person’s true valuation. It consists of ranking the aspects, assigning them numbers (on no particular scale), and then comparing pairs of aspects, deciding which one is more important than the other, and adjusting the numbers accordingly, until a set of numbers is achieved that ‘approximately’ reflects the evaluator’s ‘true’ valuation. To make this set comparable to other participants’ weightings (so that the numbers carry the same ‘meaning’ for all participants), it must then be ‘normalized’ by dividing each number by the total, getting the set back to adding up to 1 (or 100). Displaying these results for discussion will further reduce arbitrariness; it can actually induce participants to change their weightings to reflect recognition of, and (empathetic) accommodation for, others’ concerns that they had not recognized in their own first assignments. Of course, such discussion requires that the weighting is made explicit.
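A minimal sketch of the normalization step; the raw importance numbers are invented stand-ins for the result of Ackoff's ranking and pairwise-comparison process, which is itself a judgment exercise, not a computation:

# Weights of relative importance: raw importance numbers (on no particular
# scale) are 'normalized' by dividing each by their total, so that the
# resulting weights add up to 1 and carry the same meaning for all participants.

def normalize(raw_weights):
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("raw importance numbers must have a positive sum")
    return {aspect: w / total for aspect, w in raw_weights.items()}

# Hypothetical raw importance numbers after ranking and pairwise comparison:
raw = {"cost": 8, "safety": 10, "image": 3, "flexibility": 4}
weights = normalize(raw)
print({a: round(w, 2) for a, w in weights.items()})
# {'cost': 0.32, 'safety': 0.4, 'image': 0.12, 'flexibility': 0.16}
print(round(sum(weights.values()), 6))   # 1.0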
      Taking the ‘test’ of deliberation seriously — of enabling a person A to make judgments on behalf of another person B — this can now be seen to require that A could show how A can use not only the set of aspects and the criterion functions but also B’s weight assignments for all aspects and sub-aspects etc., and of course the same aggregation function, resulting in the overall judgment that B would have made. It likely would be different from A’s own judgment using her own set of aspects, criteria, criterion functions and weighting. The technique using weights of relative importance thus looks like the most promising one for meeting this test. By extension, to the extent societal or government regulations are claimed to be representative of the community’s values, what would be required to demonstrate even approximate closeness of the underlying valuation?

Approaches avoiding formal evaluation       

The discussion of weighting would be incomplete without mentioning some examples of approaches that sidestep entirely both evaluation of plans of the ‘formal evaluation’ kind and other uses of weighting. One is the well-known Benefit-Cost Analysis; the other relies on the process of generating a plan following a procedure or theory that has been accepted as valid and that is taken to guarantee the validity or quality of the resulting design, plan or policy.

Expressing weights in money: Benefit-Cost Analysis
      Benefit-Cost Analysis is based on the fact that the implementation of most plans will cost money — cost of course being the main ‘con’ criterion for some decision-making entities. The entire question of value and value differences is thus turned into the ‘objective’ currency of money: are the benefits (the ‘pros’) we expect from the project worth the cost (and other ‘cons’)? This common technique is mandatory for many government projects and policies. It has been so well described, as well as criticized, in the literature that it does not need a lengthy treatment here, though some critical questions it shares with other approaches will be discussed below.
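As a toy illustration only (the discount rate and the benefit and cost streams are invented, and real benefit-cost analyses involve much more than this), the basic comparison might look like:

# Toy Benefit-Cost comparison: discount yearly benefit and cost streams to
# present value and compare them as a ratio.

def present_value(stream, rate):
    """stream: amounts for years 0..n (year 0 = now); rate: yearly discount rate."""
    return sum(amount / (1 + rate) ** year for year, amount in enumerate(stream))

benefits = [0, 40_000, 60_000, 60_000, 60_000]        # hypothetical yearly benefits
costs    = [150_000, 10_000, 10_000, 10_000, 10_000]  # initial plus recurring costs
rate = 0.05

bc_ratio = present_value(benefits, rate) / present_value(costs, rate)
print(round(bc_ratio, 2))   # about 1.04: discounted benefits barely exceed discounted costs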

Generating plans by following a ‘valid’ theory or custom
       Approaches that can be described as ‘generative’ design or planning processes rely on the assumption that following the steps of a valid theory, or using rules and elements that have been determined to be ‘valid’ to construct the whole ‘solution’, will thereby guarantee its overall validity or quality. Thus, there is no need to engage in a complicated evaluation at the end of that process. Christopher Alexander’s ‘Pattern Language’ [4] for architecture and urban design is a main recent example of such approaches — though it can be argued that it is part of a long tradition of similar efforts, of rules or pattern books for proper building going back to antiquity, either as cultural traditions known to the community or as ‘secrets’ of the profession. He claims that following this ‘timeless way’ of building “frees you from all method”.
      However, the argument that the individual patterns and the rules for connecting these elements into the overall design somehow ‘guarantee’ the validity and quality of the overall design (if followed properly) merely shifts the issue of evaluation back to the task of identifying valid patterns and relationship rules. This is discussed — if at all — in very different language, and often simply posited by the authority of tradition (‘proven by experience’) or, as in the Pattern Language, by that of its developer Alexander or of followers writing pattern languages for different domains such as computer programming, ‘social transformation’, or composing music. To the best of my knowledge, the evaluation tools used in that process remain to be studied and made explicit. The discussion of this issue is made more difficult than necessary by Alexander’s claim that the quality of patterns — their beauty, value, ‘aliveness’ — is ‘a matter of objective fact’.

Do weighting methods make a difference?

      A question likely to arise in a project whose participants are confronted with the task of evaluating proposed plans, and therefore with having to choose the evaluation method they will use, is whether this choice will make a significant difference in the final judgment. The answer is that it definitely will, but the extent of the difference will depend on the context and circumstances of each project — especially on whether there are significant differences of opinion in the affected community. The trouble is that the extent of such differences can only be seen by actually applying several of the more detailed techniques to a given project and comparing the decision outcomes; an effort unlikely to be undertaken in a situation where the question is whether even one technique is ‘worth the effort’ at all.
      The table below shows a very simple example of such a comparison. For the stated assumptions of a few evaluation aspects and weighting assignments, the different ways of dealing with the weighting issue actually will yield different final plan decisions. This crude example cannot, of course, provide any general guidelines for choosing the tools to use in any specific project. The above list and discussion of policy decision options can at best become part of a ‘toolkit’ from which the participants in each project can choose to construct the approach they consider most suitable for their situation.

       Table 1 Comparison of the effect of different weighting approaches
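Since the table itself cannot be reproduced here, the following sketch (with invented aspect scores and weights, so the numbers are not those of the original table) illustrates the kind of comparison it makes: the same two plans come out differently under three treatments of importance:

# Invented example of how the treatment of 'importance' can change the outcome:
# the same aspect scores (-3..+3) for two plans, evaluated three ways.

scores = {  # hypothetical quality scores per aspect
    "Plan A": {"cost": +2, "safety": -1, "image": +3},
    "Plan B": {"cost":  0, "safety": +2, "image": +1},
}
weights = {"cost": 0.2, "safety": 0.6, "image": 0.2}   # weights of relative importance

def equal_weight(judgments):
    return sum(judgments.values()) / len(judgments)

def weighted(judgments):
    return sum(weights[a] * s for a, s in judgments.items())

def passes_principle(judgments, aspect="safety", minimum=0):
    return judgments[aspect] >= minimum   # 'principle': eliminate plans that fail

for plan, j in scores.items():
    print(plan, "equal:", round(equal_weight(j), 2),
          "weighted:", round(weighted(j), 2),
          "passes safety principle:", passes_principle(j))

# Equal weights favor Plan A (1.33 vs 1.0); the weighted sum favors Plan B
# (0.4 vs 1.4); and a safety 'principle' eliminates Plan A outright.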


      The ‘weights of relative importance’ form of dealing with different degrees of importance in the evaluation considerations is used both in formal evaluation procedures of the Musso-Rittel type and, in adaptation, in the argument evaluation approach for planning arguments [5]. It may be considered the most useful form for approximately representing different bases of judgment. However, even for that purpose, there are some questions — for all these forms — that need more exploration and discussion.

Questions and Issues for further discussion

       Apart from the question whether the apparent conflict between evaluation techniques using weighting approaches, and those avoiding evaluation and thus weighting at all, can be settled, there are some issues about weighting itself that require more discussion. They include contingency questions: about the stability of weight assignments over time and different, changing context conditions, their applicability at different phases of the planning process, and the possibilities (opportunities) for manipulation through bias adjustments between weights of aspects and the steepness (severity) of criterion functions for those aspects.

The relationship between weighting and the steepness of criterion functions
       A perhaps minor detail is the relationship between the weight assigned to an evaluation aspect and the criterion function for that aspect in a person’s ‘evaluation model’. A steep criterion function curve can have the same effect as a higher weight for the aspect in question. To some extent, making both the weightings and the criterion functions of all participants explicit and visible for discussion in a particular project may help to counteract undue use of this effect, e.g. by asking where the criterion function should cross the ‘zero’ judgment line (‘so-so, neither good nor bad, but anything above that line still acceptable’) and thus prevent extreme severity of judgments. This would assume considerable sophistication both on the part of individuals attempting such distortion and on the part of other participants in the discourse in detecting and dealing with it. But both in personal assessments and in efforts to define common social evaluations (regulations) expressed in terms of criterion functions, such as those implied by the suggestions in [8], there remains a potential for manipulation that at the very least should encourage great caution in accepting evaluation results as direct decision criteria.
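A small numerical illustration of this interaction, assuming simple linear criterion functions clipped to the -3..+3 judgment scale (the budget, cost and slope values are invented):

# The weighted contribution of an aspect is weight * criterion_function(x).
# Doubling the steepness of the criterion function has the same effect on the
# overall judgment as doubling the weight, which is why both should be made
# explicit and discussed together.

def linear_criterion(x, zero_crossing, slope):
    """Judgment on a -3..+3 scale, clipped; x is the performance measure."""
    return max(-3.0, min(3.0, slope * (zero_crossing - x)))

cost = 260_000           # hypothetical predicted cost
budget = 250_000         # where the judgment crosses zero

shallow = linear_criterion(cost, budget, slope=1 / 20_000)   # -0.5
steep   = linear_criterion(cost, budget, slope=1 / 10_000)   # -1.0

print(0.4 * shallow)     # weight 0.4, shallow curve -> contribution -0.2
print(0.2 * steep)       # weight 0.2, steep curve   -> contribution -0.2 (same)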

Tentative conclusions and outlook

     These issues suggest that it is far from clear whether they can eventually be settled in favor of one or the other view. What does this mean for the concern triggering this investigation: to explore what provisions should be made for the evaluation task in the design of a public planning platform? Any attempt to pre-empt the decision by mandating one specific approach or technique should be avoided, to prevent it from itself becoming an added controversy distracting from the task of developing a good plan. So, given the current state of the discussion, should the platform for the time being offer participants just information — a ‘toolkit’ — about the possible techniques at their disposal? Can ‘manuals’ with guidance for their application, and perhaps suggestions for the circumstances in the context or the nature of the problem in which each is appropriate, offer discourse participants in projects with wide, even global, participation adequate guidance for their use? Or will it take more general education to prepare the public for adequately informed and meaningful participation?

     The emerging complexity of the issues discovered about even this minor component of the evaluation question could encourage opponents of such cumbersome procedures. Are calls for stronger leadership (from groups asking for leadership with systems thinking, better ‘awareness’ of ‘holistic’, ecological and social inequality issues, or other moral qualities) actually indicators of public unwillingness to engage in thorough evaluation of the public planning decisions we are facing? Or just of inability to do so? Inability caused perhaps by inadequate education for such issues, compounded by inadequate information and the lack of accessible and workable platforms for carrying out the needed discussions and judgments? Or is there also some desire for power at play, for such groups to themselves become those leaders, empowered to make decisions for the ‘common good’?

Notes, References

[1] Kenneth J. Arrow, 1951, 2nd ed., 1963. Social Choice and Individual Values, Yale University Press.
[2] Musso, A. and Horst Rittel: “Über das Messen der Güte von Gebäuden” In “Arbeitsberichte zur Planungsmethodik‘, Krämer, Stuttgart 1971.
[3] Ackoff, Russell: “Scientific Method”, John Wiley & Sons, 1962.
[4] Alexander, Christopher: “A Pattern Language“, Oxford University Press, 1977.
[5] Mann, T: ‘The Fog Island Argument’ XLibris, 2009, or “The Structure and Evaluation of Planning Arguments” , INFORMAL LOGIC, Dec. 2010.
[6] Mann, T.: “Programming for Innovation: The Case of the Planning for Santa Maria del Fiore in Florence”. Paper presented at the EDRA (Environmental Design Research Association) Meeting, Black Mountain, 1989. Published in DESIGN METHODS AND THEORIES, Vol 24, No. 3, 1990. Also: Chapter 16 in “Rigatopia — the Tavern Discussions“, Lambert Academic Publication 2015.
[7] Mann, T: “Time Management for Architects and Designers” W. Norton, 2003.
[8] “Die Methodische Bewertung: Ein Instrument des Architekten. Festschrift für Professor Arne Musso zum 65. Geburtstag“, Technische Universität Berlin, 1993; Also: Höfler, Horst: Problem-Darstellung und Problem-Lösung in der Bauplanung. IGMA-Dissertationen 3, Universität Stuttgart 1972.

                                                     –o–

EVALUATION IN THE PLANNING DISCOURSE: CRITERIA AND CRITERION FUNCTIONS

An effort to clarify the role of deliberative evaluation in the planning and policy-making process. Thorbjørn Mann, March 2020

CRITERIA AND CRITERION FUNCTIONS

Concepts and Rationale

One of the key aspects of evaluation and deliberation was discussed earlier (in the section on deliberation) as the task of explaining to one another the basis of our evaluation (quality / goodness) judgments: ‘objectification’. It means showing how a subjective ‘overall’ evaluative judgment about something is related to, or depends on, other — ‘partial’ — judgments, and ultimately how a judgment is related to some objective feature or ‘criterion’ of the thing evaluated: a measure of performance. Taking this idea seriously, the concept, its sources, and the process of ‘making judgments a function of other judgments’, and especially of criteria, should be examined in some more detail.

There is another reason for that examination: it turns out that criteria and criterion functions may offer a crucial connection between the different ‘perspectives’ involved in evaluation in the planning discourse: the view of ‘formal evaluation’ procedures such as the Musso-Rittel procedure [1, 2], the systems modeling domain, and the argumentative model of planning.

The typical systems model is concerned with exploring and connecting the ‘objective’ components of a system and the variables describing the interactions between them, for example in ‘simulation models’ of the system’s behavior over time. The concern is with measures of performance: criteria. The systems model does not easily get involved with evaluation, since this would have to tackle the subjective nature of individuals’ evaluation judgments: the model output is presented to decision-makers for their assessment and decision. But this approach also often falls victim to the temptation of declaring some ‘optimal’ value of an objective performance variable to be the proper basis for a decision.

The familiar ‘parliamentary’ approach to planning and policy decision-making accepts the presentation of a proposed plan and the exploration of its ‘pro and con’ arguments as the proper basis for decisions (but then reverts to voting as the decision-making tool, which potentially permits disregarding all concerns of the voting minority — a different problem). The typical arguments in such discussions or debates rarely get beyond invoking ‘qualitative’ advantages and disadvantages — evaluation ‘aspects’ in the vocabulary of formal evaluation procedures — and refer to quantitative effects or consequences (‘criteria’) only in a rhetorical and not very systematic manner. The typical ‘planning argument’ assumption of the conditions under which its main instrumental premise will hold (see the section on argument evaluation) is usually not even made explicit but taken for granted, even though it would actually call for a thorough description of the entire system into which the proposed plan will intervene, complete with all its quantitative data and expected tracks into the future.

These considerations suggest that the concepts of criteria and criterion functions can be seen as the (often missing) link between the systems modeling view, the argumentative discourse, and the formal evaluation approach to planning decision-making.

Criteria types

Besides the measures of a solution’s performance discussed below, another understanding of ‘criterion’ pertains to the assessment of the explanations and judgments themselves: the level of confidence in a claim, the level of plausibility of arguments, and the degree of importance of an aspect. Since the assessment of plans involves expected future states of affairs that cannot be observed and measured as matters of fact in reality (not being ‘real’ yet, just estimated, predicted), those estimates, even of ‘objective’ features, must be considered ‘subjective’, no matter how well supported by calculations, systems simulation, and the consistency of past experience with similar cases. The degree of certainty or plausibility of such relationships may be considered by some an ‘objective’ feature of the matter — but what we make decisions on, and refer to in our explanations of our judgments to each other, are subjective estimates of that degree, and those will be the result of the discussion, debate and deliberation of the matter at hand. These criteria may be called ‘judgment assessment criteria’.

‘Solution performance’ criteria

These criteria are well known, and have been grouped and classified in various ways according to evaluation aspect categories, in architecture starting with Vitruvius‘ triad of ‘firmness, utility (commodity) and delight’ (beauty). Interestingly enough, the explanation of beauty attracted more attention, in terms of exploring measurable criteria such as proportion ratios, than the firmness and utility aspects. In the more recent ‘benefit/cost’ approach, that concern has somehow disappeared, or has been swallowed up in one of the benefit/cost categories that measure both kinds with the criterion of money, which is arguably more difficult to connect with beauty in a convincing manner. Meanwhile, engineering has made considerably more progress in actually calculating structural stability, the bearing loads of beams and trusses, the resistance of buildings to wind loads, the thermal performance of materials, etc.

For all the hype about functional aspects, the development of adequate criteria has been less convincing: the size of spaces for various human activities, or walking distances between different rooms in places like hospitals or airports, are admittedly easy to measure, but all seem to be missing something important. That sense of something missing may have been a major impulse for the effort to get at ‘Quality’ in Christopher Alexander’s work on a ‘Pattern Language‘ for architecture and environmental design [3]. ‘Universal design’ looks at the functional use of spaces by people with various disabilities, but has paid more attention to suggesting or prescribing actual design solutions than to developing evaluation criteria. My explorations of a different approach try to assess the value of buildings by looking at the adequacy of places in the built environment for the human occasions they accommodate, as well as at the image the design of the place conveys to occupants, developing criteria such as ‘functional occasion adequacy’, ‘occasion opportunity density’ and ‘image adequacy‘ [4]. Current concerns about ‘sustainability’ or ‘regenerative’ environmental design and planning seem to claim more, even dominant, attention than the earlier aspects; but the development of viable evaluation criteria has not yet caught up with the sense of crisis. (An example is the attention devoted to the generation and emission of CO2 into the atmosphere: it seems to play a crucial role in global climate change, but the laudable proclamations by governments or industries of plans to ‘achieve a level of x emission of CO2 within y years’ seem somewhat desperate (just doing something?), addressing neither the real effects of the climate change itself nor the question of ‘what about the time after, and up until, date y?’ (paying ‘carbon offsets’ or taxes?).)

Measurement scales for criteria

In exploring more adequate criteria to guide planning decisions, it is necessary to look at how criteria are measured: both to achieve a better ‘objective’ basis for comparing alternative plans, and to avoid neglecting important aspects just because they are and remain difficult to measure in acceptably objective ways and will have to rely on subjective assessments by affected parties.

The ‘qualitative’ assessment of evaluation aspects will use judgments on the nominal and ordinal scales, both for the ‘goodness’ judgments and for the ‘criteria’. The fact that these are mostly subjective assessments does not relieve us of the need to explain to each other how they are distinguished, what we mean by certain judgments, and how they are related: that is, how ‘criteria’ judgments explain ‘goodness’ judgments: the question of criterion functions, which usually focus on explaining how our subjective ‘goodness’ judgments relate to (depend on) ‘objectively’ measurable performance criteria.

Criterion functions

Types of criterion functions

The concept of ‘criterion function’ was defined as the demonstration of the relationship between subjective quality judgments and (usually) objective features or performance measures of the thing evaluated,  in the form of verbal explanation,  equations, or diagrams. 

A first kind of distinction between different kinds of such explanations can be drawn according to the scales of measurements used for both the quality judgments and the criteria. The following table shows some basic types based on the measurement scales used: 

Table 1 — Criterion function types based on judgment scales

For simplicity, the types for the difference and ratio scale are listed together as ‘quantitative’ kinds. Further distinctions may arise from consideration whether the scales in question are ‘bounded’  by some distinct value on one or both ends, or ‘unbounded‘ — towards +∞ or -∞.  

Another set of types relates to attitudes about where the ‘best‘ and ‘worst‘ performance values are located. These attitudes will call for different shapes of the diagrams:

“The more, the better”; “The less, the better”; “The value x on the criterion scale is best, smaller or larger values are worse”; “The value x on the criterion scale is worst, lower or higher values are better”.

 

The attitude ‘the more, the better’ will have the ‘couldn’t be better’ score at infinity, while for the opposite, ‘the less, the better’, it will be at zero (or even at -∞?); or the best or worst scores may be located at some specific value x of the performance criterion scale.
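As a sketch of these four attitudes, the following functions map a performance value x to a judgment on a -3..+3 scale; the particular curve forms and the ‘spread’ parameter s are my assumptions for illustration, not equations taken from the procedures discussed here:

import math

U = 3.0   # judgment scale runs from -U ('couldn't be worse') to +U ('couldn't be better')

def more_is_better(x, s=1.0):
    # judgment rises from -U at x = 0 and approaches +U as x grows
    return -U + 2 * U * (1 - math.exp(-x / s))

def less_is_better(x, s=1.0):
    # judgment starts at +U at x = 0 and approaches -U as x grows
    return U - 2 * U * (1 - math.exp(-x / s))

def best_at(x, x_best, s=1.0):
    # judgment peaks at +U at x = x_best, falling off toward -U on both sides
    return -U + 2 * U * math.exp(-((x - x_best) ** 2) / (2 * s ** 2))

def worst_at(x, x_worst, s=1.0):
    # mirror image: judgment bottoms out at -U at x = x_worst
    return U - 2 * U * math.exp(-((x - x_worst) ** 2) / (2 * s ** 2))

print(round(more_is_better(0), 2), round(more_is_better(5), 2))   # -3.0  2.96
print(round(best_at(1.618, x_best=1.618), 2))                     # 3.0 (a 'best value' attitude)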

Criterion function examples

Table 8.2  Criterion functions type 1 and 2

Table 8.3  Criterion functions  type 3,4,5,6 

A common type 6 example, with a bounded judgment scale and a performance scale bounded at zero at the low end and unbounded (+∞) at the other end, is the following. Asked to explain the basis of our subjective ‘goodness / badness’ (or similar) judgment about a proposed plan, we can respond by drawing a diagram showing the objective performance measurement scale with its units as a horizontal line, and the judgment scale on the vertical axis: for example, judging the ‘affordability’ of proposed projects on a chosen judgment scale of -3 to +3, with +3 meaning ‘couldn’t be more affordable’, -3 meaning ‘couldn’t be more unaffordable’, and the midpoint of zero meaning ‘can’t decide, don’t know, cannot make a judgment’.

Figure 2  A type 6 criterion function of ‘affordability’ judgments related to the cost of a plan

In the following, the discussion will be focused mainly on functions of type 6 — judgments expressed on a +U to -U  scale (e.g. +3 to -3) with a midpoint of zero for ‘don’t know, can’t decide;  neither good nor bad’, and some quantitative scale for the performance criterion. 

The criterion function lines can take different shapes, depending on the aspect. For some, like the cost aspect in the first example above, the rule ‘the more, the worse’ will call for a line declining towards +∞ on the right; many aspects call for a ‘the more, the better’ line rising from zero (or -∞?) towards +∞ at the opposite end; for others there may be a ‘best’ or ‘worst’ value in between. Some people may wish to have a building front in what is widely considered the ‘most beautiful’ proportion, the famous ratio 1:1.618…

Figure 8.3  — Four different ‘attitude’ curves of type 6 criterion functions

Expectations for Criterion functions; Questions

There are some expectations of rationality attached to the criterion function concept. The line in the affordability example expresses the judgment of a cost-conscious client, and of course getting the project ‘for free’ would deserve the score of +3, ‘couldn’t be better / more affordable’. The line would approach the bottom judgment of -3 towards infinity: for any cost, however large, it could be even worse. So +3 and -3 judgment scores should be assigned only if the performance r e a l l y couldn’t be better or worse, respectively. Furthermore, we would expect the line to be smooth, in this case smoothly descending: it should not have sudden spikes or valleys. If the cost of a plan A could be reduced somewhat, the resulting score for the revised solution A’ should not be lower than the score for the original version of A. But should that prohibit superstitious evaluators from showing such dips in their criterion function lines, e.g. for superstitiously ‘evil’ numbers like 13? There are many building designs that avoid heights resulting in floor levels with that number — or, if the buildings are higher, just don’t show those floors on the elevator buttons.

 Where should a person’s judgment line ‘reasonably’ cross the zero axis? It might be at the amount of money the client has set aside for the project budget: as long as the cost is on the ‘+’ side, it’s ‘affordable’, and the lower the cost, the better; the higher, the worse and less affordable. This shows that the judgment line, the ‘criterion function‘, will be different for different people: it is subjective, because it depends on the client’s budget (or credit limit), but ‘objectified’ (explained) by showing how the affordability judgment score relates to the actual expected cost. For a wealthier client, the line would shift toward the right; a less affluent client would draw it falling more steeply on the left. (Of course even this simple and plausible criterion might raise discussion: does ‘cost’ mean ‘initial construction cost’ or ‘client equity’, or some time-related cost such as ‘average annual cost including mortgage payments etc.’, or ‘present value of all the costs, initial plus annual, each discounted back to present worth, for a specified planning horizon’?)
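One smooth curve that happens to satisfy these expectations (+3 at zero cost, crossing zero exactly at the budget, approaching -3 for ever-higher costs) is sketched below; the specific functional form and the budget figure are my assumptions, not a formula taken from [5] or from Rittel's lectures:

# Affordability judgment on the -3..+3 scale as a function of expected cost:
# +3 at zero cost, 0 exactly at the client's budget, approaching -3 as the
# cost grows without bound; the whole curve shifts with the client's budget.

def affordability(cost, budget):
    if cost < 0 or budget <= 0:
        raise ValueError("cost must be non-negative and budget positive")
    return 3.0 * (budget - cost) / (budget + cost)

budget = 300_000                       # hypothetical client budget
for cost in (0, 150_000, 300_000, 600_000, 3_000_000):
    print(cost, round(affordability(cost, budget), 2))
# 0 -> 3.0, 150000 -> 1.0, 300000 -> 0.0, 600000 -> -1.0, 3000000 -> -2.45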

There can be more complex criterion functions, with two or more variables defining the judgment categories. Figure 8.4 shows a function diagram for clothing sizes (pants) — a variation of the simple version in the above example. The sizes are roughly defined by ranges of pants leg length and waistline (roughly, because different manufacturers’ styles and cuts will result in some overlap in both dimensions). The judgment scale is the ‘comfort’ or ‘fit’ experienced by a customer, expressed e.g. on the +3 to -3 scale. The ‘best’ combination would of course be the ‘bespoke’ solution that can only be achieved by a tailor creating the garment for the specific measurements of each individual customer. The ‘fit’ judgments — in the third dimension — will form a smooth mountain with its +3 top located above those specific measurements, with widening altitude lines (isohypses) for less perfect fits. The ‘so-so’ or ‘just acceptable’ range would cover an area within one of the ‘size’ regions, or actually overlap their borders. The area would likely be a kind of ellipse, with a narrower range for leg length and a greater allowed variation of waistline (accommodated by a series of holes in the belt, for before- and after-dinner adjustment…). This example also demonstrates nicely that the evaluation judgment ‘fit’ is a personal, ‘subjective’ one even when it involves ‘objectively’ measurable variables.

Figure 8.4  —  A ‘3D’ criterion function for a ‘feature’ domain defined by two variables.

In his Berkeley lectures, Rittel proposed mathematical equations for the four basic function shapes, to calculate the judgment scores for different performance values. This may be useful for evaluation tools that have been agreed upon as ‘standards’, such as government regulations. However, for lay participants expressing their individual assessments, specifying the different parameters needed to generate the specific curves would be unrealistic. Should the curves not rather be a fuzzy broad band expressing the approximate nature of these judgments, instead of a crisp fine line?

Figure 8.5  Should the criterion function be drawn as ‘fuzzy’ lines?

 So it will be more practical to simply ask participants to draw their personal line by hand, with  a fat pencil or brush, after indicating the main preferences about the location of ‘best‘ and ‘worst’ performance, and where the lines should cross the center ‘zero’ judgment axis. 

Equations expressing evaluation judgments would undoubtedly be desirable, if not necessary, for AI tools that might aim to use calculations with successive approximation to find ‘optimal’ solutions. But whose evaluations should those be? The discussion of evaluation so far has shown that the judgment part of objectification is subjective; getting ‘universal’ or societally accepted ‘norms’ would require agreed-upon (or imposed, by evaluation or systems consultants) aggregated ‘curves’ (see the sections on aggregation and decision criteria). Some authors [5, 6] simplify or circumvent this issue by proposing simple straight lines — for which equations are easy to establish — such as in the example below, suggesting that these should be used as common policy tools like government regulations. This needs more discussion. The example shows a criterion function for the predicted total cost of a specific project, with the function crossing the ‘zero’ line at some ‘neutral’ or ‘acceptable’ value of cost for a building of the given size. The question arises why a solution achieving a cost lower than 105,000 (where the line breaks from +5 into the sloped line) should not get a better judgment; but it would be more cumbersome to establish the equation for a curve that gradually approaches zero cost on the left and infinitely high cost on the right and also crosses the zero judgment line at the selected neutral value of 210,000. The equation shown in the second line Y.1 of the example is easy to generate and use, but arguably somewhat arbitrary.

Figure 8.5  Simplifying the judgment curves with straight lines  [5]

Criteria for discourse contributions; judgments, argument assessment. 

Could the criterion functions be modified according to the plausibility (confidence) assessment used for the evaluation of argument plausibility (whose deontic ought-premise is identical or conceptually linked to the ‘goodness’ aspect of formal evaluation of the Musso-Rittel type)? The corresponding criterion function lines would be flattened towards the ‘zero’ line (honestly representing ‘don’t know’) the closer the plausibility judgment for the deontic premise approaches that zero value (on the assumed -1 to +1 plausibility scale). This would still express the person’s preferences on the criterion, but adjust the impact of that aspect according to the level of confidence that the solution will pursue and achieve that goal.

Figure  8.6 —  Plausibility – modified quality criterion functions.
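One possible reading of this ‘flattening’, in which the quality judgment for the aspect is simply scaled by the plausibility assigned to the corresponding deontic premise (an assumption of mine, not a formula given in the text or the figure):

# 'Flattening' a quality criterion function toward the zero line according to
# the plausibility pl (on the -1..+1 scale) assigned to the deontic premise.
# At pl = +1 the original curve is used; at pl = 0 the aspect honestly says
# 'don't know' (flat zero); a negative pl would reverse the curve.

def plausibility_adjusted(judgment, pl):
    if not -1.0 <= pl <= 1.0:
        raise ValueError("plausibility must lie between -1 and +1")
    return pl * judgment

print(plausibility_adjusted(+2.4, 1.0))   #  2.4  -- fully plausible goal
print(plausibility_adjusted(+2.4, 0.3))   #  0.72 -- weakly supported goal
print(plausibility_adjusted(+2.4, 0.0))   #  0.0  -- 'don't know'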

There are also questions about the way these functions might be manipulated to generate ‘bias’. For example: A participant who has assigned the aspect in question a low weight of relative importance might be tempted to draw the criterion function line steeper or less steep to increase that aspect’s impact on the overall assessment. 

The extent to which participants will be led to consider evaluation aspects or arguments contributed by other participants, and make them part of their own evaluation, will depend on the degree of confidence, plausibility and credibility with which these entries are offered: how well is a claim ‘supported’ by evidence or further arguments? This aspect is sometimes discussed in general terms of ‘breadth‘ and ‘depth’ of support, or in more ‘scientific‘ terms by the amount of ‘data’, the rigor of collecting the data, and the logic of inference and statistical ‘significance’ of its analysis. It should be obvious that simple measures such as ‘counts’ of breadth (the number of different claims made in support of a judgment) and depth (the number of claims supporting those claims and their support: aspects, sub-aspects, arguments, and support of premises and evidence for each premise, etc.) are meaningless if those claims lack credibility and plausibility, or are entirely ‘made up’.

Support of claims or judgments can be ‘subjective’ or ‘objective’. The general attitude is that ‘objective’ claims well supported by ‘facts’ and ‘scientific validity‘ carry a greater strength of obligation for others to accept in their ‘due consideration‘ than ‘subjective’ claims without further supporting evidence or argument, claims that each person may have the right to believe but cannot expect everybody else to accept as theirs. A possible rule may be to introduce the concern for others as a standard general aspect in the overall evaluation aspect list, but to keep the impacts of a criterion on one’s own part and the impacts on others as separate aspects with separate criterion functions. The weight or impact we then accord that aspect in our own judgment (i.e. how much our judgments are influenced by somebody’s argument or claim) will very much depend on the resulting degree of plausibility. All this will be a recurring issue for discussion.

The ‘criterion function’ for this second kind of criterion, plausibility, will take a slightly different form (and mathematical expression, if any) than those pertaining to the assessment of the ‘goodness’ of the object. For example, the plausibility of a pro or con argument can be expressed as the product (multiplication) of the plausibility judgments pl of all its (usually two or three) premises:

Argpl(i) = pl(FI-premise) * pl(D-premise) * pl(F-premise) * pl(Inference rule)

of the standard planning argument:

D(PLAN A) (Plan A ought to be adopted) because

FI(A –> Outcome B, given conditions C) (Given C, A will produce B), and

D(B) (B ought to be achieved), and

F(C) (Conditions C are / will be present).

The ‘criterion function’ for this assessment, for only the two main premises FI(A –> B) and D(B), takes the form of a 3D surface in the ‘plausibility cube’:

Figure  8.7 — Argument plausibility as a function of  premise plausibility (two premises)

Here, D(x) denotes the Plan proposal, D(y) is the deontic claim of desired outcome, and F(xRELy)  is the factual-instrumental claim that Plan x will produce outcome y.  [7]
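
A minimal Python sketch of this product rule (the function name is hypothetical): it simply multiplies whatever premise plausibility judgments a participant has assigned, on the assumed -1 to +1 scale.

def argument_plausibility(premise_pls):
    """Plausibility of a 'standard planning argument' as the product of
    the plausibility judgments (-1..+1) of its premises, following
    Argpl(i) = pl(FI) * pl(D) * pl(F) * pl(inference rule)."""
    result = 1.0
    for pl in premise_pls:
        result *= pl
    return result

# A 'pro' argument whose premises are all judged fairly plausible:
print(argument_plausibility([0.8, 0.7, 0.9]))   # -> 0.504
# With premise plausibilities between 0 and +1, the product can never
# exceed the lowest of them: the weakest premise dominates.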

Evaluation and time: changing future performance levels

All assessments of the ‘goodness’ (quality) or plausibility of plans are expectations about the future. So judgments about a plan’s effectiveness are — explicitly or implicitly — based on some assumption about the time in the future at which the expected performance will be realized. For some kinds of projects, it will be meaningful to talk about plan effects immediately upon implementation: ‘fixing’ a problem for good once executed. The Musso/Rittel and similar criterion functions are based on that assumption. However, many if not most public plans will reach full effectiveness only after some initial ‘shake-down’ period (during which the problem may actually be expected to first get worse before getting better), and to different degrees over time. For most plans, a ‘planning horizon’ or life span is assumed; expected benefits will vary over time, and eventually decline and stop entirely. The only specific assessment criteria for this are the computations of economic aspects: initial versus recurring costs and benefits, and their conversion into ‘present value’, ‘annual’ or ‘future value’ equivalents, based on personally different discount rates for the conversion and on estimates of ‘planning horizons’.

As soon as this is taken into consideration, it becomes obvious that differences of opinion may be based on different assumptions about this, and the need arises to make these assumptions more explicit. This means that the expected ‘performance track’ of different plan solutions over time should be established and made visible in the evolving criterion functions. This aspect (to my knowledge) has not been adequately explored and integrated into evaluation practice.

Figure 8.8 — Evaluation of alternative plans over time

The diagram is a first attempt at displaying this. It could be seen as the task of comparing two different plans for dealing with the issue of human CO2 emissions against the expected ‘do nothing‘ alternative. One plan (A) will show continuing emission levels for some period before reversing the direction of the trend; the other (B) is assumed to take effect immediately but not as strongly as plan A. This suggests that the better basis of comparison would be the ‘areas’ of ‘improvement’ or ‘worsening’ in the judgment surface over time — shaded in the diagram for plan A.
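
A rough sketch of such an ‘area’ comparison might look like the following; the tracks, time steps and the simple summation are invented for illustration (a crude stand-in for the shaded area in the diagram, not a standard method).

def net_improvement_area(plan_track, do_nothing_track, dt=1.0):
    """Net 'area' between a plan's expected judgment (or performance)
    track and the 'do nothing' track over the planning horizon.

    Both tracks are lists of scores at equal time steps dt (e.g. years).
    Positive contributions count as 'improvement', negative ones as
    'worsening'.
    """
    return sum((p - d) * dt for p, d in zip(plan_track, do_nothing_track))

# Hypothetical ten-year tracks: plan A gets worse before it gets better,
# plan B improves immediately but less strongly.
nothing = [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
plan_a  = [0, -1, -2, -2, -1,  1,  3,  5,  7,  9]
plan_b  = [0,  0,  1,  1,  2,  2,  3,  3,  4,  4]
print(net_improvement_area(plan_a, nothing))  # -> 64.0
print(net_improvement_area(plan_b, nothing))  # -> 65.0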

Another question arising in this connection — besides the suggestion above that the expected trends should be drawn as ‘fuzzy’, broad tracks rather than the crisp lines printed out by computer simulations — is that any plausibility (probability) estimates for the predictions involved are also likely to decline over time, from initial optimistic certainty down, more honestly, towards the zero middle line: “not sure”, “don’t know”.

Preliminary conclusions

For discussion, including criterion functions in the deliberation process offers some interesting improvement possibilities compared to conventional practice:

  •   A more detailed, specific description of the basis of judgment of participants in the discourse;
  •   The ability to develop overall group or ‘community’ measures of the collective merit of proposed plans, with specific indication of the plan details about which participants disagree, and thus opportunities for finding plan modifications leading to improved assessment and acceptance. For example, while it is possible to construct functions based on preference rankings of solutions (which show neither the spread of the ranking scores nor the overall location of ranking clusters on a performance measure scale), the comparison of criterion function curves can facilitate the identification of ‘overlap’ regions of acceptable solutions;
  •   It should be obvious that overall ‘group’ assessment indicators must be based on some aggregation of individual (or partial group) judgment scores; these can then be used for varieties of Pareto-type analysis and decision criteria. Instead of just using e.g. ‘averaged’ group scores — or scores ‘weighted’ by the number of members of the different subgroups or parties — decision criteria can be developed based on such aspects as the degrees of improvement offered to different subgroups by the different plan versions. (See the section on decision criteria, which should be clearly distinguished from the evaluation criteria discussed here, used for individual assessment.)

The questions arising from this tentative discussion suggest that this part of the evaluation component of the planning discourse, and especially of public planning discourse with wide participation by affected parties spread over different administrative constituencies, needs more research and discussion.

–o–

References

[1] Musso, Arne and Horst Rittel: “Über das Messen der Güte von Gebäuden”, in “Arbeitsberichte zur Planungsmethodik 1”, Stuttgart 1969. In English: “Measuring the Performance of Buildings”, Report about a Pilot Study, Washington University, St. Louis, MO, 1967.

[2] Dehlinger, Hans: “Deontische Fragen, Urteilsbildung, Bewertungssysteme”, in “Die Methodische Bewertung: Ein Instrument des Architekten”, Festschrift zu Prof. Musso’s 65. Geburtstag, Technische Universität Berlin 1993.

[3]  Alexander, C. et al. “A Pattern Language”  Oxford University Press, New York 1977. 

[4] Mann, T.: “Built Environment Value as a Function of Occasion and Image”, Academia.edu. Also: “Rigatopia”, LAP Lambert Academic Publishing, Saarbrücken 2015.

[5] Musso, Arne: “Planungsmodelle in der Architektur”, Technische Universität Berlin, Fachgebiet Planungsmethoden, Berlin 1981.

[6] Höfler, Horst: “Problem-darstellung und Problem-lösung in der Bauplanung”, IGMA-Dissertationen 3, Universität Stuttgart 1972.

[7] Mann, Thorbjoern: “The Fog Island Argument”, Xlibris, 2009. Also: “The Structure and Evaluation of Planning Arguments”, in INFORMAL LOGIC, Dec. 2010.

EVALUATION IN THE PLANNING DISCOURSE — AI SUPPORT OF EVALUATION IN PLANNING

Part of a series of  issues to clarify the role of deliberative evaluation in the planning and policy-making process. Thorbjørn Mann, February 2020.

The necessity of information technology assistance

A planning discourse support platform aiming at accommodating projects that cannot be handled by small F2F ‘teams’ or deliberation bodies must use current (or yet-to-be-developed) advanced information technology, if only to handle communication. The examination of evaluation tasks in such large project discourse has, so far, also shown that serious, thorough deliberation and evaluation can become so complex that information technology assistance for many tasks will seem unavoidable, whether in the form of simple data management or more sophisticated ‘artificial intelligence‘.

So the question arises what role advanced Artificial or Augmented Intelligence tools might play in such a platform. A first cursory examination will begin by surveying the simpler data management (‘house-keeping’) aspects that have no direct bearing on actual ‘intelligence’ or ‘reasoning’ and evaluation in planning thinking, and then exploring possible expansion of the material being assembled and sorted, into the intelligence assistance realm. It will be important to remain alert to the concern of where the line between assistance to human reasoning and substituting machine calculation results for human judgment should be drawn.

‘House-keeping’ tasks

a. File maintenance. A first ‘simple’ data management task will of course be to gather and store the contributions to the discourse, for record-keeping, retrieval and reference. This will apply to all entries, in their ‘verbatim‘ form, most of which will be in conversational language. They may be stored in simple chronological order as they are entered, with date and author information. A separate file will keep track of authors and cross-reference them with entries and other actions. A log of activities may also be needed.

b. ‘Ordered’ or ‘formatted’ files. For a meaningfully orchestrated evaluation in the discourse, it will be necessary to check for and eliminate duplication of essentially the same information, to sort the entries, for example according to issues, proposals, arguments, factual information — perhaps already in some formatted manner — and to keep the resulting files updated. This may already involve some formatting of the content of ‘verbatim’ entries.

c. Preparation of displays, for overview. This will involve displays of ‘candidates’ for decision, the resulting agenda of accepted candidates; ‘issue maps’ of the evolving discussion; and evaluation and decision results and statistics.

d. Preparation of evaluation worksheets.

e. Tabulating, aggregating evaluation results for statistics and displays.

‘Analysis’ tasks, examples

f. Translation. Verbatim entries submitted in different languages and their formatted ‘content’ will have to be translated into the languages of all participants. Also, entries expressed in ‘discipline jargon’ will have to be translated into conversational language.

g. Entries will have to be checked for duplication of essentially identical content expressed in different words (to avoid counting the same content twice in evaluation procedures).

h. Standard information search (‘googling’) for available pertinent information already documented by existing research, data bases, case studies etc. This will require the selection of search terms and the assessment of the relevance of found items, which are then entered into a separate section of the ‘verbatim’ file.

i. Entered items (verbal contributions and researched material) will have to be formatted for evaluation; arguments with unstated (‘taken for granted’) premises must be completed with all premises stated explicitly; evaluation aspects, sub-aspects etc. must be ordered into coherent ‘aspect trees’. (Optional: information claims found in searches may be combined to form ‘new’ arguments that have not been made by human participants.)

j. Argument patterns (inference rules) of arguments will have to be identified and checked (to alert participants to validity problems and contradictions).

k. Normalization of weight assignments and aggregation of judgments will have to be prepared and displayed, showing the results of different aggregation functions as well as their effects on different decision criteria (a minimal sketch of this step follows after item l below).

l. More sophisticated support examples would be the development of systems models of the ‘system’ at hand, (for example, constructing cause-effect connections and loops for the factual-instrumental premises in arguments) to predict performance of proposed solutions, to simulate the behavior of the resulting system in its environment over time.
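
As a very small illustration of task k, the following sketch normalizes a set of raw importance weights and shows how two different aggregation functions produce different overall scores from the same partial judgments; the aspect names, numbers and the two particular functions are invented for illustration only.

def normalize_weights(raw_weights):
    """Rescale raw importance weights so they sum to 1.0 (task k)."""
    total = sum(raw_weights.values())
    return {aspect: w / total for aspect, w in raw_weights.items()}

def aggregate(scores, weights, how="weighted_sum"):
    """Different aggregation functions give different overall scores
    from the same partial judgments (illustrative only)."""
    if how == "weighted_sum":
        return sum(scores[a] * w for a, w in weights.items())
    if how == "minimum":            # 'weakest link' attitude
        return min(scores.values())
    raise ValueError(how)

weights = normalize_weights({"cost": 3, "safety": 5, "image": 2})
scores  = {"cost": 2.0, "safety": -1.0, "image": 4.0}   # -5..+5 judgments
for how in ("weighted_sum", "minimum"):
    print(how, round(aggregate(scores, weights, how), 2))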

The boundary between human and machine judgments

It should be clear from preceding sections that general algorithms should not be used to generate evaluative judgments (unless there are criteria expressed in regulations, laws, or norms that expressly substitute for human judgment). Any calculated statistics of participant judgments should be clearly identified as ‘statistics’ of individuals’ judgments, not as ‘group judgments’. The boundary issue may be illustrated by examining the idea of complete ‘objectification’, or explanation, of a person’s basis of judgment, as in the ‘formal evaluation’ process explained in that segment. A complete description of a judgment basis would require describing the criterion functions for all aspect judgments, the weighting of all aspects and sub-aspects etc., and the estimates of plausibility (probability) for a plan to meet the performance expectations involved. This would allow a person A to make judgments on behalf of another person B, while not necessarily sharing B’s basis of judgment. Imagining a computer doing the same thing is meaningful only if all those values of B’s judgment basis can be given to the computer. The judgments would then be ‘deliberated’ and fully explained (though not necessarily justified or mandatory for all to share).
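
What such a fully ‘objectified’ judgment basis would have to contain can be sketched as follows; the data structure, criterion functions, weights, plausibilities and performance figures are all invented for illustration, not a proposal for an actual format.

# Hypothetical, fully 'objectified' judgment basis for a person B:
# criterion functions, aspect weights (summing to 1), and plausibility
# qualifiers. Anyone (or any machine) given this data could reproduce
# B's overall score without sharing B's values.
judgment_basis_B = {
    "cost":   {"fn": lambda x: max(-5.0, 5.0 - x / 50_000.0), "weight": 0.4, "pl": 0.8},
    "safety": {"fn": lambda x: min(5.0, x - 3.0),             "weight": 0.6, "pl": 0.6},
}

def overall_judgment(basis, performance):
    """Weighted sum of plausibility-adjusted aspect judgments."""
    return sum(a["weight"] * a["pl"] * a["fn"](performance[name])
               for name, a in basis.items())

# Predicted performance of one plan on B's criteria (invented numbers):
print(round(overall_judgment(judgment_basis_B, {"cost": 150_000, "safety": 7.0}), 2))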

In practice, doing that even for another person is too cumbersome to be realistic. People usually shortcut such complete objectification, making decisions with ‘offhand’ intuitive judgments that they do not or cannot explain. That step cannot be performed by a machine, by definition: the machine must base its simulation of our judgment basis on some explanation. (Admittedly, it could be simulating the human equivalent of tossing a coin: deciding randomly, though most humans would resent having their intuitive judgments called ‘random’.) And vague reference is usually made to ‘common sense’ or otherwise societally accepted values, obscuring and sidestepping the problem of dealing with the reality of significantly different values and opinions.

Where would the machine get the information for making such judgments, if not from a human? Any algorithm for this would be written by a human programmer, including the specifics for obtaining the ‘factual’ information needed to develop even the crudest criterion function. A common AI argument is that the machine can be designed to observe (gather the needed factual information) and ‘learn’ to assemble a basis of judgment, for measurable and predictable objectives such as ‘growth’ or stability (survival) of the system. The trouble is that the ‘facts’ involved in evaluating the performance and advisability of plans are not ‘facts’ at all: they are estimates, predictions of future facts, so they cannot be ‘observed’ but must be extrapolated from past observations by means of some program. And we can deceive ourselves into accepting information about the desirability of ‘ought’ or ‘goodness’ aspects of a plan as ‘factual’ data only by looking at statistics (also extrapolated into the future) or at legal requirements — which must have been adopted by some human agent or agency.

To be sure: these observations are not intended to dismiss the usefulness of AI (that should be called augmented intelligence) for the planning discourse. They are trying to call attention to the question of where to draw the boundary between human and machine ‘judgment’. Ignoring this issue can easily lead to development of processes in which machine ‘judgment’ — presented to the public as non-partisan, ‘objective’, and therefore more ‘correct’ than human decisions, but inevitably programmed to represent some party’s intentions and values — can become sources of serious mistakes, and tools of oppression. This brief sketch can only serve as encouragement to more thorough discussion.


— o —

EVALUATION IN THE PLANNING DISCOURSE — THE DIMINISHING PLAUSIBILITY PARADOX

Thorbjørn Mann,  February 2020

THE DIMINISHING PLAUSIBILITY PARADOX

Does thorough deliberation increase or decrease confidence in the decision?

There is a curious effect of careful evaluation and deliberation that may appear paradoxical to people involved in planning decision-making, who expect such efforts to lead to greater certainty and confidence in the validity of their decisions. There are even consulting approaches that derive measures of such confidence from the ‘breadth’ and ‘depth’ achieved in the discourse.

The effect is the observation that with a well-intentioned, honest effort to give due consideration and even systematic evaluation to all concerns — as expressed e.g. by the pros and cons of proposed plans perceived by affected and experienced people — the degree of certainty or plausibility for a proposed plan actually seems to decrease, or move towards a central ‘don’t know’ point on a +1 to -1 plausibility scale. Specifically: the more carefully breadth (meaning coverage of the entire range of aspects or concerns) and depth (understood as the thorough examination of the support — evidence and supporting arguments — for the premises of each ‘pro’ and ‘con’ argument) are evaluated, the more the degree of confidence felt by evaluators moves from initial high support (or opposition) towards the central point ‘zero’ on the scale, meaning ‘don’t know; can’t decide’.

This is, of course, the opposite of what the advice to ‘carefully evaluate the pros and cons’ seems to promise, and of what approaches striving for breadth and depth actually appear to achieve. This creates a suspicion that either the method for measuring the plausibility of all the pros and cons must be faulty, or that the approaches relying on the degree of breadth and depth directly as equivalent to greater support are making mistakes. So it seems necessary to take a closer look at this apparently counterintuitive phenomenon.

The effect was first observed in the course of the review for journal publication of an article on the structure and evaluation of planning arguments [1] — several reviewers pointed out what they thought must be a flawed method of calculation.

Explanation of the effect

The crucial steps of the method (also explained in the section on planning argument assessment) are the following:

– All pro and con arguments are converted from their often incomplete, missing-premises state to the complete pattern explicitly stating all premises (e.g. “Yes, adopt plan A because 1) A will lead to effect B given conditions C, and 2) B ought to be aimed for, and 3) conditions C will be present”).

– Each participant will assign plausibility judgments to each premise, on the +1/-1 scale where +1 stands for complete certainty or plausibility, -1 for complete certainty that the claim is not true or totally implausible (in the judgment of the individual participant), and the center point of zero expresses inability to judge: ‘don’t know; can’t decide’. Since in the planning argument all premises are estimates or expectations of future states — effects of the plan, applicability of the causal rule that connects future effects or ‘consequences’ with actions of the plan, and the desirability or undesirability of those consequences — complete certainty assessments (pl = +1 or -1) for the premises must be considered unreasonable; so all the plausibility values will be somewhere between those extremes.

– Deriving a plausibility value for the entire argument from these plausibility judgments can be done in different ways. One extreme is to assign the lowest premise plausibility judgment prempl to the entire argument, expressing an attitude like ‘the strength of a chain is equal to the strength of its weakest link’. Or the plausibility values can be multiplied, giving the argument plausibility for argument i:

            Argpl(i) = ∏j prempl(i,j)   (the product over all premises j of argument i)

Either way, the resulting argument plausibility cannot be higher than the premise plausibilities.

– Since arguments do not carry the same ‘weight’ in determining the overall plausibility judgment, it is necessary to assign some weight factor to each argument plausibility judgment. That weight will depend on the relative importance of the ‘deontic’ (ought) premises, and can be approximately expressed by assigning each of the deontic claims in all the arguments a weight between zero and +1, such that all the weights add up to +1. So the weight of argument i will be the plausibility of argument i times the weight of its deontic premise: Argw(i) = Argpl(i) * w(i).

– A plausibility value for the entire plan will have to be calculated from all the argument weights. Again, there are different ways to do that (discussed in the section on aggregation), but an aggregation function such as adding all the argument weights (as derived in the preceding steps) will yield a plan plausibility value on the same scale as the initial premise and argument plausibility judgments. It will also be the result of considering all the arguments, both pro and con; and since the weights of arguments considered ‘con’ arguments in the view of individual participants will be subtracted from the summed-up weight of the ‘pro’ arguments, it will be nowhere near the complete certainty value of +1 or -1, unless of course the process revealed that there were no arguments carrying any weight at all on the pro or con side. That is unlikely, since all plans have been conceived from some expectation of generating some benefit, and will carry some cost or effort, etc.
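
A compact sketch of these steps in Python may make the ‘diminishing’ effect visible; the argument list, premise plausibilities and deontic weights are invented, and the ‘con’ argument is expressed by a negative deontic plausibility, as in the method described above.

def plan_plausibility(arguments):
    """Breadth-only plan plausibility: sum of weighted argument
    plausibilities. Each argument is (list of premise plausibilities,
    weight of its deontic premise); weights are assumed to sum to 1.
    'Con' arguments enter through negative deontic plausibilities."""
    total = 0.0
    for premise_pls, weight in arguments:
        arg_pl = 1.0
        for pl in premise_pls:
            arg_pl *= pl
        total += arg_pl * weight
    return total

# Invented example: two 'pro' and one 'con' argument, honestly judged
# with less-than-certain premises.
args = [
    ([0.8, 0.7, 0.9], 0.4),    # pro
    ([0.6, 0.5, 0.8], 0.3),    # pro
    ([0.7, -0.6, 0.9], 0.3),   # con: undesirable side effect
]
print(round(plan_plausibility(args), 2))   # -> 0.16, far from +1 'certainty'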

This approach as described thus far can be considered a ‘breadth-only’ assessment, justly so if there is no effort to examine the degree of support of the premises. But of course the same reasoning can be applied to any of the premises, to any degree of ‘depth’ demanded by participants from each other. The effect of the overall plan plausibility tending toward the center point of zero (‘don’t know’ or ‘undecided’), compared with initial offhand convinced ‘yes: apply the plan!’ or ‘no: reject!’ reactions, will be the same — unless there are completely ‘principle’-based or logical or physical ‘impossibility’ considerations, in plans that arguably should not even have reached the stage of collective decision-making.
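
One possible way (not the only one, and not prescribed here) to carry the same reasoning into ‘depth’ is to treat the plausibility of a supported premise as the product of its own offhand judgment and the plausibilities of its supporting claims, recursively; the small sketch below assumes exactly that rule with invented numbers, and shows how examining support honestly tends to pull plausibility toward zero.

def premise_plausibility(node):
    """Recursive 'depth' assessment: a premise with no examined support
    keeps its offhand plausibility; a supported premise takes the
    product of its own offhand judgment and the plausibilities of its
    supporting claims (one possible rule among several)."""
    pl = node["pl"]
    for support in node.get("support", []):
        pl *= premise_plausibility(support)
    return pl

# An offhand 0.9 premise, resting on two supports, one of which is
# itself only weakly supported:
premise = {"pl": 0.9, "support": [
    {"pl": 0.8},
    {"pl": 0.7, "support": [{"pl": 0.5}]},
]}
print(round(premise_plausibility(premise), 3))   # -> 0.252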

Explanation of the opposite effect in ‘breadth/depth’ based approaches

So what distinguishes this method from approaches that claim to use degrees of ‘breadth and depth’ deliberation as measures justifying the resulting plan decisions, and that in the process increase the team’s confidence in the ‘rightness’ of its decision?

One obvious difference — one that must be considered a definite flaw — is that the degree of deliberation, measured by the mere number of comments or arguments, of ‘breadth’ or ‘depth’, does not include any assessment of the plausibility (positive or negative) of the claims involved, nor of their weights of relative importance. Just having talked about a number of considerations, without that distinction, cannot by itself be a valid basis for decisions, even if Popper’s advice about the degree of confidence we are entitled to hold in scientific hypotheses is not considered applicable to design and planning. (“We are entitled to tentatively accept a hypothesis to the extent we have given our best effort to test it, to refute it, and it has withstood all those tests”…)

Sure: in planning we don’t have ‘tests’ that definitively refute a hypothesis (or ‘null hypothesis’) and that we would only have to apply as best we can; and planning decisions don’t stand or fall on the strength of single arguments or hypotheses. All we have are arguments explaining our expectations, speculations about the future resulting from our planning actions. But we can adapt Popper’s advice to planning: “We can accept a plan as tentatively justified to the extent we have tried our best to expose it to counterarguments (cons) and have seen that those arguments are either flawed (not sufficiently plausible) or outweighed by the arguments in its favor.”

And if we do this, honestly admitting that we really can’t be very certain about all the claims that go into the arguments, pro or con, and look at how all those uncertainties come together in totaling up the overall plausibility of the plan, the tendency of that plausibility to go towards the center point of the scale looks more reasonable.

Could these considerations be the key to understanding why approaches relying on mere breadth and depth measurements may result in increased confidence of the participants in such projects? There are two kinds of extreme situations in which it is likely that even extensive breadth-and-depth discussions can ignore or marginalize one side or the other of the necessary ‘pro’ or ‘con’ arguments.

One is the typical ‘problem-solving’ team assembled for the purpose of developing a ‘solution’ or recommendation. The enthusiasm of the collective creative effort itself (but possibly also the often-invoked ‘positive thinking’, the advice to defer judgment so as not to disrupt the creative momentum, as well as the expectation of a ‘consensus’ decision?) may focus the thinking of team members on ‘pro’ arguments justifying the emerging plan — but neglecting or diverting attention from counterarguments. Finding sufficiently good reasons for the plan is then taken to be enough to make a decision.

An opposite type of situation is the ‘protest’ demonstration, or events arranged for the express purpose of opposing a plan: disgruntled citizens, outraged by how a big project will change their neighborhood, counting up all the damaging effects. Must we not assume that there will be a strong focus on highlighting the plan’s negative effects or potential consequences: assembling a strong enough ‘case’ to reject it? In both cases, there may be considerable and even reasonable deliberation in breadth and depth involved — but also possible bias due to neglect of the other side’s arguments.

Implications of the possibility of decreasing plan plausibility?

So, pending some more research into this phenomenon — if it is found to be common enough to worry about — it may be useful to look at what it means: what adjustments to common practice it would suggest, and what ‘side-stepping’ stratagems may have evolved due to the mere sentiment that more deliberation might shake undue, undeserved expectations placed in a plan. Otherwise, cynical observers might recommend throwing up our hands and leaving the decision to the wisdom of ‘leaders’ of one kind or another, in the extreme to oracle-like devices — artificial intelligence from algorithms whose rationales remain as unintelligible to the lay person as the medieval ‘divine judgment’ validated by mysterious rituals (but otherwise amounting to tossing coins?).

Besides the above-mentioned research into the question, a first step would be to examine common approaches on the consulting market for provisions that might overplay this tendency. For example, adding plausibility assessment to the approaches using depth and breadth criteria would be necessary to make them more meaningful.

The introduction of more citizen participation into the public planning process is an increasingly common move. It has been urged — among other undeniable advantages, such as getting better information about how problems and the plans proposed to solve them actually affect people — as a way to make plans more acceptable to the public, because the plans are then felt to be more ‘their own’. As such, could it make the process vulnerable to the first fallacy above, of overlooking negative features? If so, the same remedy of including more systematic evaluation in the process might be considered.

A common temptation of promoters of ‘big’ plans can’t be overlooked: to resort to ‘big’ arguments that are so difficult to evaluate that made-up ‘supporting’ evidence can’t be distinguished from predictions based on better data and analysis (following Machiavelli’s quip about ‘the bigger the lie, the more likely people will buy it’…). Many people are already suggesting that we should return to smaller (local) governance entities that can’t offer big lies.

Again: this issue calls for more research.

[1] Mann, Thorbjoern: “The Structure and Evaluation of Planning Arguments”, INFORMAL LOGIC, Dec. 2010.

— o —

EVALUATION IN THE PLANNING DISCOURSE — PROCEDURAL AGREEMENTS

An effort to clarify the role of deliberative evaluation in the planning and policy-making process.  Thorbjørn Mann,  February 2020

PROCEDURAL AGREEMENTS FOR EVALUATION

The need for procedural agreements

Any group, team or assembly having decided to embark upon a common evaluation / deliberation task aimed at a recommendation or decision about a plan, will have to adopt a set of agreements about the procedure to be followed, explicitly or implicitly. These rules can become quite detailed and complicated. Even the familiar ‘rules of order’ of standard parliamentary procedure, aiming at simple yea/nay decisions on ‘motions’ for the assembly to accept or reject, will become book-length guides (like ‘Robert’s Rules of Order’) that the chairpersons of such processes may have to consult when disputes arise. For simplified versions based on the expected simplicity of ending the discussions with a majority vote, and citizens’ familiarity with basic rules, agreements can even be tacitly taken for granted, without recourse to written guides. However, this no longer applies when the decision-making body engages in more detailed and systematic deliberation aiming at making the decisions more transparently justified by the evaluative judgments made on the comments in the discourse.

General overall agreements versus procedures for ‘special techniques’

This could be seen as a call for a general procedure that includes the necessary procedural rules, as an extension of the familiar parliamentary procedure. Would such a one-size-fits-all solution be appropriate? As the preceding sections of this study show, we now see not only a great variety of different evaluation tasks and context situations, but also a variety of different ‘approaches’ for such processes on the ‘market’ — especially as they are assisted by new technology. Each one comes with different assumptions about the rules or ‘procedural agreements’ guiding the process. So the question seems to be less one of developing and adopting one general-purpose pattern than one of providing a ‘toolkit’ of different approaches that the participants in a planning process could choose from as the task at hand requires. That opportunity-step for choice must be embedded in a general and flexible overall process that participants either are familiar with already, or can easily learn and agree to.

Once a special technique is selected, as decided by the group, its procedural steps and decision rules should be explicitly agreed upon at the very beginning of the specific process — the more so, the ‘newer’ the approach, tools and techniques — so as to avoid disruption of the actual deliberation by disagreements about procedure later on. Such quibbles could easily become quite destructive and polarizing, and even their in-process resolution can introduce significant bias into the actual assessment work itself. It may be necessary to change some rules as the participants learn more about the nature of the problem at hand; that process should be governed by rules set out in the initial agreements. A provision such as the ‘Next step’ proposed in the process for the overall planning discourse platform would offer that opportunity (see ‘PDSS-REVISED’).

This seemingly matter-of-course step can become controversial because different ‘special techniques’ may involve different concepts and corresponding vocabulary to be used: even ‘systems’ approaches of different ‘generations’ are likely to use different labels for essentially the same things, which can result in miscommunication and misunderstanding or worse. New techniques and tools may require different responsibilities, behavior, decision modes, replacing rules still taken for granted: must new agreements be set ‘upfront’ to prevent later conflicts?

The main agreements — possibly different rules for different project types — will then cover the basic procedural steps; the ‘stopping rules’ for deciding when a decision can be said to have been accepted (since one of the key properties of ‘wicked problems’ is that there is nothing in the nature of the problem itself that tells problem-solvers that a solution has been reached and the work can stop); and the decision criteria and modes according to which this should be done. For the details of the evaluation part itself, the kinds of judgments and judgment scales will have to be agreed upon, so that e.g. a judgment score will have the same meaning for all participants. (These issues are addressed in separate sections.)

An argument can be made that efforts should be made to preserve consistency between the overall approach, with its frame of reference and vocabulary, and any ‘special techniques’ for evaluation used within that process along the way.

Doing without cumbersome procedural rules?

There will be attempts to escape procedures felt to be too ‘cumbersome’ or bureaucratic, with an easier route to a decision. Majority voting itself can be seen as such an escape. Even easier are decision criteria such as ‘consent’ — declared, for example, by the chair that there are ‘no more objections’ combined with ‘time’s up’ — which may indicate that the congregation has become exhausted, rather than convinced of the advantages of a proposed plan, or dissuaded from voicing more ‘critical’ questions. But aren’t the conditions leading to ‘consent’ outcomes in some approaches — group size, seating arrangements, sequences of steps and phases — themselves procedural provisions?

Examples of aspects calling for agreements

Examples of different procedural agreements are the above-mentioned ‘rules of order’, the steps for determining the ‘Benefit/Cost Ratio’ of plans; provisions for ‘formal evaluation’ process of the ‘quality’ of a proposed plan or for the evaluation of a set of alternative proposals; agreements needed for evaluating the plausibility of a plan by systematic assessment of argument plausibility; the guides for a ‘Pattern Language’ approach to planning. (Some of these will be described in separate segments).

The procedural agreements cover aspects such as the following:
– The conceptual frame of reference and its vocabulary and corresponding techniques and displays;
– Proper ‘etiquette’ and behavior;
– The process steps (sequence), participant rights and responsibilities;
– Formatting of entries as needed for evaluation;
– For the evaluation tasks: judgment scales and units, the meaning of the scores;
– The aggregation functions to be used to derive overall judgments from partial judgment scores and from individual participant scores to ‘group’ statistics and decision rules;
– Decision criteria and decision modes;
– The stopping rule(s) for the process.

Specific agreements for different evaluation ‘approaches’ and special techniques must then be discussed in the sections describing those methods.


–o–

EVALUATION IN THE PLANNING DISCOURSE — TIME AND EVALUATION OF PLANS

An effort to clarify the role of deliberative evaluation in the planning and policy-making process. Thorbjørn Mann, February 2020

TIME AND EVALUATION OF PLANS  (Draft, for discussion)

Inadequate attention to time in current common assessment approaches

Considering that evaluations of plans (especially ‘strategic’ plans) and policy proposals are by their very nature concerned with the future, it is curious that the role of time has not received more attention, even with the development of simulation techniques that aim at tracking the behavior of key variables of systems over many years into the future. The neglect of this question, for example in the education of architects, can be seen in the practice of judging students’ design project presentations on the basis of their drawings and models.

The exceptions — for example in building and engineering economics — look at very few performance variables, with quite sophisticated techniques: expected cost of building projects, ‘life cycle cost’, return on investment etc., to be put into relation to expected revenues and profit. Techniques such as ‘Benefit/Cost Analysis‘, which in its simplest form considers those variables as realized immediately upon implementation, can also be applied to forecasting costs and benefits and comparing them over time, using methods for converting initial amounts (of money) to ‘annualized’ or future equivalents, or vice versa.
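
A minimal sketch of such a conversion (present value of a stream of future amounts at a chosen discount rate) may help make the point that the result depends heavily on the ‘personal’ discount rate; the cost and benefit figures and the five-year horizon are invented for illustration.

def present_value(amount, years, rate):
    """Discount a single future amount to its present-value equivalent."""
    return amount / (1.0 + rate) ** years

def present_value_of_stream(annual_amounts, rate):
    """Present value of a stream of annual amounts over a planning horizon."""
    return sum(present_value(a, t + 1, rate)
               for t, a in enumerate(annual_amounts))

# Invented figures: initial cost now, annual benefits over a 5-year
# horizon, compared under two different ('personal') discount rates.
initial_cost = 100_000
benefits = [30_000] * 5
for rate in (0.03, 0.10):
    net = present_value_of_stream(benefits, rate) - initial_cost
    print(f"discount rate {rate:.0%}: net present value {net:,.0f}")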

Criticism of such approaches amounts to pointing out problems such as having to convert ‘intangible’ performance aspects (like public health, satisfaction, loss of lives) into money amounts to be compared (raising serious ethical questions), or the fact that, for entities like nations, the money amounts drawn from or entering the national budget hide controversies such as inequities in the distribution of the costs and benefits. Looking at the issue from the point of view of other evaluation approaches might at least identify the challenges in the consideration of time in the assessment of plans, and help guide the development of better tools.

A first point to note is that, from the perspective of the formal evaluation process (see e.g. the previous section on the Musso/Rittel approach), measures like the present value of future cost or profit, or the benefit-cost ratio, must be considered ‘criteria’ (measures of performance) for more general evaluation aspects, among a set of (goodness) evaluation aspects that each evaluator must weight for relative importance to make up overall ‘goodness’ or quality judgments. (See the segments on evaluation judgments, criteria and criterion functions, and aggregation.) As such, the use of these measures alone as decision criteria must be considered incomplete and inappropriate. However, in those approaches the time factor is usually not treated with even the attention expressed in the above tools for discounting future costs and benefits to comparable present worth: for example, pro or con arguments in a live verbal discussion about expected economic performance often amount to mere qualitative comparisons or claims like ‘over the budget’ or ‘more expensive in the long run’.

Finally, in approaches such as the Pattern language, (which makes valuable observations about ‘timeless’ quality of built environments, but does not consider explicit evaluation a necessary part of the process of generating such environments), there is no mention or discussion of how time considerations might influence decisions: the quality of designs is guaranteed by having been generated by the use of patterns, but the efforts to describe that quality do not include consideration of effects of solutions over time.

Time aspects calling for attention in planning

Assessments of undesirable present or future states ‘if nothing is done’

The implementation of a plan is expected to bring about changes in states of affairs that are felt to be ‘problems’ — things not being as they ought to be — or ‘challenges’ and ‘opportunities’ calling for better, improved states of affairs. Many plans and policies aim at preventing future developments from occurring, either as distinctly ‘sudden’ events or as developments over time. Obviously, the degree of undesirability depends on the expected severity of these developments; they are matters of degree that must be predicted in order for the plan’s effectiveness to be judged.

The knowledge that goes into estimates of future change comes from experience: observation of the pattern and rate of change in the past (even if that knowledge is taken to be well enough established to be considered a ‘law’). But not all such change tracks have been well enough observed and recorded, so much estimation and judgment already goes into the assumptions about changes over time in the past.

Individual assessments of future plan performance

Our forecasts of future changes ‘if nothing is done’, resting on such shaky past knowledge, must be considered less than 100% reliable. Should our confidence in the application of that knowledge to estimates of a plan’s future ‘performance‘ then not be acknowledged as equally (at best) or arguably less certain — expressed as deserving a lower ‘plausibility’ qualifier? This would be expressed, for example, with the pl (plausibility) judgment for the relationship claimed in the factual-instrumental premise of an argument about the desirability of the plan’s effects: “Plan A will result (by virtue of the law or causal relationship R) in producing effect B”.

This argument should be (but is often not) qualified by adding the assumption ‘given the conditions C under which the relationship R will hold’: the conditions which the third (factual claim) premise of the ‘standard planning argument’ claims is — or will be — ‘given’.

Note: ‘Will be’: since the plan will be implemented in the future, this premise also involves a prediction. And to the extent the condition is not a stable, unchanging one but also a changing, evolving phenomenon, the degree of the desirable or undesirable effect B must be expected to change. And, to make things even more interesting and complex: as explained in the sections on argument assessment and systems modeling: the ‘condition’ is never adequately described by a single variable, but actually represents the  evolving state of the entire ‘system’ in which the plan will intervene.

This means that when two people exchange their assumptions and judgments, opinions, about the effectiveness of the plan by citing its effect on B, they may well have very different degrees (or performance measures) in mind, occurring under very different assumptions about both R and C, and at different times.

Things become fuzzier when we consider the likelihood that the desired or undesired effects are not expected to change things overnight, but gradually, over time. So how should we make evaluation judgments about competing plan alternatives when, for example, one plan promises rapid improvement soon after implementation (as measured by one criterion) but then slows down or even starts declining, while the other will improve at a much slower but more consistent rate? A mutually consistent evaluation must be based on agreed-upon measures of performance: measured at what future time? Over what future time period, aka ‘planning horizon’? And this question applies just to the prediction of the performance criterion — what about the plausibility and weight-of-importance judgments we need for a complete explanation of our judgment basis? Is it enough to apply the same plausibility factor to forecasts of trends decades in the future as the one we use for near-future predictions? As discussed in the segment on criteria, the crisp, fine forecast lines we see in simulation printouts are misleading: the line should really be a fuzzy track, widening more and more the farther out in time it extends. Likewise: is it meaningful to use the same weight of relative importance for the assessment of effects at different times?
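
As a purely illustrative sketch of what such a time qualifier could look like, the following turns a crisp simulation track into a widening band by letting an assumed plausibility factor decline with time; the linear decay and the proportional band width are invented assumptions meant only to make the idea visible, not a forecasting method.

def forecast_band(point_forecasts, initial_pl=0.9, decay=0.05):
    """Turn a crisp simulation track into a 'fuzzy' band whose width
    grows as the (assumed) plausibility of the forecast declines with
    time. Both the linear decay and the proportional spread are
    illustrative assumptions."""
    band = []
    for t, value in enumerate(point_forecasts):
        pl = max(0.0, initial_pl - decay * t)    # declining confidence
        spread = abs(value) * (1.0 - pl)         # wider band, lower pl
        band.append((value - spread, value + spread, round(pl, 2)))
    return band

track = [100, 110, 125, 145, 170]   # crisp printout of a simulation
for low, high, pl in forecast_band(track):
    print(f"pl={pl}: {low:.0f} .. {high:.0f}")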

These considerations apply, so far, only to the explanation of individual judgments, and already show that it would be almost impossible to construct meaningful criterion functions and aggregation functions to get adequately ‘objectified’ overall deliberated judgment scores for individual participants in evaluation procedures.

Aggregation issues for group judgment indicators

The time-assessment difficulties described for individual judgments do not diminish in the task of constructing decision guides for groups, based on the results of individual judgment scores. Reminder: to meet the ideal ‘democratic’ expectation that the community decision about a plan should be based on due consideration of ‘all’ concerns expressed by ‘all’ affected parties, the guiding indicator (‘decision guide’ or criterion) should be an appropriate aggregation statistic of all individual overall judgments. The above considerations show, to put it mildly, that it would be difficult enough to aggregate individual judgments into overall judgment scores, but even more so to construct group indicators that are based on the same assumptions about the time qualifiers entering the assessments.

This makes it understandable (but not excusable) that decision-makers in practice tend either to screen out the uncomfortable questions about time in their judgments, or to resort to vague ‘goals’ measured by vague criteria to be achieved within arbitrary time periods: “carbon-emission neutrality by 2050”, for example. How is one to choose between different plans or policies whose performance simulation forecasts do not promise 100% achievement of the goal, but only ‘approximations’ with different interim performance tracks, at different costs and with other side-effects in society? But 2050 is far enough in the future to ensure that none of the decision-makers for today’s plans will be held responsible for today’s decisions…

‘Conclusions’?

The term ‘conclusion’ is obviously inappropriate if it suggests settled answers to the questions discussed. These issues have only been raised, not resolved, which means that more research, experiment and discussion are called for to find better answers and tools. For the time being, the best recommendation that can be drawn from this brief exploration is that the decision-makers for today’s plans should routinely be alerted to these difficulties before making decisions, carry out the ‘objectification’ process for the concerns expressed in the discourse (facilitating, of course, discourse with wide participation adequate to the severity of the challenge of the project), and then admit that any high degree of ‘certainty‘ for the proposed decisions is not justified. Decisions about ‘wicked problems’ are more like ‘gambles’, for which responsibility, ‘accountability’, must be assumed. If official decision-makers cannot assume that responsibility — as expressed in ‘paying’ for mistaken decisions — should they seek supporters to share that responsibility?

So far, this kind of talk is just that: mere talk, since there is at best only the vague and hardly measurable ‘reputation’ available as the ‘account‘ from which ‘payment‘ can be made — in the next election, or in the history books. This does not prevent reckless mistakes in planning decisions; there should be better means for making the concept of ‘accountability’ more meaningful. (Some suggestions for this are sketched in the sections on the use of ‘discourse contribution credit points’, earned by decision-makers or contributed by supporters from their credit point accounts, and made the required form of ‘investment payment’ for decisions.) The needed research and discussion of these issues will have to consider new connections between the factors involved in evaluation for public planning.



— o —

EVALUATION IN THE PLANNING DISCOURSE — SYSTEMS THINKING, MODELING AND EVALUATION IN PLANNING

An effort to clarify the role of deliberative evaluation in the planning and policy-making process. Thorbjørn Mann , February 2020. (DRAFT)

SYSTEMS THINKING / MODELING AND EVALUATION IN PLANNING

 

Evaluation and Systems in Planning  — Overview

The contribution of systems perspective and tools to planning.

In just about any discourse about improving approaches to planning and policy-making, there will be claims containing references to ‘systems’: ‘systems thinking’, ‘systems modeling and simulation’, the need to understand ‘the whole system’, the counterintuitive behavior of systems. Systems thinking as a whole mental framework is described as ‘humanity’s currently best tool’ for dealing with its problems and challenges. There are by now so many variations, sub-disciplines, approaches and techniques, even definitions, of systems and systems approaches on the academic as well as the consulting market that even a cursory description of this field would become a book-length project.

The focus here is the much narrower issue of the relationship between this ‘systems perspective’ and various evaluation tasks in the planning discourse. This sketch will necessarily be quite general, not doing adequate justice to many specific ‘brands’ of systems theory and practice. However, looking at the subject from the planning / evaluation perspective will identify some significant issues that call for more discussion.

Evaluation judgments at many stages of systems projects and planning

A survey of many ‘systems’ contributions reveals that ‘evaluation’ judgments are made at many stages of projects claiming to take a systems view – much like the finding that evaluation takes place at the various stages of planning projects whether explicitly guided by systems views or not. Those judgments are often not even acknowledged as ‘evaluation’, and they follow very different patterns of evaluation (as described in the sections exploring the variety of evaluation judgment types and procedures).

The similar aims of systems thinking and evaluation in planning

Systems practitioners feel that their work contributes well (or ‘better’ than other approaches) to the general aims of planning, such as:
– to understand the ‘problem’ that initiates planning efforts;
– to understand the ‘system’ affected by the problem, as well as
– the larger ‘context’ or ‘environment’ system of the project;
– to understand the relationships between the components and agents, especially the ‘loops’ of such relationships that generate the often counterintuitive and complex systems behavior;
– to understand and predict the effects (costs, benefits, risks) and performance of proposed interventions in those systems (‘solution’) over time; both ‘desired’ outcomes and potentially ‘undesirable’ or even unexpected side-and after-effects;
– to help planners develop ‘good’ plan proposals,
– and to reach recommendations and/or decisions about plan proposals that are based on due consideration of all concerns for parties affected by the problem and proposed solutions, and of the merit of ‘all’ the information, contributions, insights and understanding brought into the process.
– To the extent that those decisions and their rationale must be communicated to the community for acceptance, these investigations and judgment processes should be represented in transparent, accountable form.

Judgment in early versus late stages of the process

Looking at these aims, it seems that ‘systems-guided’ projects tend to focus on the ‘early’ information (data) gathering and ‘understanding’ aspects of planning – more than on the decision-making activities. These ‘early’ activities do involve judgments of many kinds, aiming at understanding ‘reality’ based on the gathering and analysis of facts and data. The validity of these judgments is drawn from the standards of what may loosely be called ‘scientific method’: proper observation, measurement, statistical analysis. There is no doubt that systems modeling (looking at the components of the ‘whole’ system and the relationships between them) and the development of simulation techniques have greatly improved the degree of understanding both of problems and of the context that generates them, as well as the prediction of the effects (performance) of interventions: of ‘solutions’. Less attention seems to be given to the evaluation processes leading up to decisions in the later stages. Some justifications, or guiding attitudes, can be distinguished to explain this:

Solution quality versus procedure-based legitimization of decisions

One attitude, building on the ‘scientific method’ tools applied in the data-gathering and model-building phases, aims at finding ‘optimal’ (ideally, or at least ‘satisficing’) solutions described by performance measures from the models. Sophisticated computer-assisted models and simulations are used to do this, with the performance measures (which must be quantifiable in order to be calculated) derived from ‘client’ goal statements or from surveys of affected populations, interpreted by the model-building consultants: experts. On the one hand, their expert status is then used to assert the validity of the results. On the other hand, it is increasingly criticized for its lack of transparency to the lay populations affected by problems and plans, questioning the experts’ legitimacy to make judgments ‘on behalf of’ affected parties. If there are differences of opinion, conflicts about model assumptions, these are ‘settled’ – must be settled – by the model builders in order for the programs to yield consistent results.

This practice (which Rittel and other critics called the ‘first generation systems approach’) was seen as a superior alternative to traditional ways of generating planning decisions: the discussions in assemblies of people or their representatives, characterized by raising questions and debating the ‘pros and cons’ of proposed solutions, but then making decisions by majority voting or by accepting the decisions of designated or self-designated leaders. Both of these decision modes obviously do not meet all of the postulated expectations in the list above: voting implies dominance of the interests of the ‘majority’ and potential disregard of the concerns of the minority; leaders’ decisions can lack transparency (much like expert advice), leading to public distrust of the leader’s claim of having given due consideration to ‘all’ concerns affecting people.

There were then some efforts to develop procedures (e.g. formal evaluation procedures) or tools, such as the widely used but also widely criticized ‘Benefit-Cost’ analysis, that tried to extend the ‘calculation-based’ development of valid performance measures into the stage of decision criteria based on the assessment of solution quality. These were not as widely adopted, for various reasons such as their complicated and burdensome procedures, again requiring experts to facilitate the process and arguably making public participation more difficult. A different path is the tendency to make basic ‘quality’ considerations ‘mandatory’ in the form of regulations and laws, or ‘best practice’ standards. Apart from tending to set ‘minimum’ quality levels as requirements, e.g. for building permits, this represents a movement to combine quality-based planning decision-making with, or entirely replace it by, decisions that draw their legitimacy from having been generated by following accepted procedures.

This trend is visible, first, in approaches that specify procedures for generating solutions by using ‘valid’ solution components or features postulated by a theory (or by laws): having followed those steps then validates the solution generated and removes the necessity of carrying out any complicated evaluation procedure. An example of this is Alexander’s ‘Pattern Language’ – though the ‘systems’ aspect is not as prevalent in that approach. Interestingly, the same stratagem is visible in movements that focus on processes aimed at the mindsets of groups participating in special events, ‘increasing awareness’ of the nature and complexity of the ‘whole system’, but then rely on solutions ‘emerging’ from the resulting greater awareness and understanding, and aim at consensus acceptance in the group of the results generated, which then do not need further examination by more systematic, quantity-focused deliberation procedures. The invoked ‘whole system’ consideration, together with a claimed scientific understanding of the true reality of the situation calling for planning intervention, is part of inducing that acceptance and legitimacy. A telltale feature of these approaches is that debate, argument, and the reasoned scrutiny of supporting evidence involving opposing opinions tend to be avoided or ‘screened out’ in the procedures generating collective ‘swarm’ consensus.

The controversy surrounding the role of ‘subjective’, feeling-based, intuitive judgments versus ‘objective’ measurable, scientific facts (not just opinions) as the proper basis for planning decisions also affects the role of systems thinking contributions to the planning process.

None of the ‘systems’ issues related to evaluation in the planning process can be considered ‘settled’ and needing no further discussion. The very basic ‘systems’ diagrams and models of planning may need to be revised and expanded to address the role and significance of evaluation, as well as argumentation, the assessment of the merit of arguments and other contributions to the discourse, and the development of better decision modes for collective planning decision-making.

–o–

EVALUATION IN THE PLANNING DISCOURSE: PROCEDURE EXAMPLE 2: EVALUATION OF PLANNING ARGUMENTS


An effort to clarify the role of deliberative evaluation in the planning and policy-making process. Thorbjørn Mann, January 2020. (Draft)

PROCEDURE EXAMPLE 2:
EVALUATION OF PLANNING ARGUMENTS (PROS & CONS)

Argument evaluation in the planning discourse

Planning, like design, can be seen as an argumentative process (Rittel): ideas and proposals are generated, and questions are raised about them. The typical planning issues (especially the ‘deontic’ or ought-questions about what the plan ought to be and how it can be achieved) generate not only answers but arguments: the proverbial ‘pros and cons’. The information needed to make meaningful decisions, based on ‘due consideration’ of all concerns of all parties affected by the problem the plan is aiming to remedy as well as by any solution proposals, often comes mainly via those pros and cons. Taking this view seriously, it becomes necessary to address the question of how those arguments should be evaluated or ‘weighed’. After all, the arguments support contradictory conclusions (claims), so just ‘considering’ them is not quite enough.

Argumentation as a cooperative rather than adversarial interaction

The very concept of the ‘argumentative view of planning’ is somewhat controversial, because many people misunderstand ‘argument’ itself as a nasty, adversarial, combative, uncooperative phenomenon, a ‘quarrel’ (I have suggested the label ‘quarrgument’ for this). But ‘argument’ is originally understood as a set of claims (premises) that together support another claim, the ‘conclusion’. For planning, arguments are items of reasoning that explore the ‘pros and cons’ about plans; and an important underlying assumption is that we ‘argue’, that is, exchange arguments with others, because we believe that the other will accept or consider the position about the plan we are talking about, either because the other already believes or accepts the premises we offer, or will do so once we offer the additional support we have for them. It is unfortunate that even recent research on computer-assisted argumentation seems to be stuck in the ‘adversarial’ view of arguments, seeing arguments as ‘attacks’ on opposing positions rather than as a cooperative search for a good planning response to problems or visions for a better future.

‘Planning arguments’

There is another critical difference between the arguments discussed in traditional logic textbooks and the kinds I call ‘planning arguments’: the traditional concern of argumentation was to establish the truth or falsity of claims about the world, with the expectation that the discussion, the assessment of arguments, will ‘settle’ that question in favor of one side or the other. This does not apply to planning arguments: the planning decision does not rest on single ‘clinching’ arguments but on the assessment of the entire set of pros and cons. There are always real expected benefits and real expected costs, and as the proverbial saying has it, they must be ‘weighed’ against one another to lead to a decision. There has not been much concern about how that ‘weighing’ can or should be done, and how that process might lead to a reasoned judgment about whether to accept or reject a proposed plan. I have tried to develop a way to do this, a way to explain what our judgments are based on, beginning with an examination of the structure of ‘planning arguments’.

The structure of planning arguments and their different types of premises

I suggest that planning arguments can be represented in the following general ‘standard planning argument’ form, the simplest version being this ‘pro’ argument pattern:

Proposal ‘ought’ claim (‘conclusion’):  Proposal PLAN A ought to be adopted
because
1. Factual-instrumental premise:         Implementing PLAN A will lead to outcome B
                                                                     given conditions C
and
2. Deontic premise:                                  Outcome B ought to be pursued;
and
3. Factual premise:                                  Conditions C are (or will be) given.

This form is not ‘valid’ in the formal logic sense; it is considered ‘inconclusive’ and ‘defeasible’. There are usually many such pros and cons supporting or questioning a proposal: no single argument (other than evidence pointing out flaws of logical inconsistency or lack of feasibility, leading to rejection) will be sufficient to make a decision. Any evaluation of planning arguments therefore must be embedded in a ‘multi-criteria’ analysis and an aggregation of judgments into the overall decision.

It will become evident that all the judgments people make here are personal, ‘subjective’ judgments, not only about the deontic (ought) premise but even about the validity and salience of the ‘factual’ premises: they are all estimates about the future, not yet validated by observation and measurement.

The judgment types of planning argument premises:
‘plausibility’ and weight of importance

There are two kinds of judgments that will be needed. The first is an assessment of the ‘plausibility’ of each claim. The term ‘plausibility’ here includes the familiar ‘truth’ (or degree of certainty or probability about the truth of a claim) as well as the advisability, acceptability or desirability of the deontic claim. It can be expressed as a judgment on a scale, e.g. of -1 to +1, with -1 meaning complete implausibility, +1 expressing ‘total plausibility’ (virtual certainty), and the center point of zero meaning ‘don’t know, can’t judge’. The second is a judgment about the ‘weight of relative importance’ of the ‘ought’ aspect. It can be expressed, e.g., by a score between zero (meaning ‘totally unimportant’) and +1 (meaning ‘totally important’, overriding all other aspects); the sum of all the weights of deontic premises must be equal to +1.

Argument plausibility

The first step would be the assessment of the plausibility of each single argument as a whole, which would be a function of all three premise plausibility scores, resulting in an ‘argument plausibility’ score.

For example, an argument i with pl(1) = 0.5, pl(2) = 0.8, and pl(3) = 0.9 might get an argument plausibility of Argpl(i) = 0.5 x 0.8 x 0.9 = 0.36.
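As a minimal sketch (in Python, added here only for illustration and not part of the original procedure), this multiplication of premise plausibilities could look like the following; the function name is just a placeholder:

# Illustrative sketch: argument plausibility as the product of the
# premise plausibility scores, each judged on the -1..+1 scale.

def argument_plausibility(premise_plausibilities):
    result = 1.0
    for pl in premise_plausibilities:
        result *= pl  # a single implausible premise pulls the whole argument down
    return result

# Worked example from the text: pl(1) = 0.5, pl(2) = 0.8, pl(3) = 0.9
print(round(argument_plausibility([0.5, 0.8, 0.9]), 2))  # 0.36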

Argument weight of relative importance

The second step would be to assess the ‘argument weight’ of each argument, which can be done by multiplying the weight of relative importance of its deontic premise (premise 2 in the pattern above) with the argument plausibility:    Argw(i) = Argpl(i) x w(i).
The premise weight w(i) will again be a value between zero (meaning ‘totally unimportant’) and +1 (meaning ‘all-important’, i.e. overriding all other considerations). It should be the result of establishing a ‘tree’ of deontic concerns (similar to the ‘aspects’ of the ‘Formal evaluation’ procedure in procedure example 1) that gives each deontic claim its proper place as a main aspect, sub-aspect, sub-sub-aspect or ‘criterion’ in the aspect tree, and of assigning weights between 0 and 1 such that they add up to 1 at each level.

A deontic claim located at the second level of the aspect tree, having been assigned a weight of 0.8 at that level, and being a sub-aspect of a first-level aspect with a weight of 0.4 at its level, would have a premise weight of w = 0.8 x 0.4 = 0.32. With an argument plausibility of 0.36, the argument weight would be Argw(i) = 0.36 x 0.32 = 0.1152 (rounded to 0.12).
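A corresponding sketch for the argument weight, again only an illustration using the numbers of the example above (the aspect-tree branch weights 0.4 and 0.8 are assumed, not prescribed):

# Illustrative sketch: the premise weight is the product of the relative
# weights along the branch of the aspect tree; the argument weight is that
# premise weight multiplied by the argument plausibility.

def premise_weight(branch_weights):
    w = 1.0
    for level_weight in branch_weights:
        w *= level_weight
    return w

def argument_weight(arg_plausibility, branch_weights):
    return arg_plausibility * premise_weight(branch_weights)

print(round(premise_weight([0.4, 0.8]), 2))         # 0.32
print(round(argument_weight(0.36, [0.4, 0.8]), 2))  # 0.12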

Plan plausibility

All the argument weights could then be aggregated into an overall ‘plan plausibility’ score, for example by adding up all argument weights:
Planpl = ∑ Argw(i) for all argument weights i (of an individual participant)

Of course, there are other possible aggregation forms (see the sections on ‘Aggregation’ and ‘Decision Criteria’). Which one should be used in any specific case must be specified, that is, agreed upon, in the ‘procedural agreements’ governing each planning project.
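As a minimal sketch of the simple additive aggregation (the argument weights used here are hypothetical; other aggregation functions could be agreed upon instead):

# Illustrative sketch: one participant's plan plausibility as the sum of
# that participant's argument weights; 'con' arguments carry negative weights.

def plan_plausibility(argument_weights):
    return sum(argument_weights)

# Hypothetical weights: two 'pro' arguments and one 'con' argument
print(round(plan_plausibility([0.12, 0.05, -0.08]), 2))  # 0.09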

It should be noted that in a worksheet simply listing all arguments with their premises for plausibility and weight assignments, there is no need to identify arguments as ‘pro’ or ‘con’ as intended by their respective authors. Any argument given a negative premise plausibility by a participant will automatically end up with a negative argument weight and thus become a ‘con’ argument for that participant, even if it was intended by its author as a ‘pro’ argument. This makes it obvious that all such assessments are individual, subjective judgments, even if the factual and factual-instrumental premises of arguments are considered ‘objective-fact’ matters.

The process of evaluation of planning arguments within the overall discourse

The diagram below shows the argument assessment process as it will be embedded in an overall discourse. Its central feature is the ‘Next Step?’ decision, invoked after each major activity. It lets the participants in the effort decide — according to rules specified in those procedural agreements — how deeply into the deliberation process they wish to proceed: they could decide to go ahead with a decision after the first set of overall offhand judgments, skipping the detailed premise analysis and evaluation if they feel sufficiently certain about the plan.

Process of argument assessment within the overall discourse

The use of overall plan plausibility scores:
Group statistics of the set of individual plan plausibility scores.

It may be tempting to use the overall plan plausibility scores directly as decision guides or determinants: for example, to determine a statistic such as the average of all individual scores Planpl(j) of the participants j in the assessment group as an overall ‘group plausibility score’ GPlanpl, e.g. GPlanpl = 1/n ∑ Planpl(j) for all n members of the panel.

And, in evaluating a set of competing plan alternatives, to select the proposal with the highest ‘group plausibility’ score.
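A minimal sketch of such a group statistic, with purely hypothetical individual scores:

# Illustrative sketch: the group plan plausibility GPlanpl as the mean of
# the individual plan plausibility scores Planpl(j) of the n panel members.

from statistics import mean

individual_scores = [0.09, 0.25, -0.10, 0.16]  # hypothetical Planpl(j) values
print(round(mean(individual_scores), 2))       # 0.1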
Such temptations should be resisted, for a number of reasons: whether a discussion has succeeded in bringing in all pertinent items that should be given ‘due consideration’; the concern that planning arguments tend to be ‘qualitative’ in nature and often do not easily address quantitative measures of performance; questions regarding principles and the time frame of expected plan effects and consequences; whether and how issues of ‘quality’ of a plan are adequately addressed in the form of arguments; and the question of the appropriate ‘social aggregation’ criterion to be applied to the problem and plan in question. Many open questions remain:

Open questions

Likely incompleteness of the discussion
It is argued that participation of all affected parties and a live discussion will be more likely to bring out the concerns people are actually worried about than, say, reliance on general textbook knowledge by panels or surveys made up by experts who ‘don’t live there’. But even so, the assumption that the discussion guarantees complete coverage is unwarranted. For example, is somebody likely to raise an issue about a plan feature that they know will affect another party negatively (when they expect the plan to be good for their own faction), if the other party isn’t aware enough of this effect and does not raise it? Likewise, some things may be considered so much a matter ‘of course’ that nobody finds it necessary to mention them. So unless the overall process includes several different means of getting such information (systems modeling, simulation, extensive scrutiny of other cases, etc.), the argumentative discussion alone can’t be assumed to be sufficient to bring up all needed information.

Quantitative aspects in arguments.
The typical planning argument will usually be framed in ‘qualitative’ terms rather than quantitative measures. For example, in an argument that “the plan will be more sustainable than the current situation”, this matters in the plausibility assessment: the claim can be seen as quite plausible as long as there is some evidence of sustainability improvement, so participants may be inclined to give it a high pl-score close to +1. By comparison, if somebody makes the same argument but now claims a specific ‘sustainability’ performance measure, one that others may consider too optimistic and therefore assign a plausibility score closer to zero or even slightly negative: how will that affect the overall assessment? What procedural provisions would be needed to deal adequately with this question?

The issue of ‘quality’ or ‘goodness’ of a proposed solution.
It is of course possible that a discussion examines the quality or ‘goodness’ of a plan in detail, but as mentioned above, this will likely also be in general, qualitative terms, and is often even avoided because of the general acceptance of sayings like ‘you can’t argue about beauty’. So the discussion will have some difficulty in this respect, if it mentions beauty at all, or spiritual value, or the appropriateness of the resulting image. Likewise, requirements for the implementation of the plan, such as meeting regulations, may not be discussed.

The decreasing plausibility ‘paradox’
Arguably, all ‘systematic’ reasoning efforts, including discussion and debate, aim at giving decision-makers a higher degree of certainty about their final judgment than, say, fast offhand intuitive decisions. However, it turns out that the greater the depth and breadth of the discussion, the more the final plausibility judgment scores will tend to end up closer to the ‘zero’ or ‘don’t know’ plausibility, if the plausibility assessment is done honestly and seriously and the aggregation method suggested above is used: multiplying the plausibility assessments of the various premises (which for the factual premises will be probability estimates). Since these judgments are all about future expectations, they cannot honestly be given +1 (‘total certainty’) scores or even scores close to it, and the less so, the farther out in the future the effects are projected. This result can be quite disturbing and even disappointing to many participants when final scores are compared with initial ‘offhand’ judgments.
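A small numeric illustration of this effect (the scores are hypothetical): even if every premise is judged quite plausible, say at 0.9, the multiplied plausibility of an argument resting on several such premises drifts toward zero.

# Illustrative sketch: products of several honest, less-than-certain
# premise plausibilities approach the 'zero / don't know' point.

premise_score = 0.9
for n_premises in (1, 3, 6, 10):
    print(n_premises, round(premise_score ** n_premises, 2))
# prints: 1 0.9, 3 0.73, 6 0.53, 10 0.35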
Other issues related to time have often been inadequately dealt with in evaluation of any kind:

Estimates of plan consequences over time
All planning arguments express people’s expectations of the plan’s effects in the future. Of course, we know that there are relatively few cases in which a plan or action will generate results that materialize immediately upon implementation and then stay that way. So what do we mean when we offer an argument that a plan ‘will improve society’s overall health’, even resorting to ‘precise’ statistical indices like mortality rates or life expectancy? We know that these figures will change over time; one proposed policy will bring more immediate results than another, but the other will have a better effect in the long run; and again, the farther into the future we look, the less certain we must be about our prediction estimates. These things are not easily expressed even in carefully crafted arguments supported by the requisite statistics: how should we score their plausibility?

Tentative insights, conclusions?

These ‘not fully resolved / more work needed’ questions may seem to strengthen the case for evaluation approaches other than trying to draw support for planning decisions from discourse contributions, even with more detailed assessment of arguments than shown here (examining the evidence and support for each premise). However, the problems emerging from the examination of the argumentative process affect other evaluation tools as well, and I have not seen approaches that resolve them all more convincingly. So some first tentative conclusions are that planning debate and discourse, too familiar and accessible to experts and lay people alike to be dismissed in favor of other methods, would benefit from enhancements such as the argument assessment tools; but also that opportunities and encouragement should be offered to draw upon other tools, as called for by the circumstances of each case and the complexity of the plans.

These techniques and methods should be made available for use by expert and lay discourse participants alike, in a ‘toolkit’ part of a general planning discourse support platform: not as mandatory components of a general-purpose, one-size-fits-all planning method, but as a repository of tools for creative innovation and expansion. Because plans, as well as the processes that generate plans, define those involved as ‘the creators of that plan’, there will be a need to ‘make a difference’, to make it theirs: by changing, adapting, expanding and using the tools in new and different ways, and by inventing new tools in the process.

References:
Rittel, Horst: “APIS: A Concept for an Argumentative Planning Information System”. Institute of Urban and Regional Development, University of California at Berkeley, 1980. A report about research activities conducted for the Commission of European Communities, Directorate General XIIA.
–o–

 

 

EVALUATION IN THE PLANNING DISCOURSE: SAMPLE EVALUATION PROCEDURES EXAMPLE 1: FORMAL ‘QUALITY‘ EVALUATION

Thorbjørn Mann,  January 2020

In the following segments, a few example procedures for evaluation by groups will be discussed, to illustrate how the various parts of the evaluation process are selectively assembled into a complete process aiming at a decision (or recommendation for a decision) about a proposed plan or policy, and to facilitate understanding of how the different provisions and choices related to the evaluation task that are reviewed in this study can be assembled into practical procedures for specific situations. The examples are not intended as universal recommendations for use in all situations. They all will, arguably, call for improvement as well as adaptation to the specific project and situation at hand.

A common evaluation situation is that of a panel of evaluators comparing a number of proposed alternative plan solutions to select or recommend the ‘best’ choice for adoption, or, if there is only one proposal, to determine whether it is ‘good enough’ for implementation. It is usually carried out by a small group of people assumed to be knowledgeable in the specific discipline (for example, architecture) and reasonably representative of the interests of the project client (which may be the public). The rationale for such efforts, besides aiming for the ‘best’ decision, is the desire to ensure that the decision will be based on good expert knowledge, but also to provide transparency, legitimacy and accountability of the process, to justify the decision. The outcome will usually be a recommendation to the actual client decision-makers rather than the actual adoption or implementation decision, based on the group’s assessment of the ‘goodness’ or ‘quality’ of the proposed plan, documented in some form. (It will be referred to as a ‘Formal Quality Evaluation’ procedure.)

There are of course many possible variations of procedures for this task. The sample procedure described in the following is based on the Musso-Rittel (1) procedure for the evaluation of the ‘goodness’ or quality of buildings.

The group will begin by agreeing on the procedure itself and its various provisions: the steps to be followed (for example, whether evaluation aspects and weighting should be worked out before or after presentation of the plan or plan alternatives), general vocabulary, judgment and weighting scales, aggregation functions both for individual overall judgments and group indices, and decision rules for determining its final recommendation.

Assuming that the group has adopted the sequence of first establishing the evaluation aspects and criteria against which the plan (or plans) will be judged, the first step will be a general discussion of the aspects and sub-aspects to be considered, resulting in the construction of the ‘aspect tree’ of aspects, sub-aspects, sub-sub-aspects etc. (ref. the section on aspects and aspect trees) and criteria (the ‘objective’ measures of performance; ref. the section on evaluation criteria). The resulting tree will be displayed and become the basis for scoring worksheets.

The second step will be the assignment of aspect weights, on a scale of zero to 1 and such that at each level of the ‘tree’ the sum of the weights at that level will be 1. Panel members will develop their own individual weighting. This phase can be further refined by applying ‘Delphi Method’ steps: establishing and displaying the mean / median and extreme weighting values, asking the authors of extremely low or high weights to share and discuss the reasoning behind these judgments, and giving all members the chance to revise their weights.
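A minimal sketch of such a weighted aspect tree (the aspect names and weights are hypothetical, and this is only one of many possible ways to represent it):

# Illustrative sketch: weights sum to 1 at each level of the tree; the
# effective weight of a leaf aspect or criterion is the product of the
# weights along its branch.

aspect_tree = {
    "function": (0.5, {"circulation": (0.6, {}), "flexibility": (0.4, {})}),
    "cost":     (0.3, {}),
    "image":    (0.2, {}),
}

def effective_weights(tree, parent_weight=1.0, path=""):
    level_sum = sum(w for w, _ in tree.values())
    assert abs(level_sum - 1.0) < 1e-9, f"weights under '{path}' must sum to 1"
    leaves = {}
    for name, (w, children) in tree.items():
        if children:
            leaves.update(effective_weights(children, parent_weight * w, path + "/" + name))
        else:
            leaves[path + "/" + name] = round(parent_weight * w, 3)
    return leaves

print(effective_weights(aspect_tree))
# {'/function/circulation': 0.3, '/function/flexibility': 0.2, '/cost': 0.3, '/image': 0.2}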

Once the weighted evaluation aspect trees have been established, the next step will be the presentation of the plan proposal or competing alternatives.

Each participant will assign a first ‘overall offhand’ quality score (on the agreed-upon scale, e.g. -3 to +3) to each plan alternative.

The group’s statistics of these scores are then established and displayed. This may help to decide whether any further discussion and detailed scoring of aspects will be needed: there may be a visible consensus for a clear ‘winner’. If there are disagreements, the group proceeds with the detailed evaluation, and the initial scores are kept for later comparison with the final results. The detailed evaluation uses common worksheets or spreadsheets of the aspect tree, in which panel members fill in their weighting and quality scores. This step may involve the drawing of ‘criterion functions’ (ref. the section on evaluation criteria and criterion functions) to explain how each participant’s quality judgments depend on (objective) criteria or performance measures. These diagrams may be discussed by the panel. They should be considered each panel member’s subjective basis of judgment (or representation of the interests of factions in the population of affected parties). However, some such functions may be set by mandatory official regulations (such as building regulations). The temptation to urge adoption of common (group) functions (‘for simplicity’ and as an expression of ‘common purpose’) should be resisted, to avoid possible bias towards the interests of some parties at the expense of others.
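A minimal sketch of one such criterion function (the criterion, breakpoints and scores are hypothetical, standing in for whatever a panel member would actually draw):

# Illustrative sketch: a piecewise-linear criterion function translating an
# objective performance measure (here: walking distance to transit, in meters)
# into a subjective quality score on the -3..+3 scale.

BREAKPOINTS = [(100, 3.0), (400, 1.0), (800, -1.0), (1500, -3.0)]  # (measure, score)

def quality_score(measure):
    if measure <= BREAKPOINTS[0][0]:
        return BREAKPOINTS[0][1]
    if measure >= BREAKPOINTS[-1][0]:
        return BREAKPOINTS[-1][1]
    for (x0, y0), (x1, y1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if x0 <= measure <= x1:
            return y0 + (y1 - y0) * (measure - x0) / (x1 - x0)

print(quality_score(500))  # 0.5 (a quarter of the way from +1 at 400 m toward -1 at 800 m)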

Each group member will then fill in the scores for all aspects and sub-aspects etc. The results will be compiled, and the statistics compared; extreme differences in the scoring will be discussed, and members given the chance to change their assessments. This step may be repeated as needed (e.g. until there are no further changes in the judgments).

The results are calculated and the group recommendation determined according to the agreed-upon decision criterion. The ‘deliberated’ individual overall scores are compared with the members’ initial ‘offhand’ scores. The results may cause the group to revise the aspects, weights, or criteria (e.g. upon discovering that some critical aspect has been missed), or to call for changes in the plan, before determining the final recommendation or decision (again, according to the initial procedural agreements).
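As a minimal sketch of this calculation, assuming the simple weighted-sum aggregation (all values are hypothetical; other aggregation functions and decision rules could be agreed upon in the procedural agreements):

# Illustrative sketch: one member's overall quality score as the weighted sum
# of aspect quality scores, and a simple group statistic of all members' scores.

from statistics import mean

weights = {"circulation": 0.3, "flexibility": 0.2, "cost": 0.3, "image": 0.2}   # effective aspect weights
scores  = {"circulation": 2.0, "flexibility": 1.0, "cost": -1.0, "image": 2.5}  # one member's scores, -3..+3

overall = sum(weights[a] * scores[a] for a in weights)
print(round(overall, 2))  # 1.0

member_overall_scores = [1.0, 1.8, 0.6, -0.4]  # hypothetical deliberated overall scores
print(round(mean(member_overall_scores), 2))   # 0.75, to be judged against the agreed decision rule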

The steps are summarized in the following ‘flow chart’.

Evaluation example 1: Steps of a ‘Group Formal Quality Evaluation’

Questions related to this version of a formal evaluation process include the issue of potential manipulation of weight assignments by changing the steepness of the criterion function.
Ostensibly, the described process aims at ‘giving due consideration’ to all legitimately ‘pertinent’ aspects, while eliminating or reducing the role of ‘hidden agenda’ factors. Questions may arise whether such ‘hidden’ concerns might be concealed behind other plausible but inordinately weighted aspects. A question that may arise from discussions and argumentation about controversial aspects of a plan, and from the examination of how such arguments should be assessed (ref. the section on a process for Evaluation of Planning Arguments), is the role of plausibility judgments about the premises of such arguments: especially the probability of claims assuming that a plan will actually result in a desired or undesired outcome (an aspect). Should the ‘quality assessment’ process include a modification of quality scores based on plausibility / probability scores, or should this concern be explicitly included in the aspect list?

The process may of course seem ‘too complicated’, and if done by ‘experts’, it invites critical questions as to whether the experts really can overcome their own interests, biases and preconceptions to adequately consider the interests of other, less ‘expert’ groups. The procedure obviously assumes a general degree of cooperativeness in the panel, which sometimes may be unrealistic. Are more adequate provisions needed for dealing with incompatible attitudes and interests?

Other questions? Concerns? Missing considerations?

–o–