Reverse Engineering of Object Oriented Code. Springer

Monographs in Computer Science

Paolo Tonella

Alessandra Potrich

Foreword

There has been an ongoing debate on how best to document a software system
ever since the first software system was built. Some would have us write natural
language descriptions, some would have us prepare formal specifications,
others would have us produce design documents, and others would want us
to describe the software through test cases. There are even those who would have
us do all four: writing natural language documents, writing formal specifications,
producing standard design documents, and producing interpretable test
cases, all in addition to developing and maintaining the code. The problem
with this is that whatever is produced in the way of documentation soon becomes
useless unless it is maintained in parallel with the code. Maintaining
alternate views of complex systems becomes very expensive and highly
error prone. The views tend to drift apart and become inconsistent.

The authors of this book provide a simple solution to this perennial problem.
Only the source code is maintained and evolved. All of the other information
required on the system is taken from the source code. This entails
generating a complete set of UML diagrams from the source. In this way, the
design documentation will always reflect the real system as it is and not the
way the system should be from the viewpoint of the documentor. There can
be no inconsistency between design and implementation. The method used is
that of reverse engineering; the target of the method is object oriented code in
C++, C#, or Java. From the code, class diagrams, object diagrams, interaction
diagrams, and state diagrams are generated in accordance with the latest
UML standard. Since the method is automated, there are no additional costs.
Design documentation is provided at the click of a button.

This approach, the result of many years of research and development, will
have a profound impact upon the way IT systems are documented. Besides
the source code itself, only one other view of the system needs to be developed
and maintained, that is, the user view in the form of a domain specific language.
Each application domain will have to come up with its own language
to describe applications from the viewpoint of the user. These languages may
range from natural languages to set theory to formal mathematical notations.
What these languages will not describe is how the system is or should be constructed.
This is the purpose of UML as a modeling language. The techniques
described in this book demonstrate that this design documentation can and
should be extracted from the code, since this is the cheapest and most reliable
means of achieving this end. There may be some UML documents produced
on the way to the code, but since complex IT systems are almost always developed
by trial and error, these documents will only have a transitory nature.
The moment the code exists they are both obsolete and superfluous. From
then on, the same documents can be produced more cheaply and better from the
code itself. This approach coincides with and supports the practice of extreme programming.

Of course there are several drawbacks, as some types of information are
not captured in the code and, therefore, reverse engineering cannot capture
them. An example is that there still needs to be a test oracle – something to
test against. This something is the domain specific specification from which
the application-oriented test cases are derived. The technical test cases can
be derived from the generated UML diagrams. In this way, the system as
implemented will be verified against the system as specified. Without the
UML diagrams, extracted from the code, there would be no adequate basis of comparison.

For these and other reasons, this book is highly recommended to all
who are developing and maintaining Object-Oriented software systems. They
should be aware of the possibilities and limitations of automated post documentation.
It will become increasingly significant in the years to come, as the
current generation of OO-systems becomes the legacy systems of the future.
The implementation knowledge they encompass will most likely be only in the
source, and there will be no means of regaining it other than through reverse engineering.

Trento, Italy, July 2004

Benevento, Italy, July 2004

Harry Sneed

Aniello Cimitile

Preface

Diagrams representing the organization and behavior of an Object Oriented
software system can help developers comprehend it and evaluate the impact of
a modification. However, such diagrams are often unavailable or inconsistent
with the code. Their extraction from the code is thus an appealing option.
This book represents the state of the art of the research in Object Oriented
code analysis for reverse engineering. It describes the algorithms involved
in the recovery of several alternative views from the code and some of the
techniques that can be adopted for their visualization.

During software evolution, availability of high level descriptions is extremely
desirable, in support of program understanding and change-impact
analysis. In fact, location of a change to be implemented can be guided by
high level views. The dependences among entities in such views indicate the
extent of the ripple effects.

However, it is often the case that diagrams available during software evolution
are not consistent with the code, or – even more frequently – that no
diagram has been produced at all. In such contexts, it is crucial to be
able to reverse engineer design diagrams directly from the code. Reverse engineered
diagrams are a faithful representation of the actual code organization
and of the actual interactions among objects. Programmers do not face any
misalignment or gap when moving from such diagrams to the code.

The material presented in this book is based on the techniques developed
during a collaboration we had with CERN (Conseil Européen pour la
Recherche Nucléaire). At CERN, work for the next generation of experiments
to be run on the Large Hadron Collider started well in advance, since
these experiments represent a major challenge because of the size of the devices,
teams, and software involved. We collaborated with CERN in the introduction
of tools for software quality assurance, among them a reverse engineering tool.

The algorithms described in this book deal with the reverse engineering of
the following diagrams:

Class diagram: Extraction of inter-class relationships in the presence of weakly
typed containers and interfaces, which prevent an exact knowledge of the
actual type of referenced objects.

Object and interaction diagrams: Recovery of the associations among
the objects that instantiate the classes in a system and of the messages
exchanged among them.

State diagram: Modeling of the behavior of each class in terms of states and
state transitions.

Package diagram: Identification of packages and of the dependences among packages.

All the algorithms share a common code analysis framework. The basic
principle underlying such a framework is that information is derived statically
(no code execution) by propagating suitable data in a graph
representation of the object flows occurring in a program. The data structure
that has been defined for such a purpose is called the Object Flow Graph
(OFG). It allows tracking the lifetime of the objects from their creation through
their assignment to program variables.
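
As a rough illustration of the propagation principle just described, the following Java sketch builds a tiny flow graph in which an edge from y to x stands for an assignment x = y, seeds allocation points with object identifiers, and pushes those identifiers along the edges until nothing changes. It is a simplified sketch and should not be read as the algorithm defined in the book: the class name, the method names, and the four-variable example are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified flow graph: nodes are program locations (variables),
// and an edge source -> target models the assignment "target = source".
class ObjectFlowSketch {
    private final Map<String, Set<String>> successors = new HashMap<>();
    private final Map<String, Set<String>> objects = new HashMap<>();

    // Records the assignment "target = source".
    void addAssignment(String target, String source) {
        successors.computeIfAbsent(source, k -> new HashSet<>()).add(target);
        objects.computeIfAbsent(source, k -> new TreeSet<>());
        objects.computeIfAbsent(target, k -> new TreeSet<>());
    }

    // Records the allocation "target = new ...", labeled with an object identifier.
    void addAllocation(String target, String objectId) {
        objects.computeIfAbsent(target, k -> new TreeSet<>()).add(objectId);
    }

    // Worklist propagation: object identifiers flow along edges until a fixpoint.
    void propagate() {
        Deque<String> worklist = new ArrayDeque<>(objects.keySet());
        while (!worklist.isEmpty()) {
            String node = worklist.poll();
            for (String next : successors.getOrDefault(node, Collections.emptySet())) {
                if (objects.get(next).addAll(objects.get(node))) {
                    worklist.add(next); // next changed, so revisit its successors
                }
            }
        }
    }

    Set<String> objectsAt(String location) {
        return objects.getOrDefault(location, Collections.emptySet());
    }

    public static void main(String[] args) {
        ObjectFlowSketch graph = new ObjectFlowSketch();
        graph.addAllocation("a", "Book1");    // a = new Book();
        graph.addAllocation("b", "Journal1"); // b = new Journal();
        graph.addAssignment("c", "a");        // c = a;
        graph.addAssignment("c", "b");        // c = b;
        graph.addAssignment("d", "c");        // d = c;
        graph.propagate();
        System.out.println("c may reference " + graph.objectsAt("c")); // [Book1, Journal1]
        System.out.println("d may reference " + graph.objectsAt("d")); // [Book1, Journal1]
    }
}
```

Because no code is executed and statement order is ignored in this sketch, both objects are reported for c and d. This is the kind of conservative answer a static propagation gives.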

UML, the Unified Modeling Language, has been chosen as the graphical
language to present the outcome of reverse engineering. This choice was motivated
by the fact that UML has become the standard for the representation
of design diagrams in Object Oriented development. However, the choice of
UML is by no means restrictive, in that the same information recovered from
the code can be provided to the users in different graphical or non graphical formats.

A well known concern of most reverse engineering methods is how to filter
the results, when their size and complexity are excessively high. Since
the recovered diagrams are intended to be inspected by a human, the presentation
modes should take into account the cognitive limitations of humans
explicitly. Techniques such as focusing, hierarchical structuring and element
explosion/implosion will be introduced specifically for some diagram types.

The research community working in the field of reverse engineering has
produced an impressive amount of knowledge related to techniques and tools
that can be used during software evolution in support of program understanding.
It is the authors’ opinion that an important step forward would be
to publish the achievements obtained so far in comprehensive books dealing
with specific subtopics.

This book on reverse engineering from Object Oriented code goes exactly
in this direction. The authors have produced several research papers in this
field over time and have been active in the research community. The techniques
and the algorithms described in the book represent the current state of the art.

Trento, Italy

July 2004

Paolo Tonella

Alessandra Potrich

Introduction

Reverse engineering aims at supporting program comprehension, by exploiting
the source code as the major source of information about the organization
and behavior of a program, and by extracting a set of potentially useful views
provided to programmers in the form of diagrams. Alternative perspectives
can be adopted when the source code is analyzed and different higher level
views are extracted from it. The focus may either be on the structure, on
the behavior, on the internal states, or on the physical organization of the
files. A single diagram recovered from the code through reverse engineering
is insufficient. Rather, a set of complementary views needs to be obtained,
addressing different program understanding needs.

In this chapter, the role of reverse engineering within the life cycle of a
software system is described. The activities of program understanding and
impact analysis are central during the evolution of an existing system. Both
activities can benefit from sources of knowledge about the program such as
reverse engineered diagrams.

The reverse engineering techniques presented in the following chapters are
described with reference to an example program used throughout the book. In
this chapter, this example program is introduced and discussed. Then, some
of the diagrams that are the object of the following chapters are provided for
the example program, showing their usefulness from the programmer’s point
of view. The remaining parts of the book contain the algorithmic details on
how to recover them from the source code.

Reverse Engineering
In the life cycle of a software system, the maintenance phase is the largest
and the most expensive. Starting after the delivery of the first version of the
software [35], maintenance lasts much longer than the initial development
phase. During this time, the software will be changed and enhanced over and
over. So it is more appropriate to speak of software evolution with reference
to the whole life cycle, in which the initial development is only a special case
where the existing system is empty.

Software evolution is characterized by the existence of the source code of
the system. Thus, the typical activity in software evolution is the implementation
of a program change, in response to a change request. Changes may
be aimed at correcting the software (corrective maintenance), at adding
functionality (perfective maintenance), at adapting the software to a changed
environment (adaptive maintenance), or at restructuring it to make future
maintenance easier (preventive maintenance) [35].

During software evolution, the most reliable and accurate description of
the behavior of a software system is its source code. In fact, design diagrams
are often outdated or missing altogether. However, valuable as it is, the source
code may not directly answer all questions about the system. Reverse engineering
techniques provide a way to extract higher level views of the system,
which summarize some relevant aspects of the computation performed by the
program statements. Reverse engineered diagrams support program comprehension,
as well as restructuring and traceability.

When an existing code base is worked on, the micro-process of program
change can be decomposed into localizing the change, assessing the impact,
and implementing the change. All such activities depend on the knowledge
available about the program to be modified. In this respect, reverse engineering
techniques are a useful support. Reverse engineering tools provide useful
high level information about the system being maintained, thus helping programmers
locate the component to be modified. Moreover, the relationships
(dependencies, associations, etc.) that connect the entities in reverse engineered
diagrams provide indications about the impact of a change. By tracing
such relationships, the set of entities possibly affected by a change is obtained.
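
The tracing of relationships mentioned above can be sketched as a reachability computation over a dependency graph. The following Java sketch assumes that a reverse engineered diagram has already been reduced to "client depends on supplier" pairs. The class name, the method names, and the dependencies in the example are hypothetical and are not extracted from any real system.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical impact analysis over a dependency graph.
class ImpactAnalysisSketch {
    // dependents.get(x) holds the entities that depend on x directly.
    private final Map<String, Set<String>> dependents = new HashMap<>();

    // Records that "client" depends on "supplier".
    void addDependency(String client, String supplier) {
        dependents.computeIfAbsent(supplier, k -> new HashSet<>()).add(client);
    }

    // Breadth-first traversal: everything that depends on the changed entity,
    // directly or transitively, is possibly affected.
    Set<String> impactOf(String changed) {
        Set<String> affected = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(changed);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String client : dependents.getOrDefault(current, Collections.emptySet())) {
                if (!client.equals(changed) && affected.add(client)) {
                    queue.add(client);
                }
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        ImpactAnalysisSketch graph = new ImpactAnalysisSketch();
        // Illustrative dependencies only.
        graph.addDependency("Loan", "Document");
        graph.addDependency("Loan", "User");
        graph.addDependency("Library", "Loan");
        graph.addDependency("Main", "Library");
        System.out.println(graph.impactOf("Document")); // [Library, Loan, Main]
        System.out.println(graph.impactOf("Main"));     // []
    }
}
```

The result is an upper bound on the ripple effect: an entity in the set may be affected, but is not necessarily so.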

Object Oriented programming poses special problems to software engineers
during the maintenance phase. Correspondingly, reverse engineering
techniques have to be customized to address them. For example, the behavior
of an Object Oriented program emerges from the interactions occurring among
the objects allocated in the program. The related instructions may be spread
across several classes, which individually perform a very limited portion of
the work locally and delegate the rest of it to others. Reverse engineered diagrams
capture such collaborations among classes/objects, summarizing them
in a single, compact view. However, recovering accurate information about
such collaborations represents a special challenge, requiring major improvements
to the available reverse engineering methods [48, 100].

When a software system is analyzed to extract information about it, the
fundamental choice is between static and dynamic analysis. Dynamic analysis
requires a tracer tool to save information about the objects manipulated and
the methods dispatched during program execution. The diagrams that can
be reverse engineered in this way are partial. They hold only for a single,
given execution of the program, with given input values, and they cannot be
easily generalized to the behavior of the program for any execution with any
input. Moreover, dynamic analysis is possible only for complete, executable
systems, while in Object Oriented programming it is typical to produce incomplete
sets of classes that are reused in different contexts. In contrast,
a static analysis produces results that are valid for all executions and for all
inputs. On the other hand, static analyses may be over-conservative. In fact,
it is undecidable in general whether a statically possible path is feasible, i.e.,
whether there exists an input value allowing its traversal. Static analysis may
conservatively assume that some paths are executable when they actually are not.
Consequently, it may produce results for which no input value exists. In the
following chapters, the advantages and disadvantages of the two approaches
will be discussed for each specific diagram, illustrating them on an executable example.
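
The contrast between the two kinds of analysis can be illustrated with a toy Java example, given here only as a sketch. The method trace contains two conditions on the same variable, so a path-insensitive static view that treats the branches as independent reports four paths, while concrete executions only ever exhibit two of them. All names are hypothetical. In this tiny case infeasibility is easy to see by hand, whereas in general it cannot be decided.

```java
import java.util.Set;
import java.util.TreeSet;

class FeasiblePathsSketch {
    // Concrete execution: records the branch outcomes taken for input x.
    static String trace(int x) {
        StringBuilder path = new StringBuilder();
        path.append(x > 0 ? "T" : "F");  // first condition: x > 0
        path.append(x <= 0 ? "T" : "F"); // second condition: x <= 0
        return path.toString();
    }

    // Path-insensitive static view: branch outcomes are treated as independent.
    static Set<String> staticPaths() {
        Set<String> paths = new TreeSet<>();
        for (String first : new String[] {"T", "F"}) {
            for (String second : new String[] {"T", "F"}) {
                paths.add(first + second);
            }
        }
        return paths;
    }

    // Dynamic view: only the paths observed on the supplied inputs.
    static Set<String> dynamicPaths(int... inputs) {
        Set<String> paths = new TreeSet<>();
        for (int x : inputs) {
            paths.add(trace(x));
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println("static:  " + staticPaths());          // [FF, FT, TF, TT]
        System.out.println("dynamic: " + dynamicPaths(5));        // [TF]
        System.out.println("dynamic: " + dynamicPaths(5, -3, 0)); // [FT, TF]
    }
}
```

The paths TT and FF appear only in the static result and correspond to no input, while a single run with input 5 misses the path FT: the static view is conservative and the dynamic view is partial.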

UML (Unified Modeling Language) [7, 69] has become the standard graphical
language used to represent Object Oriented systems in diagrammatic form.
Its specifications have been recently standardized by the Object Management
Group (OMG) [1]. UML has been adopted by several software companies, and
its theoretical aspects are the subject of several research studies. For these reasons,
UML was chosen as the graphical representation that is produced as the
output of the reverse engineering techniques described in this book. However,
the choice of UML is by no means limiting: while the information reverse
engineered from the code can be represented in different graphical (or non
graphical) forms, the basic analysis methods exploited to produce it can be
reused unchanged in alternative settings, with UML replaced by some other
description language.

An important issue reverse engineering techniques must take into account
is usability. Since the recovered views are for humans and not for computers,
they must be compatible with the cognitive abilities of human beings. This
means that diagrams convey useful information only if their size is kept small
(while 10 entities may be fine, 100 starts being too much and 1000 makes a
diagram unreadable). Several approaches can be adopted to support visualization
and navigation modes making reverse engineered information usable.
They range from the possibility to focus on a portion of the system, to the
expand/collapse or zoom in/out operations, or to the availability of an overall
navigation map complemented by a detailed view. In the following chapters,
ad hoc methods will be described with reference to the specific diagrams being produced.

The eLib Program
The eLib program is a small Java program that supports the main functions
performed in a library. Its code is provided in Appendix A. It will be used in
the remainder of this book as the running example.

In eLib, libraries are supposed to hold an archive of documents of different
categories, properly classified. Each document can be uniquely identified by
the librarian. Library users can request some of these documents for loan,
subject to the applicable access rules. In order to borrow a document, users must be
identified by the librarian. For example, this could be achieved by distributing
library cards to registered users.

As regards the management of the documents in the eLib system, the
librarian can insert new documents in the archive and remove documents
no longer available in the library. Upon request, the librarian may need to
search the archive for documents according to some search criterion, such as
title, authors, ISBN code, etc. The documents held by a library are of several
different kinds, including books, journals, and technical reports. Each of them
has specific properties and specific access restrictions.

As far as user management is concerned, a set of personal data (name,
address, phone number, etc.) is maintained in the archive. A special category
of users consists of internal users, who have special permission to access
documents not allowed for loan to normal users.

The main functionality of the eLib system is loan management. Users can
borrow documents up to a maximum number. While books are available for
loan to any user, journals can be borrowed only by internal users, and technical
reports can be consulted but not borrowed.
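
The loan rules stated above can be sketched in Java as follows. This is not the code of Appendix A: every class, method, and constant name is an assumption made for illustration, the maximum number of loans is an arbitrary value, and document availability is deliberately left out.

```java
// Hypothetical sketch of the loan rules; not the eLib source code.
class LoanRulesSketch {
    static final int MAX_LOANS = 2; // assumed limit, chosen only for the example

    static class User {
        int loans = 0;
        boolean isInternal() { return false; }
    }

    static class InternalUser extends User {
        @Override boolean isInternal() { return true; }
    }

    static abstract class Document {
        abstract boolean mayBeBorrowedBy(User user);
    }

    static class Book extends Document {
        // Books are available for loan to any user.
        @Override boolean mayBeBorrowedBy(User user) { return true; }
    }

    static class Journal extends Document {
        // Journals can be borrowed only by internal users.
        @Override boolean mayBeBorrowedBy(User user) { return user.isInternal(); }
    }

    static class TechnicalReport extends Document {
        // Technical reports can be consulted but never borrowed.
        @Override boolean mayBeBorrowedBy(User user) { return false; }
    }

    // Grants the loan only if the user is under the limit and the document allows it.
    static boolean borrow(User user, Document document) {
        if (user.loans >= MAX_LOANS || !document.mayBeBorrowedBy(user)) {
            return false;
        }
        user.loans++;
        return true;
    }

    public static void main(String[] args) {
        User normal = new User();
        User internal = new InternalUser();
        System.out.println(borrow(normal, new Journal()));           // false
        System.out.println(borrow(internal, new Journal()));         // true
        System.out.println(borrow(internal, new TechnicalReport())); // false
        System.out.println(borrow(normal, new Book()));              // true
        System.out.println(borrow(normal, new Book()));              // true
        System.out.println(borrow(normal, new Book()));              // false, limit reached
    }
}
```

Placing the rule in the document hierarchy rather than in one central check is only one possible design. It is used here because it shows how responsibilities can end up spread across several classes.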

Although this is a small application, by going through the source code
of the eLib program (see Appendix A) it is not so easy to understand how
the classes are organized, how they interact with each other to fulfill the
main functions, how responsibilities are distributed among the classes, what
is computed locally and what is delegated. For example, a programmer aiming
at understanding this application may have the following questions:

What is the overall system organization?
What objects are updated when a document is borrowed?
What classes are responsible for checking whether a given document can be
borrowed by a given user?
How is the maximum number of loans handled?
What happens to the state of the library when a document is returned?

Let us assume the following change request (perfective maintenance):

When a document is not available for loan, a user can reserve it, if it
has not been previously reserved by another user. When a document
is returned to the library, the user who reserved it is contacted, if
any is associated with the document. The user can either borrow the
document that has become available or cancel the reservation. In both
cases, after this operation the reservation of the document is deleted.

The programmer who is responsible for its implementation may have the following
questions about the system:

Does the overall system organization need any change?
What classes need to collaborate to realize the reservation functionality?
Is there any possible side effect on the existing functionalities?
What changes should be made in the procedure for returning documents
to the library?
How is the new state of a document described?
Is there any interaction between the new rules for document borrowing
and the existing ones?

In the following sections, we will see how UML diagrams reverse engineered
from the code can help answer the program understanding and impact analysis
questions listed above.
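
To make the reservation change request concrete, the following Java sketch models the intended behavior of a single document. It is not taken from the eLib code: the class, its methods, and the use of plain strings for users are hypothetical. It also assumes that "not available for loan" means "currently on loan" and that a pending reservation blocks loans to other users, neither of which is stated in the change request.

```java
// Hypothetical model of one document under the reservation change request.
class ReservationSketch {
    private String borrower;   // user currently holding the document, or null
    private String reservedBy; // user with a pending reservation, or null

    // Assumption: a pending reservation blocks ordinary loans.
    boolean borrow(String user) {
        if (borrower != null || reservedBy != null) {
            return false;
        }
        borrower = user;
        return true;
    }

    // A document can be reserved only while on loan and not already reserved.
    boolean reserve(String user) {
        if (borrower == null || reservedBy != null) {
            return false;
        }
        reservedBy = user;
        return true;
    }

    // Returns the user to contact, or null when nobody reserved the document.
    String giveBack() {
        borrower = null;
        return reservedBy;
    }

    // The contacted user borrows the document; the reservation is deleted.
    boolean acceptReservation() {
        if (reservedBy == null || borrower != null) {
            return false;
        }
        borrower = reservedBy;
        reservedBy = null;
        return true;
    }

    // The contacted user declines; the reservation is deleted.
    void cancelReservation() {
        reservedBy = null;
    }

    String borrower() { return borrower; }
    String reservedBy() { return reservedBy; }

    public static void main(String[] args) {
        ReservationSketch document = new ReservationSketch();
        System.out.println(document.borrow("alice"));       // true
        System.out.println(document.reserve("bob"));        // true
        System.out.println(document.reserve("carol"));      // false, already reserved
        System.out.println(document.giveBack());            // bob
        System.out.println(document.acceptReservation());   // true
        System.out.println(document.borrower());            // bob
        System.out.println(document.reservedBy());          // null
    }
}
```

Even in this small model, the change touches the operations for borrowing and returning as well as the state of the document, which is what the questions above ask the programmer to assess.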

Table of Contents

Foreword

Preface

1. Introduction

Reverse Engineering

The eLib Program

Class Diagram

Object Diagram

Interaction Diagrams

State Diagrams

Organization of the Book

2. The Object Flow Graph

Abstract Language

Declarations

Statements

Object Flow Graph

Containers

Flow Propagation Algorithm

Object sensitivity

The eLib Program

Related Work

3. Class Diagram

Class Diagram Recovery

Recovery of the inter-class relationships

Declared vs. actual types

Flow propagation

Visualization

Containers

Flow propagation

The eLib Program

Related Work

Object identification in procedural code

4. Object Diagram

The Object Diagram

Object Diagram Recovery

Object Sensitivity

Dynamic Analysis

Discussion

The eLib Program

OFG Construction

Object Diagram Recovery

Discussion

Dynamic analysis

Related Work

5. Interaction Diagrams

Interaction Diagrams

Interaction Diagram Recovery

Incomplete Systems

Focusing

Dynamic Analysis

Discussion

The eLib Program

Related Work

6. State Diagrams

State Diagrams

Abstract Interpretation

State Diagram Recovery

The eLib Program

Related Work

7. Package Diagram

Package Diagram Recovery

Clustering

Modularity Optimization

Feature Vectors

Concept Analysis

The eLib Program

Related Work

8. Conclusions

Tool Architecture

Language Model

The eLib Program

Change Location

Impact of the Change

Perspectives

Related Work

Code Analysis at CERN

A Source Code of the eLib program

B Driver class for the eLib program

References

Index

Product details

Pages	223 p
ISBN	0-387-23803-4
Copyright	2005 Springer Science+Business Media, Inc.
