<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://christop.club/feed.xml" rel="self" type="application/atom+xml" /><link href="https://christop.club/" rel="alternate" type="text/html" /><updated>2025-11-24T10:41:28+00:00</updated><id>https://christop.club/feed.xml</id><title type="html">christop.club</title><subtitle>Christopher S. Corley is a software engineer in Chattanooga, TN
</subtitle><entry><title type="html">Reviews for “Web Usage Patterns of Developers”</title><link href="https://christop.club/2015/08/01/reviews-for-web-usage-patterns-of-developers/" rel="alternate" type="text/html" title="Reviews for “Web Usage Patterns of Developers”" /><published>2015-08-01T00:00:00+00:00</published><updated>2015-08-01T00:00:00+00:00</updated><id>https://christop.club/2015/08/01/reviews-for-web-usage-patterns-of-developers</id><content type="html" xml:base="https://christop.club/2015/08/01/reviews-for-web-usage-patterns-of-developers/"><![CDATA[<p>I’m really racking them up this year at <a href="http://www.icsme.uni-bremen.de/">ICSME’15</a>!  This one is an Industry
track paper I did with the good folks over at <a href="http://codealike.com">Codealike</a>.</p>

<p>Here is the <a href="/publications/pdfs/Corley-etal_2015b.pdf">preprint</a>. You’ll find
more information about this paper on my <a href="/publications">publications</a> page as I
add it.</p>

<h1 id="abstract">Abstract</h1>

<blockquote>
  <p>Developers often rely on web-based tools for troubleshooting, collaboration,
issue tracking, code reviewing, documentation viewing, and a myriad of other
uses. Developers also use the web for non-development purposes, such as reading
news or social media. In this paper we explore whether web usage is detrimental to
a developer’s focus on work, using a sample of over 150 developers. Additionally,
we investigate whether highly focused developers use the web differently than other
developers. Our qualitative findings suggest highly focused developers use the
web differently, but we are unable to predict a developer’s focus based on web
usage alone. Further quantitative findings suggest that web usage does not
negatively impact a developer’s focus.</p>
</blockquote>

<h1 id="review-1">Review 1</h1>

<p>This is an interesting paper that primarily presents data - which is great.</p>

<p>In a few cases - such as the Office Collaboration tools - I would have liked to see some attempt at better understanding why the quartiles were so inconsistent (or didn’t fall off in what might be an expected way). I don’t know if you have anything in your data that might be revealing.</p>

<h1 id="review-2">Review 2</h1>

<p>Summary</p>

<p>The authors studied how developers use online content and to what degree their development focus is influenced.
The aim of their study is to find out whether software quality might be affected by internet content, given that online tools in general are required for development.
They concluded that using the internet for general purposes need not influence developers, but highly focused developers use the web rarely anyway.</p>

<p>Strong Aspects</p>

<p>In general, it is a good direction of research to better understand developers’ behaviour, and especially state-of-the-art influences on their daily work.
Studying open web content and the rising accessibility to it during work is an interesting field as well.
The paper described the data set and its processing in a good manner, and the paper is written in good style.</p>

<p>Weak Aspects</p>

<p>While I accept the overall topic and the paper’s quality, I miss a more qualitative discussion. Perhaps the authors should get in contact with psychology groups, especially on the topic of attention and productivity.</p>

<p>For example, I miss a discussion of productivity patterns such as the Pomodoro technique (i.e., no one is able to keep attention active during the whole day, so one intentionally gets distracted in a time-boxed manner).
Furthermore, social interaction is very important for anyone, also to raise productivity. In a similar manner, people differ in how much interaction they need to be as efficient / focused as they can be. The authors did not study, for each individual, how the effect might change if, for example, access to social networks were blocked.
Additionally, the authors did not discuss whether some participants might have worked for companies that already block social networks.</p>

<p>On a detailed level, I miss a clearer separation between the web content categories. For example, how did you decide between technical blogs and general-purpose blogs not relevant to their work?</p>

<p>Finally, there is no discussion of a threat to validity due to the influence of the measurement itself on the participants. If they knew that their behaviour was being analysed, they might have acted differently.</p>]]></content><author><name></name></author><category term="reviews" /><category term="open science" /><category term="web activity" /><category term="mining" /><category term="developer focus" /><category term="codealike" /><summary type="html"><![CDATA[I’m really racking them up this year at ICSME’15! This one is an Industry track paper I did with the good folks over at Codealike.]]></summary></entry><entry><title type="html">Reviews for “Exploring the Use of Deep Learning for Feature Location”</title><link href="https://christop.club/2015/07/25/reviews-for-doc2vec-feature-location/" rel="alternate" type="text/html" title="Reviews for “Exploring the Use of Deep Learning for Feature Location”" /><published>2015-07-25T00:00:00+00:00</published><updated>2015-07-25T00:00:00+00:00</updated><id>https://christop.club/2015/07/25/reviews-for-doc2vec-feature-location</id><content type="html" xml:base="https://christop.club/2015/07/25/reviews-for-doc2vec-feature-location/"><![CDATA[<p>I’ve been blessed with a second publication at <a href="http://www.icsme.uni-bremen.de/">ICSME’15</a>!  This one is an
Early Research Achievements (ERA) track paper on using Gensim’s <a href="http://radimrehurek.com/gensim/models/doc2vec.html">Doc2Vec</a> for
feature location.</p>
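<p><em>Aside: the ranking step in an FLT like this one boils down to comparing a query’s inferred vector against each program element’s vector. Here is a minimal pure-Python sketch of cosine-similarity ranking; the method names and tiny 3-dimensional vectors are invented for illustration (real vectors would be inferred by a trained Doc2Vec model):</em></p>

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_elements(query_vec, element_vecs):
    """Return (element, similarity) pairs sorted best-first."""
    scored = [(name, cosine(query_vec, vec))
              for name, vec in element_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented 3-dimensional stand-ins for inferred document vectors.
elements = {
    "DiagramRenderer.redraw": [0.9, 0.1, 0.0],
    "ConfigParser.load":      [0.1, 0.8, 0.2],
    "Logger.flush":           [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # invented vector for the query "diagram"

ranking = rank_elements(query, elements)
```

<p><em>The top of <code>ranking</code> is then presented to the developer as the most likely location of the feature.</em></p>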

<p>Overall, I am very happy with the comments we received!  Below are the reviews.</p>

<p>Here is the <a href="/publications/pdfs/Corley-etal_2015a.pdf">preprint</a>. You’ll find
more information about this paper on my <a href="/publications">publications</a> page as I
add it.</p>

<h1 id="review-1">Review 1</h1>

<p>Summary:</p>

<p>The goal of this work is to support feature location using deep learning approaches. The authors claim that deep learning provides the ability to incorporate the order of terms as opposed to traditional feature location techniques, which have treated source code as an unordered set of terms. The authors report improvements in performance (using mean reciprocal rank) using a particular deep learning model over a set of six software systems. Additionally, the authors estimate the average time to rank per query and the model training time.</p>
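<p><em>Aside: for anyone unfamiliar with mean reciprocal rank, it is just the average of 1/rank of the first relevant result over all queries. A tiny pure-Python sketch (the ranks below are made up, not from the paper):</em></p>

```python
def mean_reciprocal_rank(ranks):
    """MRR over the 1-based ranks of the first relevant result per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the goldset method for four queries.
ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 0.25 + 0.1) / 4 = 0.4625
```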

<p>Strengths:</p>

<ul>
  <li>+Emerging area in SE research</li>
  <li>+Using real systems</li>
</ul>

<p>Weaknesses:</p>

<ul>
  <li>-Conceptual gaps in the presentation</li>
  <li>-Study design</li>
</ul>

<p>Comments:</p>

<p>Feature location is a critical maintenance task, and bringing deep learning models to bear is a new approach certainly worth examination. Please consider the following comments and questions to strengthen the paper.</p>

<p>“Therefore, when querying for diagram, program elements where redraw is also present are considered more relevant and thus are boosted in the rankings.” Technically, LDA would consider the co-occurrence in this example too. (Yes, LDA would discard information on the order.) So what precisely will distinguish the approach based on deep learning models in this respect? Is it exclusively the order of terms? Would n-gram topic models be relevant here?</p>

<p>“We also suggest directions for future work on the use of DVs (or other deep learning models) to improve developer effectiveness in feature location.” I recommend rewording this statement since the concern is not improving <em>developer</em> effectiveness per se. The concern is improving the effectiveness of feature location engines.</p>

<p>“A deep learning neural network encodes source code identifiers, in the order they appear in the source code.” Technically, this statement is not correct. Imagine a software corpus with |I| source code identifiers. Consider a neural network with several hidden layers and an input layer with size |I|, where each unit in the input layer corresponds to an identifier. If we represent source code files as vectors of relative frequencies of identifiers and train the model to learn a compact representation of its input, then this deep learning neural network does not encode identifiers in the order they appear in the source code.</p>

<p>The semantic similarity example given in Sec. II.B should be ported to the SE domain for effect. Moreover, at the end of Sec. II, I still don’t have a good understanding of the model nor how the model is designed to address concerns inherent to feature location in a new or more efficient way.</p>

<p>Please consider adding research questions to Sec. III.</p>

<p>A general summary of tests supporting the statistical significance and effect of the results reported in Sec. III would rigorously support the authors’ claims on the performance gains.</p>

<p>Re: Tab. IV: Is model training time really a concern? LDA—the baseline model—appears to be on the order of one second (for 100 topics).</p>

<p>Why is the related work started by referring to n-grams, which do not appear to bear on the problem at hand? I would expect a reference to n-grams in the paper (aside from the second sentence of the abstract) if they are in fact related.</p>

<p>“statistical models of natural language text able to capture more complex patterns while being trained using smaller corpora relative to the n-gram model [3] [8]” It is not clear how these references substantiate the claim of using smaller corpora. I also don’t see the need to even emphasize smaller corpora.</p>

<p>Another related paper that should be discussed is on configuring LDA for SE tasks: Panichella, A., Dit, B., Oliveto, R., Di Penta, M., Poshyvanyk, D., and De Lucia, A., “How to Effectively Use Topic Models for Software Engineering Tasks? An Approach based on Genetic Algorithms”, in Proceedings of 35th IEEE/ACM International Conference on Software Engineering (ICSE’13), San Francisco, CA, May 18-26, 2013, pp. 522-531</p>

<p>Minor points:</p>

<ul>
  <li>
    <p>“Deep learning models are a class of neural networks.” I recommend rewording this statement. Fundamentally, deep learning models are characterized by multiple “levels” of nonlinear transformations. Neural networks are a convenient abstraction for deep learning, but I would shy away from explicitly subtyping deep learning models from neural networks even though neural networks dominate deep learning applications;</p>
  </li>
  <li>
    <p>“that has shown promising results in modeling natural language” needs citation(s);</p>
  </li>
  <li>
    <p>In the third paragraph of the Introduction, the last sentence needs a citation;</p>
  </li>
  <li>
    <p>“4Hz processor” should probably be 4GHz processor.</p>
  </li>
</ul>

<p>There are several grammatical and typographical issues in the current version. Should the paper be accepted, the authors should fix these issues to ensure the paper is in the best possible shape for the camera-ready version.</p>

<h1 id="review-2">Review 2</h1>

<p>Summary</p>

<p>The paper describes an initial study of using a neural network machine learning approach for feature location.  They use document vectors (DV) and compare it to LDA.</p>

<p>Comments</p>

<p>Pretty straightforward paper without any big surprises or critical missing information.  They leveraged Dit et al.’s work for the experiment and setup.</p>

<p>DV appears to be about the same accuracy as LDA but looks to be much faster in query time and training time.  So some trade-offs.</p>

<p>DV is a technique not previously applied to this problem and could be another useful tool for addressing SE problems.</p>

<p>Overall I see these as pretty interesting results that would be a nice addition to the technical program.</p>

<h1 id="review-3">Review 3</h1>

<p>In this paper, the authors investigate the use of a deep learning model, document vectors (DV), for feature location. The authors compare DV with LDA on 633 queries from 6 versions of 4 software systems.</p>

<p>The paper is clearly written and presents an intriguing new approach worthy of further investigation. Although DV has been applied to source code in the past by White, et al., this work is the first time it has been trained on specific software systems, rather than a larger corpus.</p>

<p>The authors cited a number of FLTs in the related work. This paper would be strengthened by comparing with other approaches. I was disappointed not to see a simple baseline such as tf-idf included. At minimum, the choice of LDA should be justified. Is it the current most accurate FLT out there? Is it the most similar to DV?</p>

<p>Key points:</p>

<ul>
  <li>+ results of study intriguing for future FLT investigation (important for community)</li>
  <li>+ clearly written</li>
  <li>- stronger case if other FLTs included</li>
</ul>

<p>Specific comments:</p>

<p>I applaud the author’s use of the standard IR measure of mean reciprocal rank (MRR) to evaluate their proposed FLT. However, the authors incorrectly attribute the definition of MRR to Poshyvanyk, et al. It would be more appropriate to say something like: “Similar to the study by Poshyvanyk, et al., we use MRR to compare…”</p>]]></content><author><name></name></author><category term="reviews" /><category term="feature location" /><category term="open science" /><category term="doc2vec" /><category term="word2vec" /><category term="lda" /><category term="topic models" /><summary type="html"><![CDATA[I’ve been blessed with a second publication at ICSME’15! This one is an Early Research Achievements (ERA) track paper on using Gensim’s Doc2Vec for feature location.]]></summary></entry><entry><title type="html">Reviews for “Modeling Changeset Topics for Feature Location”</title><link href="https://christop.club/2015/07/03/reviews-for-changeset-feature-location/" rel="alternate" type="text/html" title="Reviews for “Modeling Changeset Topics for Feature Location”" /><published>2015-07-03T00:00:00+00:00</published><updated>2015-07-03T00:00:00+00:00</updated><id>https://christop.club/2015/07/03/reviews-for-changeset-feature-location</id><content type="html" xml:base="https://christop.club/2015/07/03/reviews-for-changeset-feature-location/"><![CDATA[<p>Here are the set of reviews for my <a href="http://www.icsme.uni-bremen.de/">ICSME’15</a>
main track paper! Unfortunately, this bad boy was initially rejected from
<a href="http://www.saner.polymtl.ca/">SANER’15</a>, but we made many changes to the paper
and it got in at ICSME. You can find a link to the PDF, code, and everything
else in my <a href="/publications">publications</a>.</p>

<h1 id="saner-rejection">SANER Rejection</h1>

<h2 id="reviewer-1">Reviewer 1</h2>

<p>This paper proposes to apply topic-modelling-based information retrieval
techniques (i.e., LDA and LSI) for feature location from the incremental
changesets of source code. Because an online learning algorithm based on
changesets is adopted, it is not necessary to retrain frequently to obtain
updated topic models. The authors further conduct an evaluation on 14 open
source Java projects to show the feasibility and effectiveness of the
changesets approach.</p>
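<p><em>Aside: the “online” idea the reviewer summarizes is that the model folds in each changeset as it arrives instead of re-processing the whole snapshot. This toy sketch only maintains term counts incrementally (it is not LDA or LSI), but it illustrates the update pattern:</em></p>

```python
from collections import Counter

class OnlineCorpusModel:
    """Toy stand-in for an online topic model: term counts are updated
    one changeset at a time rather than recomputed from a full snapshot."""

    def __init__(self):
        self.counts = Counter()

    def update(self, changeset_terms):
        # Fold in only the terms touched by this commit.
        self.counts.update(changeset_terms)

model = OnlineCorpusModel()
history = [
    ["diagram", "redraw", "widget"],  # commit 1
    ["redraw", "buffer"],             # commit 2
]
for changeset in history:
    model.update(changeset)
# model.counts now reflects both commits without revisiting commit 1's files.
```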

<p>Overall, this paper presents an interesting idea of using changesets for better
feature location. Although LDA and LSI have been widely investigated in feature
location domain, it is innovative to use the changesets from the version
control system (e.g., SVN or Git) for feature location. The approach of
modelling changeset topics is originally from reference [7]. This paper’s
contributions mainly lie in the application of the approach of modelling
changeset to feature location problem.  The evaluation also seems to be solid.
The authors publish the experiment data for public review. In the evaluation,
it is good to use Wilcoxon signed-rank test with Holm correction to determine
the statistical significance of the difference between results from LDA and
LSI. However, as the authors mention in the evaluation, few of the evaluated
systems presented a statistically significant difference between the
snapshot-based approach and the changeset-based approach.</p>
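<p><em>Aside: the Holm correction mentioned above adjusts p-values when many hypothesis tests are run together. A small pure-Python sketch of the step-down adjustment (the example p-values are made up):</em></p>

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for k, idx in enumerate(order):
        adj = min(1.0, (m - k) * pvalues[idx])
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

adjusted = holm_adjust([0.01, 0.04, 0.03])
```

<p><em>An adjusted value is compared against the usual 0.05 threshold just like a raw p-value.</em></p>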

<p>The following issues need to be clarified: First, in this paper the authors use
commit messages in Git and SVN as the representation of a changeset in a version
control system. Although the information among multiple versions of the project
is used, the paper still focuses on feature location in a single version of the
product. My concern is that if a feature that needs to be located is involved
in several changes, how well can the proposed approach handle it? The authors
may also want to show the effectiveness of the approach for features that are
relevant and irrelevant to commit messages.</p>

<p>Second, the authors may also need to relate their work to the work on feature
location on multiple versions of products. They may refer to the following
literature and discuss the application of modelling changeset topics for
feature location in multiple versions.</p>

<p>Yinxing Xue, Zhenchang Xing, Stan Jarzabek: Feature Location in a Collection of
Product Variants. WCRE 2012: 145-154</p>

<p>Third, for evaluation, the authors may try different parameter setups and
measures of retrieval accuracy. Currently, the number of topics is set to 500.
Actually, 500 topics may work well for normal documents based on natural
language like English (see S.T. Dumais. LSI meets TREC: A status report, in
Proceeding of Text Retrieval Conference, pp. 137-152. 1992), but a larger
number of topics may be preferred for information retrieval on source code,
considering the greater number of identifiers in source code. With regard to
the measures, the authors only use the mean reciprocal rank (MRR). The authors
may also consider some measures used in the information retrieval domain, like
Percentage of Relevant Queries (PRQ), Mean Average Precision (MAP), and Average
Percentage of Code Units Investigated (APCUI). The different measures may
reveal different aspects of the results.</p>

<p>Below are also some detailed comments on the presentation and language of the paper:</p>

<ol>
  <li>
    <p>In introduction, in the last third paragraph, “Our results show that not
only is our changeset approach feasible and practical, but in some cases
out-performs current snapshot approaches.” Here, the authors should be more
specific about the cases in which the proposed approach performs better.</p>
  </li>
  <li>
    <p>The approach section is a bit too simple. You may add more details, or merge
it with some content in Section II.A and Section II.B.</p>
  </li>
  <li>
    <p>In the fourth paragraph of section IV.C, in the first sentence, “we our
partitioning is inclusive of that commit.” should be “our partitioning is
inclusive of that commit.”</p>
  </li>
</ol>

<h2 id="reviewer-2">Reviewer 2</h2>

<p><em>Note: I had to leave this beauty verbatim…</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Review follows A.J. Smith 4/90 IEEE/Computer 

Recommendation. 
maybe

Summary and Significance. 
 What is the purpose?   Is the the problem clearly stated? 
	Incremental modeling of text-based retrieval systems for program comprehension. This is a significant goal for the SANER audience.

Is there an early description of the accomplishments? 
          No. in particular, the authors fail to mention that the method works only sometimes. 

 Is the problem new?   Using I/R for program comprehension is not; the incremental change set approach is.

     Has the design been built before?  no

     Has the problem been solved before?  no

     Is this a trivial variation on or an extension of a previous result?  no

    Is the author aware of of related work? yes

     Does the author cite previous work and make distinctions from  it?  yes

    If an implementation, are there new ideas?  yes

 Is the method of approach valid?  yes

        Is the approach sufficient for the purpose? yes

        Sufficient discussion of new ideas? no; reasons for failure to reject null hypothesis need to be clarified.

Is the actual execution of the research correct? yes

      Algorithms correct? Convincing?  yes

      Did the author do what was claimed?  no

 Are the correct conclusions drawn from the results?  no

      What are the applications/implications of the results?  I'm not sure.

      Adequate discussion of these results? 
         There is discussion of what happened; not why it happened.

 Is the presentation satisfactory? 

      Readability? yes

      Does abstract describe the paper?  please use a structured abstract.

     Does the introduction describe the problem and the framework?  yes

     Appropriate amount of detail?  yes

     Figures/tables appropriate? too many.

     Self-contained? yes
</code></pre></div></div>

<h2 id="reviewer-3">Reviewer 3</h2>

<p>The authors present an incremental topic-model approach to feature location
based on change sets. They evaluate the technique by comparing change sets,
snapshots, and temporal change sets.</p>

<p>Excellent job sharing materials and making the work replicable by others.</p>

<p>Although the writing was clear, it was difficult to follow the thread of the
research and how the study design answered the research questions. Especially
missing are the big takeaway messages — what should a researcher or
practitioner take away from this study in using change sets or snapshots for
FLT?</p>

<p>Specific comments:</p>

<p>The explanation in section III seems unclear. Intuitively, I would think the
topic model is run once on a snapshot, and then run incrementally on all the
change sets after that point (up to the commit being searched). This approach
is hinted at in the introduction (“online topic models can be instantiated once
and incrementally updated over time.“) However, the wording in the following
sentence:</p>

<p>“The changeset topic modeling approach requires two types of document
extraction: one for the snapshot of the state of source code at a commit of
interest, such as a tagged release, and one for the every changeset in the
source code history leading up to that commit.”</p>

<p>Sounds like topic modeling is run on all the changes leading up to the
snapshot. Is this the target usage scenario? Please clarify the writing to make
the target usage scenario &amp; algorithmic steps more clear. Figure 1 is a good
start, but doesn’t clearly show how the change sets are involved. Figure 1
seems to show that the topic modeler is run on the whole snapshot every time,
which I thought the purpose of the work was to avoid this?</p>

<p>I think the key insight behind the approach — “The key intuition to our
approach is that a topic model such as LDA or LSI can infer any given
document’s topic proportions regardless of the documents used to train the
model.” — needs to be expanded. Isn’t this idea one of the main contributions
of the work? A concrete example showing why this intuition is valid would help.</p>

<p>In section IV.C., the purpose of \theta_Queries is not yet clear, and it is
difficult to see how this fits into the larger study. It would be helpful if
there were a big-picture paragraph in the methodology section describing the
parts of the study and how they are used to answer the research questions
before diving into the details. For instance, in this section I don’t yet know
what temporal simulations look like, although that is one of the contributions
of the work. It seems as if someone within the research team would perfectly
comprehend section IV.C, but it is not written so that a reader familiar with
feature location can discern what is being evaluated and why when reading the
paper from beginning to end.</p>

<p>Section IV.E: “To answer RQ1, we run the experiment on the snapshot and
changeset datasets as outlined in Section IV-C. We then calculate the MRR
between the two.” What two? How does this comparison help us answer RQ1? And
then: “To answer RQ2, we run the experiment temporally as outlined in Section
IV-C” the high-level goals of the temporal experiment and how it differs from a
traditional experiment have not yet been described. Why are traceability links
important to answering the research questions? It seems that the authors had
some trouble making use of the Moreno data set. What is the advantage to
keeping it in? More replications? Why include both Tables I &amp; II, if only the
data from Table II is used in the study?</p>

<p>It seems as if some of the high-level information I’m looking for might be
partially buried in the discussion section in G, rather than being up front to
help the reader understand the design of the experiment.</p>

<p>The work of Rao et al. seems closely related. In section II, can you
differentiate why such an approach is less desirable than your proposed
approach? (or evaluate?)</p>

<p>Typos:</p>

<ul>
  <li>p. 5 C: “the process is varies slightly”</li>
  <li>p. 5 C: “we our partitioning is“</li>
  <li>conclusion: In this paper WE? conducted a study</li>
</ul>

<h2 id="the-mytical-reviewer-4">(the mythical) Reviewer 4</h2>

<p>Dear authors,</p>

<p>We would like to thank you for your submission, which has led to a lively
discussion in the program committee. The main concerns raised by the committee
pertain to:</p>

<ul>
  <li>
    <p>the paper’s claim that the proposed approach analyzes multi-versions of
changeset data, yet it seems that the paper did not really make good use of
multi-version changeset data in the proposed approach and in the evaluation.</p>
  </li>
  <li>
    <p>the fact that one of the reviewers familiar with this domain was not able to
understand the approach since the paper has multiple issues making key points
clear</p>
  </li>
</ul>

<h1 id="icsme-acceptance">ICSME Acceptance</h1>

<h2 id="reviewer-1-1">Reviewer 1</h2>

<p>The authors present a new approach in the context of feature location. They use
information available in a software configuration management system to
incrementally perform concept location, thus reducing the time to perform such
a task. I found the idea behind the authors’ proposal very interesting even if
it is not completely new in the context of software maintenance. The results
support the validity of the new approach. The paper flow is adequate even if in
some points I had some difficulties. Because of these difficulties I was not
able to be completely confident with the work done. Also, further details and
justifications could be provided by the authors in the experimental part of the
paper. All in all, I’m happy enough with the work done. It is one of the best
papers I have reviewed so far this year at ICSME.</p>

<p>In the following I’ll elaborate on the weak points I see. I hope the
authors will find them useful.</p>

<p>In the motivation part of the introduction, there are some points that seem to
contrast with each other. In particular, the authors wrote: “Indeed, given the
current state-of-the-art in TR, it is impossible for an FLT to satisfy all
three criteria while following the standard methodology.” Did Rao [10] and
Hoffman et al. [9] make a contribution to satisfying all three criteria?
Reading the paper (and the Introduction, in particular) it seems YES.</p>

<p>Online (using fold-in and fold-out) LSI has also been applied in the context of
architecture recovery. Mentioning this paper in the introduction section could
further motivate your work:</p>

<p>Michele Risi, Giuseppe Scanniello, Genoveffa Tortora: Using fold-in and
fold-out in the architecture recovery of software systems. Formal Asp. Comput.
24(3): 307-330 (2012)</p>

<p>The part where the approach is highlighted in the introduction section needs to
be rewritten because in its current form it is not easy to follow. I read that
paragraph again and again, but my comprehension level did not change: completely
unclear.</p>

<p>Please discuss [10] and [18] in more detail in the related work section. In
addition, it is not completely clear to me what the difference is between the
proposed approach and [28].</p>

<p>Regarding the experimental part of the paper, I found it very hard to understand
the methodology (especially the second paragraph). In the last paragraph, the
authors mentioned the dataset by Dit et al. Was the dataset by Moreno et al.
treated differently? Why?</p>

<p>Reading the description of the experiment, I was not able to understand whether
the authors simulated the use of GitHub. I mean, were all the applications and
the change sets in the used datasets in GitHub?</p>

<p>Last paragraph in section IV.E is not clear. I mean the place where the authors
justify why RQ2 has been studied only on one dataset.</p>

<p>In section IV.F, the authors discussed the fact that the p-value was greater
than 0.05. In particular, they wrote: “This suggests that changeset topics are
just as accurate as snapshot topics at the method-level, especially since there
is a lack of statistical significance for any of the cases.”  Since the null
hypothesis has not been rejected, the authors can only discuss descriptive
statistics. That is, it seems that the authors accept the null hypothesis, and
this is definitely incorrect.</p>

<p>A statistical test (i.e., that chosen) verifies the presence of significant
difference between two groups (in your case), but it does not provide any
information about the magnitude of such a difference (if present). The
magnitude of such a difference could be computed using a (non-parametric)
effect size measure (e.g., Cliff’s d). You could also use the average
percentage improvement/reduction.</p>
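<p><em>Aside: Cliff’s d, which the reviewer suggests here, is a non-parametric effect size: the probability that a value from one group exceeds a value from the other, minus the reverse. A minimal sketch with made-up samples:</em></p>

```python
def cliffs_delta(xs, ys):
    """Cliff's d: P(x > y) - P(x < y), estimated over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

d = cliffs_delta([3, 4, 5], [1, 2, 3])  # 8 of 9 pairs favor the first group
```

<p><em>Values near 0 mean negligible effect; values near ±1 mean one group almost always dominates the other.</em></p>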

<p>Why did the authors not analyze execution time?</p>

<p>In the threats to validity you should also consider biases related to the
statistical analysis performed (conclusion validity). The readability could be
improved by organizing threats into: Internal, External, Conclusion, and Construct.</p>

<p>Typing and formatting minor issues:</p>

<p>At the end of section III.C, there is (between brackets) a strange symbol.</p>

<p>Figure 2 is not comprehensible if the paper is printed in black and white.</p>

<p>Please remove orphans.</p>

<p>Section 4.B - the description of the experimental objects does not read well
as currently written.</p>

<h2 id="reviewer-2-1">Reviewer 2</h2>

<p>This paper proposes a topic-modeling-based feature location technique in which
the text retrieval model (i.e., topic model) is built incrementally from source
code history. The technique uses an online learning algorithm to train topic
models based on change sets, and thus can maintain an up-to-date model without
incurring computational cost associated with retraining traditional
snapshot-based topic models. The proposed technique has been evaluated and the
results indicate that the accuracy of the technique is similar to that of a
snapshot-based feature location technique.</p>

<p>This paper reports an interesting exploration of applying incrementally built
topic models for feature location. It has the potential of improving current
IR-based feature location methods with lower computational cost for building text
retrieval models. But I think the paper still has much room for improvement.</p>

<p>First, the motivation of the paper is not clear, and it is not well reflected
in the evaluation. It seems that the main benefit of the proposed technique is
the saving of computational cost associated with retraining traditional
snapshot-based topic models. However, there is no analysis of how much
computational cost can be saved. If the training of a snapshot-based topic
model only takes a short time (e.g., several minutes), it is acceptable that
the topic model is retrained for each release. Moreover, the saving of
computational cost is not evaluated in the experimental study.</p>

<p>Second, the proposed technique is not well described. In the section presenting
the technique (Section III), Sections III-A and III-C respectively present
terminology and the rationale for using change sets. Section III-B introduces
the proposed technique itself, and it is very short. Some important details
are missing; for example, how is the changeset corpus combined with the snapshot
corpus when training topic models? The process described in Figure 1 (B) does not
reflect the incremental manner of the proposed technique.</p>

<h2 id="reviewer-3-1">Reviewer 3</h2>

<p>The paper presents a topic-modeling-based Feature Location Technique (FLT)
where, to reduce the computational cost, the model is updated incrementally
from the changesets of commits from the project history instead of entire
snapshots. The approach is evaluated on 1,200 defects from a publicly available
dataset (14 open-source Java projects) and is shown to exhibit accuracy
not lower than the accuracy of more traditional models built on entire
snapshots. The data and source code for the analysis are provided in an online
appendix.</p>

<p>The idea is novel and the approach has potential. Not much work has addressed
the issue of incremental model building in IR based feature location (the paper
misses some related work – see below). The motivation behind building a model
incrementally is to reduce the computational cost of rebuilding a model from
every snapshot. The approach presented in the paper is sensible and the results
indicate that it is a direction worth following. However, the paper also has
several points where it needs some improvement.</p>

<p>The original motivation suggests that the changesets will update the model
incrementally. My expectation was that every changeset will be considered
separately, i.e., the model will be updated using a changeset. However, neither
of the two research questions actually evaluates the approach in that setting.
In RQ1 the changeset-based model is built using all changesets at once. In RQ2,
the changesets are grouped into partitions based on the bug report that they
are linked to, and the model is updated using a partition. The first question
here is: why group changesets, and why not update the model after each
commit? And if a grouping is to be made, why not approximate a more
realistic setting, i.e., update the model with every 10 consecutive commits,
for example? Consecutive commits will address different bugs and thus will
certainly have different topic distributions. My doubt here is to what extent
the grouping in RQ2 may have introduced a bias in the results. By the way,
the part describing the historical simulation is somewhat confusing – at least
I had to read it twice to fully understand what exactly is being done.</p>

<p>When investigating the accuracy of the models built on the changesets, the
thresholds are selected without justification or tuning. For instance, the
number of topics in all analyzed projects is fixed at 500. The paper
justifies the lack of parameter tuning with the fact that the “goal is to show
the performance of the changeset-based FLT against snapshot-based FLT under the
same conditions” and that “the measurements collected are fair and that the
results are not influenced by selective parameter tweaking”. However, poor
selection of the parameters may lead to poor results and thus unrealistic
optimism that the proposed changeset-based FLT performs as well as traditional
snapshot-based FLTs. This doubt is somewhat confirmed by the results shown in
Tables 1 and 2: the Mean Reciprocal Rank (MRR) is used to measure the
effectiveness of an FLT for a set of queries; the higher the value, the better
the result. The values for MRR shown in Tables 1 and 2 are quite low, and this
is true for both models. For example, for the project Pig v0.8.0, the MRR is ~
0.011. This score of MRR would mean that the minimum rank for a relevant class
would be on average ~ 90 (out of 442 classes in this project). The MRR reported
by Moreno et al. varies depending on the settings and type of information that
is considered but stays between 0.18 and 0.26 for the same project. This
corresponds to ranks 6 and 4 (again out of 442). Thus, the doubt here is that
the results of the snapshot-based FLT using the selected parameters are poor
and the only thing that one can conclude is that the changeset-based FLT is not
making the poor results worse. Now whether the poor results are due to the
underlying techniques (i.e., LDA and/or LSI) or only to the parameter selection
is not clear, but it is probably worth investigating.</p>

<p>RQ1 should perhaps be rephrased as a hypothesis: “Changeset-based FLT is less
accurate than snapshot-based FLT.” The data then shows that this cannot be
confirmed.</p>

<p>Regarding RQ2, it is not clear how the accuracy’s “fluctuation” of the CFL
technique is measured as a project evolves. I do not think the MRR metric by
itself measures such fluctuation, or at least this is not explained in the
paper. The MRR only measures accuracy. I would think that series analysis on
the MRRs across time would be the way to go or other analysis of this kind.
Now, it seems that the goal was onlu to compare the accuracy when changesets
are used to incrementally update the topic model, as opposed to update the
model at once with all the changesets. Unfortunately, it is not clear whether
what the goal really is. I suggest to clarify this issue and perhaps
reformulate RQ2. After all, the main goal of the paper is to test how the CFL
would perform in a realistic environment where the model is incrementally
updated with changes in commits.</p>

<p>The paper omits the LSI results “for brevity”. If they are omitted completely,
it is best not to even mention them; at the very least, mention how they
compare with the LDA results.</p>

<p>Detailed comments:</p>

<p>p1:</p>

<ul>
  <li>“By training an online learning algorithm using changesets, the FLT maintains
an up-to-date model without incurring the non-trivial computational cost
associated with retraining traditional FLTs.”: As shown in Fig. 1 the
snapshots are still used for indexing. Thus, the computational cost is saved
in the process of building the topic model. What exactly is the saved
computational cost? To better motivate the paper, I recommend giving a
citation or an example of how long it takes to create a topic model for a
large system such as Eclipse using LDA. It is also a good idea to report
the cost savings of the online LDA technique compared to standard LDA.</li>
  <li>“It follows from the first two observations (1: Like a class/method
definition, a changeset has program text; 2: Unlike a class/method
definition, a changeset is immutable.) that it is possible for an FLT
following our methodology to satisfy all three of the criteria above. “: It
is not clear how the first criterion is satisfied, i.e., “(1) accurate like a
TM-based FLT”</li>
  <li>“We then used a subset of over 600 defects and features to conduct a
historical simulation that demonstrates how the FLTs perform as a project
evolves.”: Why 600?</li>
  <li>The preprocessing often includes stemming, but stemming is not mentioned
here. Later (p.6, Section IV Study) it becomes clear that no stemming is
applied, without justification of why.</li>
</ul>

<p>p2:</p>

<ul>
  <li>“Normalizing: replace each upper case letter with the corresponding lower
case letter”: Lawrie et al. use “normalization” for vocabulary normalization
(i.e., the alignment of the vocabulary found in source code with that found
in other software artifacts). See: D. Lawrie, D. Binkley, and C. Morrell.
Normalizing source code vocabulary. In Proceedings of the Working Conference
on Reverse Engineering (WCRE), pages 3-12, 2010</li>
  <li>“corpus is a set of documents (i.e., methods)”: “i.e.,” -&gt; “e.g.,”</li>
</ul>

<p>p5:</p>

<ul>
  <li>Section IV.C. (Methodology) can be broken down into subsections based on the
RQs.</li>
  <li>To answer RQ2 (Does the accuracy of a changeset-based FLT fluctuate as a
project evolves?), the paper describes the so-called historical simulation
where commits are related to each query (or issue) and partitions of
mini-batches of changesets are created. The model is then updated using a
mini-batch. An index of topic distributions with the snapshot corpus is then
inferred. I don’t understand why for the historical simulation, commits are
grouped into partitions of mini-batches instead of updating the model after
every commit.</li>
  <li>“on all documents extracted.” -&gt; extracted documents</li>
</ul>

<p>p6:</p>

<ul>
  <li>The paragraph starting with “After extracting tokens, we split … “ is not
needed. The preprocessing, except the stemming, is already explained in
Section II.A.</li>
  <li>Thresholds are missing justifications: K, the number of topics, is chosen to
be 500; the two parameters that control how much influence a new mini-batch
has on the model when training are 1024 and 0.9. No justification is given
for the selected values. What are the values selected in related work?</li>
</ul>

<p>p10:</p>

<ul>
  <li>Ref. [2]: the publication date is 2013.</li>
  <li>The references should be consistent. For example, the venue of the references
7, 17, 19 and 20 have the following form: “Software Engineering, IEEE
Transactions on”; instead of “IEEE Transactions on Software Engineering”.</li>
</ul>

<p>Missing references to related work:</p>

<p>Hsin-yi Jiang, Tien N. Nguyen, Carl K. Chang, and Fei Dong, “Traceability Link
Evolution Management with Incremental Latent Semantic Indexing”, in Proceedings
of the 31st IEEE International Computer Software and Applications Conference
(IEEE COMPSAC 2007), pages 309-316, July 24-27, 2007</p>

<p>Hsin-yi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang,
“Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution
Management”, in Proceedings of the 23rd ACM/IEEE International Conference on
Automated Software Engineering (ACM/IEEE ASE 2008), September 15-19, 2008</p>

<p>Ratanotayanon, Sukanya, Hye Jung Choi, and Susan Elliott Sim. “Using transitive
changesets to support feature location.” Proceedings of the IEEE/ACM
International Conference on Automated Software Engineering, 2010</p>]]></content><author><name></name></author><category term="reviews" /><category term="mining software repositories" /><category term="changesets" /><category term="feature location" /><category term="open science" /><category term="lda" /><category term="topic models" /><summary type="html"><![CDATA[Here are the set of reviews for my ICSME’15 main track paper! Unfortunately, this bad boy was initially rejected from SANER’15, but we made many changes to the paper and it got in at ICSME. You can find a link to the PDF, code, and everything else in my publications.]]></summary></entry><entry><title type="html">papers.bib or: How I Learned to Stop Worrying and Love the Reference Manager</title><link href="https://christop.club/2015/05/23/papers-bib/" rel="alternate" type="text/html" title="papers.bib or: How I Learned to Stop Worrying and Love the Reference Manager" /><published>2015-05-23T00:00:00+00:00</published><updated>2015-05-23T00:00:00+00:00</updated><id>https://christop.club/2015/05/23/papers-bib</id><content type="html" xml:base="https://christop.club/2015/05/23/papers-bib/"><![CDATA[<p>I recently completed and passed my phd thesis proposal. During my time
struggling to get myself together and organized, I gave up on trying to manage
a BibTeX file by hand. Here, I’m going to describe the software and strict
workflow I’ve been using to manage a single thesis bibliography, <code class="language-plaintext highlighter-rouge">papers.bib</code>.</p>

<p>My criteria were the following:</p>

<ol>
  <li>Cross-platform</li>
  <li>Open source as heck</li>
  <li>Keeps all machines in sync</li>
  <li>Easy to restore an old version</li>
  <li>On-demand opening/viewing of the source PDF</li>
  <li>Usable <em>offline</em></li>
</ol>

<h1 id="software">Software</h1>

<p>I use three primary programs to manage my <code class="language-plaintext highlighter-rouge">papers.bib</code> file:</p>

<ol>
  <li><a href="http://jabref.sourceforge.net/">JabRef</a>
    <ul>
      <li>I chose JabRef as the reference manager because it is open source,
cross-platform, and can open PDFs or URLs directly.</li>
      <li>As an added plus, it has support for plugins and a BibTeX downloader.</li>
    </ul>
  </li>
  <li><a href="http://git-scm.com/">git</a>
    <ul>
      <li>git was the obvious choice for versioning of the <code class="language-plaintext highlighter-rouge">papers.bib</code>.</li>
      <li>Easy to back up on Github/Bitbucket/whatever.</li>
    </ul>
  </li>
  <li><a href="https://syncthing.net/">Syncthing</a>
    <ul>
      <li>Syncthing was chosen because it was a solid open-source replacement for
Dropbox.</li>
      <li>Used for keeping previously downloaded PDF files in sync.</li>
    </ul>
  </li>
</ol>

<p>I can satisfy all my criteria with a combination of these three tools.</p>

<h2 id="setup">Setup</h2>

<p>The first thing I do (did?) is to start a git repository in <code class="language-plaintext highlighter-rouge">~/papers/</code>. In
this folder I place my main <code class="language-plaintext highlighter-rouge">papers.bib</code> file for JabRef to manage.</p>
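As a concrete sketch, the one-time setup amounts to a few commands. These are my own approximation of the steps described in this post (the git identity flags are only there so the snippet runs on a fresh machine):

```shell
# One-time setup for the central bibliography (a sketch, not a prescribed script).
mkdir -p "$HOME/papers"
cd "$HOME/papers"
git init -q
touch papers.bib                  # the single central bibliography JabRef manages
printf '*.pdf\n' > .gitignore     # git skips PDFs; Syncthing will sync them instead
git add papers.bib .gitignore
git -c user.name=me -c user.email=me@example.com commit -q -m "start papers.bib"
```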

<h3 id="jabref">JabRef</h3>

<p>There’s only one essential configuration option I rely on: the BibTeX key
generator. In JabRef’s preferences, I set the default pattern to be
<code class="language-plaintext highlighter-rouge">[auth.etal]_[year]</code>. You can use whatever you fancy, but be sure to use
something with file system safe characters (e.g., avoid special characters like
<code class="language-plaintext highlighter-rouge">:</code>).</p>
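As a toy illustration of what that pattern produces — this is my own approximation, not JabRef's actual key generator, and the helper name is made up:

```python
# Toy approximation of JabRef's [auth.etal]_[year] key pattern.
# (JabRef's real generator handles more edge cases; this is only a sketch.)
def bibtex_key(authors, year):
    """Surname of the first author, '.etal' if there are coauthors, then the year."""
    surname = authors[0].split()[-1]       # last word of the first author's name
    suffix = ".etal" if len(authors) > 1 else ""
    return f"{surname}{suffix}_{year}"

print(bibtex_key(["Christopher Corley", "Nicholas Kraft"], 2015))  # Corley.etal_2015
print(bibtex_key(["Ada Lovelace"], 1843))                          # Lovelace_1843
```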

<h3 id="syncthing">Syncthing</h3>

<p>I set up <code class="language-plaintext highlighter-rouge">~/papers</code> for syncing within Syncthing. In Syncthing, you can set up
<em>ignore patterns</em> for files it should ignore during sync. I use these:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.git
.git*
papers.bib
</code></pre></div></div>

<p>This means it will skip git-related stuff and the main <code class="language-plaintext highlighter-rouge">papers.bib</code>, but <code class="language-plaintext highlighter-rouge">git</code>
will take care of all of those.</p>

<h3 id="git">git</h3>

<p>Likewise, I set git up to ignore the PDF files by placing this into the
<code class="language-plaintext highlighter-rouge">.gitignore</code>: <code class="language-plaintext highlighter-rouge">*.pdf</code>. Syncthing will be managing those PDFs.</p>

<h1 id="workflow">Workflow</h1>

<p>When I run across a paper I want to cite, or even <em>read</em>, I download all of the
following:</p>

<ol>
  <li>
    <p>BibTeX: sometimes I copy in an entry directly from the web source, and other
times I use the JabRef web search &amp; downloading feature. <em>Very rarely</em> have
I needed to make entries manually. Manual entries are usually theses or
books.</p>
  </li>
  <li>
    <p>DOI/URL: All papers must have the DOI or URL. If I by chance lose a PDF, I
can find the original source again.</p>
  </li>
  <li>
<p>PDF: Every paper must have the PDF associated with it. The only exceptions
are library sources, for which, <em>if possible</em>, I make sure the URL points to
the library’s online catalog entry or a place to buy it online (Amazon).</p>
  </li>
</ol>

<p>Once the BibTeX is in JabRef, the first thing I make sure to do is insert the
DOI/URL if it doesn’t already have one. One thing I noticed is that sites like
<a href="http://ieeexplore.ieee.org">IEEEXplore</a> don’t include the DOI in the downloaded BibTeX, but list it on
the paper’s web page. I make sure to grab that.</p>

<p>The second thing I do is attach the PDF file. If you right click an entry,
“Attach file” will be in the menu. Normally, the downloaded PDF name is
horrible and gross. Hence, I use a handy plugin to help with that.</p>

<h2 id="renamefile">renameFile</h2>

<p>There’s a plugin that is critical for my JabRef use: <a href="https://github.com/korv/Jabref-plugins">renameFile</a>.</p>

<p>renameFile comes with two configuration options: folder and name pattern. I use
both, leaving the “folder” blank (i.e., it uses whatever directory <code class="language-plaintext highlighter-rouge">papers.bib</code>
is in to place PDFs). Because of the way we’ve configured JabRef’s key
generation option, I leave the name pattern as <code class="language-plaintext highlighter-rouge">[bibtexkey]</code>.</p>

<p>After attaching a file, I simply hit “rename” in the plugin window, verify the
file is being renamed as expected, and I’m done. It will rename the PDF file to
match the BibTeX key and move the file to the <code class="language-plaintext highlighter-rouge">~/papers/</code> folder. Yay!</p>

<h2 id="git-commit">git commit</h2>

<p>After I finish up adding new sources or finish writing for the day, I make sure
to check in the <code class="language-plaintext highlighter-rouge">papers.bib</code> file into git. When committing, <strong><em>I always check
the <code class="language-plaintext highlighter-rouge">git diff</code> to make sure nothing was removed, only added</em></strong>. That last bit
is critical, because it can tell you when something is amiss. I also push the
changes to a public-facing <a href="https://github.com/cscorley/papers">Github repository</a>.</p>
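That “additions only” check can be scripted; here is a self-contained sketch of the idea in a scratch repository (the entry keys are made up, and the identity flags just let it run anywhere):

```shell
# Verify that a day's edits to papers.bib only add lines, never remove them.
repo="$(mktemp -d)"
cd "$repo"
git init -q
printf '@misc{corley_2014,}\n' > papers.bib
git add papers.bib
git -c user.name=me -c user.email=me@example.com commit -q -m "start"
printf '@misc{corley_2015,}\n' >> papers.bib          # today's new entry
removed=$(git diff -- papers.bib | grep -c '^-[^-]')  # removed lines (skips the '---' header)
echo "lines removed: $removed"                        # 0 means nothing was lost
```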

<h1 id="writing-and-collaboration">Writing and collaboration</h1>

<p>Having a single, dedicated <code class="language-plaintext highlighter-rouge">papers.bib</code> comes with one major caveat when trying
to collaborate: people are going to insert things into the <em>working
bibliography</em>, hence breaking the workflow of the <em>central bibliography</em>
entirely! I’m not sure there’s much to do about that, but here’s my current
workaround.</p>

<p>Each paper I work on has its own separate git repo. I always merge in my
<code class="language-plaintext highlighter-rouge">papers.bib</code> file as the “main” source and check it into git. That means <em>git
is managing two separate versions of the “same file” in two separate repos</em>,
which can certainly be confusing. Luckily, <code class="language-plaintext highlighter-rouge">diff</code> makes it easy to determine
the differences between the working and central bibliographies.</p>

<p>Whenever someone makes a change to the working bibliography, I make sure to
<em>immediately</em> merge the new entries into my central bibliography by following
the workflow I describe above. If it is going to be in a paper with my name in
it, I am going to have it for future reference. I do this by literally checking
<code class="language-plaintext highlighter-rouge">diff -us ~/papers/papers.bib path/to/collab/papers.bib</code> manually every time I
begin writing. I know, this part sucks. You could also make sure by checking
<code class="language-plaintext highlighter-rouge">git whatchanged</code> after a <code class="language-plaintext highlighter-rouge">git pull</code>.</p>

<p>After the new changes are merged into the central bibliography, I overwrite the
working one with the central one. This ensures I can see whenever a change is
introduced after I add in the DOI, URL, or PDF fields.</p>
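Mechanically, the round trip is just compare, merge, overwrite. A self-contained sketch with throwaway files standing in for the central and working bibliographies (the entry keys are invented):

```shell
# Compare the central and working bibliographies, merge, then overwrite.
tmp="$(mktemp -d)"
printf '@misc{a,}\n@misc{b,}\n' > "$tmp/central.bib"   # central: entries a, b
printf '@misc{a,}\n@misc{c,}\n' > "$tmp/working.bib"   # a collaborator added c
diff -u "$tmp/central.bib" "$tmp/working.bib" || true  # 1) spot the new entry c
printf '@misc{c,}\n' >> "$tmp/central.bib"             # 2) merge c into central (done by hand in JabRef)
cp "$tmp/central.bib" "$tmp/working.bib"               # 3) central overwrites working
```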

<h1 id="summary">Summary</h1>

<p>I know that seems like a lot of work – oh, it is – but trust me, it becomes
so much easier to use after it is set up and working.  Be vigilant in
maintaining it and future you will thank you for having a central source for
the references, along with links and PDFs.</p>

<p>One immediate need I’ve noted has to do with collaboration. While the workflow
worked really well for my proposal as I was the only one working on it,
collaboration immediately exposed flaws. For now, I’m manually working around
this limitation.</p>]]></content><author><name></name></author><category term="writing" /><category term="references" /><category term="bibliography" /><category term="bibtex" /><category term="jabref" /><summary type="html"><![CDATA[I recently completed and passed my phd thesis proposal. During my time struggling to get myself together and organized, I gave up on trying to manage BibTeX file by hand. Here, I’m going to describe the software and strict workflow I’ve been using to manage a single thesis bibliography, papers.bib.]]></summary></entry><entry><title type="html">My reviews from ICMSE2014 Tool track</title><link href="https://christop.club/2015/01/09/my-reviews-from-icsme2014-tools/" rel="alternate" type="text/html" title="My reviews from ICMSE2014 Tool track" /><published>2015-01-09T00:00:00+00:00</published><updated>2015-01-09T00:00:00+00:00</updated><id>https://christop.club/2015/01/09/my-reviews-from-icsme2014-tools</id><content type="html" xml:base="https://christop.club/2015/01/09/my-reviews-from-icsme2014-tools/"><![CDATA[<p>This past year, I had the privilege to serve on the
<a href="http://www.icsme.org/">ICSME2014</a> Tool demo track.</p>

<p>Of the four papers I helped review, two were accepted. Here are those reviews.</p>

<h2 id="paper-1">Paper 1</h2>

<p><em><a href="http://dx.doi.org/10.1109/ICSME.2014.110">Context-sensitive Code Completion Tool for Better API Usability</a></em></p>

<p>By Muhammad Asaduzzaman, Chanchal K. Roy, Kevin Schneider and Daqing Hou.</p>

<ul>
  <li>Overall: 3 (strong accept)</li>
  <li>Confidence: 3 (medium)</li>
</ul>

<p>This paper presents a tool for code completion. In particular, it builds
a model of common patterns of API usage and uses the context of the code
currently being written to find a similar pattern for suggestions.
The benefits of this model are that the autocompletion is quick and that it
can recommend without needing to know what the developer is looking for
(e.g., any method starting with a typed letter).</p>

<p>Suggestions for improvement:</p>

<ul>
  <li>References 10-13 would be better off as footnote URLs.</li>
  <li>There is a bad citation at the top of the second column of the first page.</li>
</ul>

<p>Overall, this paper is clean and straightforward. I like the use of the
context of the code currently being written. While the demo video was
geared toward code being written for the first time, I wonder how it
performs in a maintenance context.</p>

<h2 id="paper-2">Paper 2</h2>

<p><em><a href="http://dx.doi.org/10.1109/ICSME.2014.107">Reviewer Recommender of Pull-Request in GitHub</a></em></p>

<p>By Yue Yu, Huaimin Wang, Gang Yin and Charles Ling.</p>

<ul>
  <li>Overall: -1 (weak reject)</li>
  <li>Confidence: 4 (high)</li>
</ul>

<p>This paper presents a tool for automatically recommending code reviewers
to pull requests (PR) on Github. A reviewer is considered as anyone that
has commented on a PR in the past. Using past PRs, they combine the
semantic similarity of the text of the new PR and the social network of
developers of previous PRs. The semantic similarity is a simple VSM.
The social network is built by extracting developer mentions in the
comments. They report on a study of several popular Github projects,
reaching 0.74 precision and 0.71 recall for top-1 and top-10
recommendation, respectively.</p>

<p>Problems:</p>

<ul>
  <li>
    <p>In the approach, what stemmer is used?</p>
  </li>
  <li>
    <p>What are the list of stopwords?</p>
  </li>
  <li>
    <p>It is unclear if developers commenting on their own PR are included.
Several projects use Github PRs as a code review tool, and
a conversation occurs between contributors, including the PR
requester. Including or excluding the original requester based on
their developer status at the time may affect the results.</p>
  </li>
  <li>
    <p>It is unclear exactly how the recommendation from the vector space
model is combined with the social network. Is more weight put into
the semantic similarity or the network? Subsection 3-D, reviewer
recommendation, needs elaboration. It is a key factor to how the
approach works.</p>
  </li>
  <li>
    <p>I could not find a way to download and use the tool on the given
website. How do I run this on my own projects? The website presented
seems mostly like a browser for output of the actual tool.</p>
  </li>
  <li>
    <p>The demo video seems more of a presentation than a demo. Perhaps this
is due to my previous bullet point.</p>
  </li>
</ul>

<p>Overall, I think the approach is interesting. But I don’t see how I can
apply this tool on other projects.</p>]]></content><author><name></name></author><category term="reviews" /><category term="icsme" /><category term="open science" /><summary type="html"><![CDATA[This past year, I had the privilege to serve on the ICSME2014 Tool demo track.]]></summary></entry><entry><title type="html">Reviews from MUD2014</title><link href="https://christop.club/2015/01/09/reviews-from-mud2014/" rel="alternate" type="text/html" title="Reviews from MUD2014" /><published>2015-01-09T00:00:00+00:00</published><updated>2015-01-09T00:00:00+00:00</updated><id>https://christop.club/2015/01/09/reviews-from-mud2014</id><content type="html" xml:base="https://christop.club/2015/01/09/reviews-from-mud2014/"><![CDATA[<p>To keep up with practicing some <a href="http://en.wikipedia.org/wiki/Open_Science">open
science</a>, here are the reviews to
the MUD’2014 paper I “recently” published.</p>

<p>You can find a link to the PDF, code, slides, and talk in my
<a href="/publications">publications</a>.</p>

<h2 id="review-1">Review #1</h2>

<p>This paper describes an evaluation of the inputs to LDA topic models.  Topic models are a very valuable tool in software engineering research, and too often they are used without much configuration.  This paper presents a study of an aspect of this configuration, to help other researchers: whether change sets or “snapshots” produce more-distinct topics and use the same vocabulary.</p>

<p>The authors found mixed results, in that the changesets did seem to result in more-distinct topics for 2 systems, but in the other 2 systems, there were no noticeable differences.  Likewise, the vocabulary used in the changesets was measurably different than the snapshots.</p>

<p>While the scale of the study is small, and the results somewhat mixed, the paper does have the capacity to cause good discussion at the workshop, given the importance of topic models in SE.  For example, SE researchers can take guidance from this paper that it may be necessary to try both changesets and snapshots.</p>

<p>The chief improvement to this paper would be to increase the number of programs studied.  With more systems, it might be possible for the paper to recommend one or the other dataset more strongly.</p>

<h2 id="review-2">Review #2</h2>

<p>Summary:
The paper investigates whether topics extracted from changesets are different from topics extracted from snapshots. The study was performed on four systems, and the authors exploited LDA to extract topics. Results are somewhat inconsistent across the four object systems.</p>

<p>Evaluation:
The paper is well written and easy to follow. The posed research questions make sense, and the paper’s topic is certainly of interest to the MUD audience. However, I am not sure what I can learn from such a paper.</p>

<p>I mean, I cannot understand how the findings reported in the paper can be used in any SE application or can impact the way of conducting empirical SE studies. The authors should spend some words (during the results discussion and the conclusions) explaining why their findings are of interest to the research community. For instance, what should I learn from the fact that the cosine distance between the two corpora (i.e., changeset and release) is very small for three out of the four systems? Does PostgreSQL have something special? The authors could remove Figure 3 (not useful at all) and use the saved space to better present and discuss the implications behind their findings.</p>

<h2 id="review-3">Review #3</h2>
<p>Desc.:</p>

<p>Most bug localization, feature location, and link traceability studies extract topics from one snapshot of a software repository. Rather than extracting topics from one snapshot, another alternative is to extract topics from the differences (lines added and lines removed) between two consecutive revisions in a repository. The paper extracts topics this way and evaluates the quality of the resultant topics using the concept of topic distinctness. To extract changeset topics, several steps are performed: first, git diff is used to get the changeset; second, tokens are extracted and split based on camel case, underscores, and non-letters; third, stop words are removed; finally, the documents are input to an LDA implementation (Gensim’s LDA). An experiment on changeset corpora from 4 systems (Ant, AspectJ, Joda-Time, and PostgreSQL) has been performed. The experiment shows that for two of the systems, the words that appear in a changeset corpus are similar to words that appear in a corpus extracted from one snapshot of a software repository (release corpora). Furthermore, for two out of the four systems, the topics extracted from a changeset corpus have higher topic distinctness scores than topics extracted from a release corpus.</p>

<p>Pros:</p>

<ul>
  <li>The paper analyzes 4 software systems and compares the topics extracted from changeset corpus and release corpus using topic distinctness.</li>
  <li>Experiment shows that at least for some software systems word distribution in a changeset corpus are rather different than word distribution in a release corpus (cosine distance of 0.3 or higher).</li>
  <li>Experiment shows that in two software systems the topic distinctness score of topics extracted from a changeset corpus is higher than the topic distinctness score of topics extracted from a release corpus.</li>
</ul>

<p>Comments for Improvement:</p>

<ul>
  <li>
<p>It seems Thomas et al. have also modelled changeset topics before (Reference [3]). It is not clear what the differences are between Thomas et al.’s approach and the proposed approach. The paper states: “we find similar topic distinctness scores” and “our approach is feasible, as it captures distinct topics while not needing post-processing and is always up-to-date with the source code repository”. What kind of post-processing was performed by Thomas et al.’s approach that is not performed by the proposed approach? Is it bad to perform post-processing? Can’t Thomas et al.’s approach generate topics that are up-to-date with a source code repository? Please elaborate more. If the technical difference between the paper and Thomas et al.’s approach is small, it is better to reposition the paper as a replication study. It seems the paper investigates more systems, and the findings provide additional insights not provided by Thomas et al.’s paper.</p>
  </li>
  <li>
    <p>It will be good to add some additional details to the paper to answer the following questions:</p>
  </li>
  <li>(Section IIID) Does a higher topic distinctness score indicate a better set of extracted topics? Please elaborate more.</li>
  <li>(Section IIIE) After the encoding errors are removed, are the words that appear in a release corpus always the same as the words that appear in a changeset corpus? Why do the encoding errors only affect one of the corpora but not both?</li>
  <li>(Section IIIE) Please explain more how cosine distance is computed. Cosine similarity is well known, but cosine distance is not so well known.</li>
  <li>(Section IIIE) Please provide more insight on the cosine distance scores. Is a cosine distance of 0.00396 good or bad? Why do some systems have a much higher cosine distance score than others (e.g., 0.33957 vs. 0.00396)?</li>
  <li>
    <p>(Section IIIE) “Ant and PostgreSQL have drastically more documents in their respective change set corpora than Joda-Time and AspectJ” It is good to also mention how many documents are in the change set corpus of each of the four systems.</p>
  </li>
  <li>
    <p>There are many other studies that use topic modelling for software maintenance; it will be good to add them to the related work section especially those that use topic model for bug localization, feature location, or traceability link recovery which is the motivation of the work (as stated in the abstract), e.g.,:</p>
  </li>
  <li>Stacy K. Lukins, Nicholas A. Kraft, Letha H. Etzkorn: Bug localization using latent Dirichlet allocation. Information &amp; Software Technology 52(9): 972-990 (2010)</li>
  <li>Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, David Lo, Chengnian Sun: Duplicate bug report detection with a combination of information retrieval and topic modeling. ASE 2012: 70-79</li>
  <li>Tien-Duy B. Le, Shaowei Wang, David Lo: Multi-abstraction Concern Localization. ICSM 2013: 364-367</li>
</ul>]]></content><author><name></name></author><category term="reviews" /><category term="mining software repositories" /><category term="mining unstructured data" /><category term="open science" /><category term="lda" /><category term="topic models" /><summary type="html"><![CDATA[To keep up with practicing some open science, here are the reviews to the MUD’2014 paper I “recently” published.]]></summary></entry><entry><title type="html">Reviews from MSR2014</title><link href="https://christop.club/2014/07/16/reviews-from-msr2014/" rel="alternate" type="text/html" title="Reviews from MSR2014" /><published>2014-07-16T00:00:00+00:00</published><updated>2014-07-16T00:00:00+00:00</updated><id>https://christop.club/2014/07/16/reviews-from-msr2014</id><content type="html" xml:base="https://christop.club/2014/07/16/reviews-from-msr2014/"><![CDATA[<p>I’ve been reviewing some papers for the <a href="http://icsme.org/">ICSME 2014</a>
tool demo track, and it occurred to me that I could post my own
reviews from previous published papers.
This will (hopefully) offer some insight to fledgling researchers
(cough cough, me) on what a short paper review
would roughly contain.</p>

<p>So, here goes.</p>

<p>“New Features for Duplicate Bug Detection” was a study conducted
by an <a href="http://reu.cs.ua.edu">REU</a> student over the summer of 2013,
with mentoring and guidance from <a href="http://nkraft.cs.ua.edu/">Dr. Kraft</a> and myself.
Here is a link to the <a href="http://cscorley.students.cs.ua.edu/publications/pdfs/Klein-etal_14.pdf">preprint [PDF]</a>.
We submitted this to <a href="http://2014.msrconf.org/">MSR 2014</a> short paper
track, and it was accepted.</p>

<p>Below are the three reviews this paper received.
Note that these reviews were for the submission, and these
comments were geared toward that copy.
I no longer have that submitted copy anywhere that I can find,
but these reviews should give you an idea.</p>

<p><em>Note: I’ve slightly modified these with whitespace so that they render in markdown</em></p>

<h2 id="review-1">Review #1</h2>

<p>Summary: The paper proposes a technique that predicts if a pair of bug reports is a duplicate pair or not. It extends the previous work by Alipour et al. by introducing additional features that are based on the differences in the words, topics, priority, reporting time, and components of two bug reports. Several machine learning algorithms from Weka have been used to investigate the effectiveness of the proposed features. Experiments have been performed on the same Android bug report dataset as Alipour et al. The results of the experiments show that the proposed features could improve the result of Alipour et al.’s method by 3.33%, 7.24%, and 11.76% in terms of Accuracy, AUC, and Kappa.</p>

<p>Recommendation: Weak Accept</p>

<p>Pros:</p>

<ul>
  <li>A number of new features have been proposed. These features capture differences between two bug reports in terms of their words, topics, priority, reporting time, and component.</li>
  <li>Experiments using 6 classifiers have been conducted to demonstrate the value of the proposed features.</li>
  <li>The experiments on the Android datasets show that the proposed approach could improve Alipour et al.’s approach by 3.33%, 7.24%, and 11.76% in terms of Accuracy, AUC, and Kappa.</li>
</ul>

<p>Suggestion for improvement:</p>

<ul>
  <li>
    <p>Reference [2] is not the paper referred to by Alipour et al. It should be changed to:</p>

    <p>Chengnian Sun, David Lo, Siau-Cheng Khoo, Jing Jiang: Towards more accurate retrieval of duplicate bug reports. ASE 2011: 253-262</p>

    <p>Reference [2] is related to your proposed approach though since it also uses topic modelling. It has not been compared with Alipour et al.’s method. Thus please refer to it too and mention the differences between your approach and the paper, e.g., difference in setting (see next comment).</p>
  </li>
  <li>
    <p>The setting that your paper considers and the setting considered by Sun et al.’s approach are different. I think there is a need to highlight the difference in the paper.</p>

    <p>In Sun et al.’s approach, the setting is: given a bug report, return a list of top-k most similar bug reports.</p>

    <p>In your approach (and Alipour et al.’s approach), the setting is: given two bug reports, predict if they are a duplicate of each other or not.</p>

    <p>Alipour et al.’s setting is first considered in the following paper:</p>

    <p>David Lo, Hong Cheng, Lucia: Mining closed discriminative dyadic sequential patterns. EDBT 2011: 21-32 (See Case Study section)</p>

    <p>This setting is actually easier, since it is easier to differentiate between “two completely random bug reports” and “duplicate bug reports”, than to differentiate between “two similar bug reports that are not duplicate of each other” and “two similar bug reports that are duplicate of each other”.</p>
  </li>
  <li>Please describe more about the evaluation metrics (i.e., Accuracy, AUC, and Kappa). In particular, please describe Kappa since it is not a very frequently used metric.</li>
  <li>Please add a related work section that more comprehensively describes work in the area of duplicate bug report detection.</li>
  <li>“Alipour et al” =&gt; “Alipour et al.”</li>
  <li>“International Workshop on Mining Software Repository” =&gt; “Working Conference on Mining Software Repositories”</li>
  <li>Weimar et al =&gt; Please add a reference …</li>
  <li>I think a better title could have been: “New Features for Duplicate Bug Detection”.</li>
</ul>

<p>In general, I have no major concern with the paper. The writing could be improved in a number of ways though. There is still one more page that the authors can use to improve the writing.</p>

<h2 id="review-2">Review #2</h2>

<p>The paper proposes a new set of features to identify duplicate bugs. The efficacy of these new metrics/features is evaluated using 6 machine learning algorithms from Weka. The paper builds on the work of Alipour et al. and uses the same Android bug dataset. The experiments indicate that these new features result in an improvement in accuracy compared to Alipour et al.’s for all 6 learners considered.</p>

<p>Though the paper is not significantly novel, the idea of considering the first shared identical topic seems new. The results, at least for the Android data set, seem encouraging.</p>

<p>That said, it is generally necessary to evaluate a new metric rigorously and on several benchmark data sets before we claim that the metric is better. Since using shared identical topics seems to make sense intuitively, this is OK for a short paper.</p>

<p>Few suggestions:</p>

<ol>
  <li>
    <p>Since you have 1 additional page, and you use the same data set as Alipour et al., it would be good to show some examples of pairs of bug reports that are actually duplicates and could not be detected by Alipour et al.’s approach but were detected using your new metrics. And also vice-versa. This will also help describe your new metrics in more detail with examples.</p>
  </li>
  <li>
    <p>Ideally, one would want to know the efficacy of each <em>individual</em> metric. Which of your metrics would have the best performance? Can we rank them?</p>
  </li>
  <li>
    <p>There is a problem with Table III. You say that you added REPTree that Alipour et al. did not use, but then you show the performance improvement over Alipour’s metric. This needs some additional explanation.</p>
  </li>
  <li>
    <p>In Section V, you mention Weimar et al. but do not provide any reference.</p>
  </li>
  <li>
    <p>The exposition can be improved. The new metrics (especially the one that you think is very novel) should be highlighted in the introduction itself. Currently, one needs to read till page 2 to figure out that the attributes in Table 1 are the new metrics you refer to in the paper.</p>
  </li>
</ol>

<h2 id="review-3">Review #3</h2>

<p>The authors replicate a prior bug deduplication study and apply new
metrics and evaluate performance. They improve performance across a
wide range of learners and provide a new learner that works as well.
Furthermore their technique is far more generalized than the work they
replicate and thus is more automatable.</p>

<p>First and foremost, I think these incremental improvements in mining
are actually best served in short form. I think the length of this
submission is almost appropriate (although you had an extra page for,
say, better descriptions of the results or more comparison).</p>

<p>Second, they argue they have a considerable improvement over other techniques.</p>

<p>Third, they don’t make it clear in their paper, but their results all
suggest that LDA-based comparisons improve bug deduplication
performance far more than priority, time, component, or bug type,
implying that, at least in Android, the metadata is poor.</p>

<p>Questions:</p>

<ul>
  <li>
    <p>In Table III, what was REPTree compared to?</p>
  </li>
  <li>
    <p>Clarify this statement, you have space: To protect the validity of
our study, we ensured that no two pairs contained identical reports.</p>
  </li>
</ul>

<p>Issues:</p>

<ul>
  <li>
    <p>I think the wrong style was used for this submission: the expected
style is sig-alternate and you’re using something else; for instance,
numbering is in roman numerals.</p>
  </li>
  <li>
    <p>There’s an extra page to go…</p>
  </li>
  <li>
    <p>I think in Alipour et al. and in this study that the application of
KNN is inappropriate. While it works, I think it violates the triangle
inequality.</p>
  </li>
  <li>
    <p>I want to see some time and space given to describing what was in
your true positives and false negatives and true negatives and false
positives.</p>
  </li>
</ul>

<p>Conclusions:</p>

<p>I think it is a nice short replication. A little more presentation
work would be appreciated but these features are easy to calculate and
easy to integrate into any deduper framework.</p>]]></content><author><name></name></author><category term="reviews" /><category term="mining software repositories" /><category term="open science" /><category term="lda" /><category term="bugs" /><summary type="html"><![CDATA[I’ve been reviewing some papers for the ICSME 2014 tool demo track, and it occurred to me that I could post my own reviews from previous published papers. This will (hopefully) share some insight to fledgling researchers (cough cough, me) on what a short paper review would roughly contain.]]></summary></entry><entry><title type="html">Using Gensim for LDA</title><link href="https://christop.club/2014/05/06/using-gensim-for-lda/" rel="alternate" type="text/html" title="Using Gensim for LDA" /><published>2014-05-06T00:00:00+00:00</published><updated>2014-05-06T00:00:00+00:00</updated><id>https://christop.club/2014/05/06/using-gensim-for-lda</id><content type="html" xml:base="https://christop.club/2014/05/06/using-gensim-for-lda/"><![CDATA[<p>This is a short tutorial on how to use Gensim for LDA topic modeling.
What is topic modeling? It is basically taking a number of documents (news
articles, wikipedia articles, books, &amp;c) and sorting them out into different
topics. For example, documents on Babe Ruth and baseball should end up in the
same topic, while <a href="https://en.wikipedia.org/wiki/Dennis_Rodman">Dennis Rodman</a> and basketball should end up in another.</p>

<p>LDA is an extension of LSI/pLSI using some crazy statistical stuff.
Most of that will not matter to us since we aren’t implementing LDA.
One important thing to consider about LDA, however, is that it is a
<a href="https://en.wikipedia.org/wiki/Mixture_model">mixture model</a>, which is statistical mumbojumbo for “documents can be
associated with more than one topic.” That is, an article about Dennis Rodman
could be related to multiple topics: basketball, tattoos, and crazy hair colors.</p>

<p>Right now, Gensim is in the process of being ported to Python 3.
This tutorial is written for Gensim 0.9.1.
I’ll assume that you’ve got Gensim installed and working on Python 2 already.</p>

<p>Let’s start, go ahead and import gensim:</p>

<div class="in_prompt">in [1]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
<span class="kn">import</span> <span class="nn">gensim</span></code></pre></figure>

<p>In LDA, we infer a certain number of topics from a given corpus.
I prefer the Mallet format for corpora,
namely because each document has an associated document name or id.
Other formats require you to maintain this separately with a key file,
but that’s just dumb.</p>

<p>I’ve got handy a corpus of every title (already preprocessed) of the Android
issue report database.
You can download that <a href="https://drive.google.com/open?id=0BxrXGxfAKIwfUjVuSnhSVVBTZVU">here</a>.</p>

<p>Here are the first three lines (aka the first three documents (aka the first
three issue report titles))
of the corpus file:</p>

<div class="in_prompt">in [2]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">head</span> <span class="o">-</span><span class="mi">3</span> <span class="n">android</span><span class="p">.</span><span class="n">mallet</span></code></pre></figure>

<div class="output_prompt">out [2]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 en incorrect url address project
2 en good luck
3 en http proxy support
</code></pre></div></div>
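<p>As the output shows, each Mallet-format line is just a document id, a language tag, and then the preprocessed tokens. If you ever need to peek at one without Gensim, a throwaway parser is enough (a sketch, not part of Gensim’s API):</p>

```python
def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, language, tokens)."""
    parts = line.split()
    return parts[0], parts[1], parts[2:]

print(parse_mallet_line('1 en incorrect url address project'))
# prints ('1', 'en', ['incorrect', 'url', 'address', 'project'])
```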

<p>Luckily, Gensim supports reading this format directly!
So, let’s load up our corpus into something Gensim can use internally:</p>

<div class="in_prompt">in [3]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">corpus</span> <span class="o">=</span> <span class="n">gensim</span><span class="p">.</span><span class="n">corpora</span><span class="p">.</span><span class="n">MalletCorpus</span><span class="p">(</span><span class="s">'android.mallet'</span><span class="p">)</span></code></pre></figure>

<p>This might take awhile, because it is building some metadata about the corpus
itself.</p>

<p>Typically, you would use the corpus in a loop like so:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">document</span> <span class="ow">in</span> <span class="n">corpus</span><span class="p">:</span>
    <span class="n">blah</span><span class="p">(</span><span class="n">document</span><span class="p">)</span>
</code></pre></div></div>

<p>But, just for our purposes, let’s look at the first document it’s holding:</p>

<div class="in_prompt">in [4]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">corpus</span><span class="p">))</span></code></pre></figure>

<div class="output_prompt">out [4]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(6936, 1), (15314, 1), (300, 1), (10981, 1)]
</code></pre></div></div>

<p>Um, what? That doesn’t look anything like the first document from before.
That’s because this is the internal representation Gensim (and all of its
modeling algorithms) uses.
This is a document, but instead of a list of words, it is a list of tuples where
each tuple is
a <em>word id</em> and frequency pair.</p>

<p>So we can see word #6936 appears 1 time in the first document.
But what is word #6936?
Again, let’s do that crazy <code class="language-plaintext highlighter-rouge">next(iter(</code> business so we don’t end up going over
every document in the corpus.
Check this out:</p>

<div class="in_prompt">in [5]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">word_id</span><span class="p">,</span> <span class="n">freq</span> <span class="ow">in</span> <span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">corpus</span><span class="p">)):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">corpus</span><span class="p">.</span><span class="n">id2word</span><span class="p">[</span><span class="n">word_id</span><span class="p">],</span> <span class="n">freq</span><span class="p">)</span></code></pre></figure>

<div class="output_prompt">out [5]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>incorrect 1
url 1
address 1
project 1
</code></pre></div></div>

<div class="in_prompt">in [6]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">head</span> <span class="o">-</span><span class="mi">1</span> <span class="n">android</span><span class="p">.</span><span class="n">mallet</span></code></pre></figure>

<div class="output_prompt">out [6]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 en incorrect url address project
</code></pre></div></div>

<p>Badass, yeah?</p>

<p><br /><br /></p>

<p>Okay, not really, that’s not very interesting.
I did something a little different here, and that’s using the <code class="language-plaintext highlighter-rouge">corpus.id2word</code>
attribute.
It’s simply a Python dictionary that maps <code class="language-plaintext highlighter-rouge">id-&gt;word</code> for all words in the
corpus.</p>

<p>Alright, let’s actually generate a model (go ahead and get a sandwich, it’ll be
a minute):</p>

<div class="in_prompt">in [7]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span> <span class="o">=</span> <span class="n">gensim</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">LdaModel</span><span class="p">(</span><span class="n">corpus</span><span class="p">,</span> <span class="n">id2word</span><span class="o">=</span><span class="n">corpus</span><span class="p">.</span><span class="n">id2word</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="s">'auto'</span><span class="p">,</span> <span class="n">num_topics</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'android.lda'</span><span class="p">)</span>
<span class="c1">#model = gensim.models.LdaModel.load('android.lda')</span></code></pre></figure>

<p>We can save/load the model for later use
instead of having to rebuild it every time, as shown in the comment.
As much as I enjoy sandwiches, I don’t want to do this all the time.</p>

<p>There are a couple of parameters other than the corpus that I’ve set there.
Let’s talk about those for a sec:</p>

<ol>
  <li><strong>id2word</strong>: Although you can build a model from just a corpus, I’ve gone
ahead and let the LdaModel know about the <code class="language-plaintext highlighter-rouge">corpus.id2word</code>.
It just makes some of the things I’ll show you next nicer.</li>
  <li><strong>alpha</strong>: This particular LDA implementation uses something that can
automatically update the <code class="language-plaintext highlighter-rouge">alpha</code> value for us.
This determines how ‘smooth’ the model is, which makes no damned sense if you
aren’t working in the area (it doesn’t make much sense to me).
Here’s what alpha does: as it gets smaller, each document is going to be <em>more
specific</em>, i.e., likely to be made up of only a few topics. As it gets
bigger, a document can begin to appear in multiple topics, which is what we want.
It’s not good to have a large alpha either, because then all our topics will
start intermingling and making out and that’s gross.
I have no idea how the <code class="language-plaintext highlighter-rouge">'auto'</code> setting really works, but it seems pretty legit
to me so I’ll just use that for now.</li>
  <li><strong>num_topics</strong>:
The <code class="language-plaintext highlighter-rouge">num_topics</code> parameter just determines how many topics we want the model to
give us.
I’ve used 25 here since we are only looking at a corpus of titles.</li>
</ol>
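<p>To build a little intuition for what <code class="language-plaintext highlighter-rouge">alpha</code> does, you can sample document–topic mixtures from a symmetric Dirichlet yourself using the standard normalized-gamma trick. This is a standalone illustration of the prior, not Gensim’s implementation:</p>

```python
import random

def dirichlet_sample(alpha, num_topics, rng):
    """Draw one document's topic mixture from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
sparse = dirichlet_sample(0.01, 5, rng)  # small alpha: a few topics dominate
smooth = dirichlet_sample(10.0, 5, rng)  # large alpha: topics share the mass
```

<p>With a tiny alpha nearly all of a document’s probability mass tends to land on one or two topics; with a large alpha each topic tends toward an even 1/5 share, which is the “intermingling” described above.</p>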

<p>Let’s look at a few random topics:</p>

<div class="in_prompt">in [8]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">show_topics</span><span class="p">(</span><span class="n">topics</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span></code></pre></figure>

<div class="output_prompt">out [8]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['0.047*link + 0.027*ui + 0.018*main + 0.017*level + 0.016*locale',
 '0.107*tap + 0.047*popup + 0.045*appears + 0.031*request + 0.029*tab',
 '0.120*play + 0.096*ics + 0.084*music + 0.049*bug + 0.030*android',
 '0.106*device + 0.078*google + 0.060*talk + 0.057*voice + 0.044*icon',
 '0.191*screen + 0.055*button + 0.034*change + 0.032*page + 0.032*lock']
</code></pre></div></div>

<p>These are the top 5 words associated with 5 random topics.
The decimal number is the <em>weight</em> of the word it is multiplying,
i.e., how much does this word influence the particular topic.
The model knows how to do this because we gave it the <code class="language-plaintext highlighter-rouge">id2word</code> dictionary.
Without it, we wouldn’t be able to read this output (still).</p>
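<p>If you’d rather work with those weights programmatically than squint at the strings, they split apart easily (a quick sketch against the string format shown above):</p>

```python
def parse_topic(topic_str):
    """Turn a topic string like '0.120*play + 0.096*ics' into
    a list of (weight, word) pairs."""
    pairs = []
    for term in topic_str.split(' + '):
        weight, word = term.split('*')
        pairs.append((float(weight), word))
    return pairs

print(parse_topic('0.120*play + 0.096*ics + 0.084*music'))
# prints [(0.12, 'play'), (0.096, 'ics'), (0.084, 'music')]
```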

<p>Now, let’s do something actually useful: query the model.</p>

<p>Let’s say we would like to know which topics a certain string is most associated
with.</p>

<div class="in_prompt">in [9]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">query</span> <span class="o">=</span> <span class="s">'google maps broken navigation'</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">query</span><span class="p">.</span><span class="n">split</span><span class="p">()</span>
<span class="n">query</span></code></pre></figure>

<div class="output_prompt">out [9]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['google', 'maps', 'broken', 'navigation']
</code></pre></div></div>

<p>We query the model by indexing it with our query!
But first, we need to transform it into a representation the model understands.
We can’t just do this (yet):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">[</span><span class="n">query</span><span class="p">]</span>
</code></pre></div></div>
<p>That will definitely cause us some heartache, because the query is just words.
LDA technically knows nothing about the actual words, just the ids we’ve given
them.</p>

<p>So, let’s build something to translate those words back to ids and their
frequencies.
Gensim has an awesome built in way of doing this called a Dictionary.
Sure, we <em>could</em> use regular old Python <code class="language-plaintext highlighter-rouge">dict</code>s to map <code class="language-plaintext highlighter-rouge">id-&gt;word</code> and build the
<code class="language-plaintext highlighter-rouge">(word, frequency)</code> pairs ourselves,
but I’m a fancy person that enjoys fancy things.</p>
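<p>For the curious, the plain-<code class="language-plaintext highlighter-rouge">dict</code> version is only a few lines. The ids below are made up for illustration; the real mapping lives in the corpus dictionary:</p>

```python
from collections import Counter

# Hypothetical word -> id mapping; the real one comes from corpus.id2word.
word2id = {'google': 1754, 'maps': 6081, 'broken': 8441, 'navigation': 9208}

def doc2bow(words, word2id):
    """Convert a token list into sorted (word_id, frequency) pairs,
    dropping any word the mapping has never seen."""
    counts = Counter(w for w in words if w in word2id)
    return sorted((word2id[w], n) for w, n in counts.items())

print(doc2bow(['google', 'maps', 'broken', 'navigation'], word2id))
# prints [(1754, 1), (6081, 1), (8441, 1), (9208, 1)]
```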

<p>Here’s what we do:</p>

<div class="in_prompt">in [10]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">id2word</span> <span class="o">=</span> <span class="n">gensim</span><span class="p">.</span><span class="n">corpora</span><span class="p">.</span><span class="n">Dictionary</span><span class="p">()</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">id2word</span><span class="p">.</span><span class="n">merge_with</span><span class="p">(</span><span class="n">corpus</span><span class="p">.</span><span class="n">id2word</span><span class="p">)</span></code></pre></figure>

<p>This creates an empty special Dictionary, and then we merge our original corpus
dictionary into it. Whatever merge_with returns isn’t important to us, so throw
it in the Python garbage bin, underscore.</p>

<p>This doesn’t seem to gain us much, until we want to translate an entire document
into <code class="language-plaintext highlighter-rouge">(word, frequency)</code> pairs:</p>

<div class="in_prompt">in [11]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">query</span> <span class="o">=</span> <span class="n">id2word</span><span class="p">.</span><span class="n">doc2bow</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="n">query</span></code></pre></figure>

<div class="output_prompt">out [11]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(1754, 1), (6081, 1), (8441, 1), (9208, 1)]
</code></pre></div></div>

<div class="in_prompt">in [12]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">[</span><span class="n">query</span><span class="p">]</span></code></pre></figure>

<div class="output_prompt">out [12]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(3, 0.20387470260323143),
 (9, 0.35862973787398261),
 (15, 0.010585652382570768),
 (16, 0.010899567346349904),
 (18, 0.011132829837161632),
 (21, 0.22968681811101002),
 (22, 0.010344492016793241),
 (23, 0.010589823218917306),
 (24, 0.010154742173706556)]
</code></pre></div></div>

<p><em>Note: your results absolutely should differ slightly from mine, given
the probabilistic nature of the model</em></p>

<p>Awwwwww yeahhhhhhhhhhh.
Now we’re <em>cookin’ with gas</em>.</p>

<p>From this list, we have each topic and the likelihood that the <code class="language-plaintext highlighter-rouge">query</code> relates
to that topic.
So, if we sort this a little more meaningfully:</p>

<div class="in_prompt">in [13]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">model</span><span class="p">[</span><span class="n">query</span><span class="p">],</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span></code></pre></figure>

<div class="output_prompt">out [13]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(24, 0.010154742173743013)
(9, 0.35859622416422271)
</code></pre></div></div>

<p>We can see the least related and the most related topic for our query.
Let’s check out what words are most associated with those two topics.</p>

<div class="in_prompt">in [14]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">print_topic</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="c1">#least related</span></code></pre></figure>

<div class="output_prompt">out [14]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'0.063*apps + 0.062*wifi + 0.058*calendar + 0.044*exchange + 0.035*changing + 0.030*location + 0.027*latitude + 0.024*automatically + 0.021*event + 0.020*disappears'
</code></pre></div></div>

<div class="in_prompt">in [15]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">print_topic</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="c1">#most related</span></code></pre></figure>

<div class="output_prompt">out [15]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'0.155*maps + 0.086*issue + 0.054*google + 0.034*android + 0.031*unlock + 0.024*books + 0.020*coming + 0.016*failed + 0.015*note + 0.013*word'
</code></pre></div></div>

<p>So, the first one looks like garbage for our query, but the second seems to be
mostly about the Google specific applications, including maps! Not the best
results, so this model’s number of topics probably needs to be a bit higher, or
<code class="language-plaintext highlighter-rouge">alpha</code> values played with until results pan out.</p>

<p>Note how our initial query only returned nine or so related topics. Didn’t we
ask for 25 of them? Well, we did, but Gensim defaults to only showing the top
ones that meet a certain threshold (<code class="language-plaintext highlighter-rouge">&gt;= 0.01</code>). Digging deeper than that is
ugly, so for now we will just deal with these results.</p>

<p>I am getting pretty tired of looking at this, so I think this will conclude the
tutorial on using Gensim’s LDA stuff for now. Go ahead and try out this code for
yourself.</p>

<p>This notebook on “Using Gensim for LDA” is available for download
<a href="/notebooks/Using Gensim for LDA.ipynb">here</a>.</p>]]></content><author><name></name></author><category term="python" /><category term="topic modeling" /><category term="gensim" /><category term="lda" /><summary type="html"><![CDATA[This is a short tutorial on how to use Gensim for LDA topic modeling. What is topic modeling? It is basically taking a number of documents (new articles, wikipedia articles, books, &amp;c) and sorting them out into different topics. For example, documents on Babe Ruth and baseball should end up in the same topic, while Dennis Rodman and basketball should end up in another.]]></summary></entry><entry><title type="html">Blogging with IPython and Jekyll</title><link href="https://christop.club/2014/02/21/blogging-with-ipython-and-jekyll/" rel="alternate" type="text/html" title="Blogging with IPython and Jekyll" /><published>2014-02-21T00:00:00+00:00</published><updated>2014-02-21T00:00:00+00:00</updated><id>https://christop.club/2014/02/21/blogging-with-ipython-and-jekyll</id><content type="html" xml:base="https://christop.club/2014/02/21/blogging-with-ipython-and-jekyll/"><![CDATA[<p>Lately I’ve been using <a href="http://ipython.org/">IPython</a> to do most of my tinkering work.
It’s pretty neat, to say the least.</p>

<p>I’ve seen people around the Internet using IPython as a way to blog.
I thought this would be a pretty neat way to go about things, and would
probably save a large amount of time when editing code-centric blog posts.
However, the methods I found were either outdated,
output raw HTML (usually with gross CSS conflicts),
were hacks for other blogging software, or required a plugin.</p>

<p>Since I use <a href="http://pages.github.com/">Github Pages</a> (read: <a href="http://jekyllrb.com/">Jekyll</a>) to auto-render my blog, I
decided to code up my own method.
It outputs Markdown files with the Jekyll front matter pre-filled.
This way, I can still add blog posts in the same format as before and edit
them if needed.
No plugins are required this way, either. Sure, it means converting notebooks
manually, but that’s pretty much the only way to get around the plugin issue
<em>and</em> still be able to use Github Pages.</p>

<p>Here are the files you will need to publish a notebook to Jekyll: <a href="https://gist.github.com/cscorley/9144544">https://gist.github.com/cscorley/9144544</a></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">jekyll.py</code>: This is the config file used for conversion. It should be placed
wherever the profile you are using is. Default is <code class="language-plaintext highlighter-rouge">~/.ipython/profile_default/</code></li>
  <li><code class="language-plaintext highlighter-rouge">jekyll.tpl</code>: I plop all my template files into <code class="language-plaintext highlighter-rouge">~/.ipython/templates</code>, but
put <code class="language-plaintext highlighter-rouge">jekyll.tpl</code> wherever suits you best (just be sure to change the jekyll.py
to point to that location, also)</li>
</ul>

<p>Everything will output into a folder named <code class="language-plaintext highlighter-rouge">notebooks</code>.
You can change this by replacing every instance of ‘notebooks’ in the config
with whatever you want.</p>

<p>One variable in the config, <code class="language-plaintext highlighter-rouge">BLOG_DIR</code>,
controls where the markdown and any support files the notebook needs
are generated.
It reads from the environment variable of the same name, so you will need to
export <code class="language-plaintext highlighter-rouge">$BLOG_DIR</code> to use
this script as-is.
If you just want files to land in the current directory, change it in the
config to an empty string.</p>
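<p>If you’re curious, the heart of what <code class="language-plaintext highlighter-rouge">jekyll.py</code> does with this variable boils down to something like the following sketch (a sketch only, not the actual config; the <code class="language-plaintext highlighter-rouge">NOTEBOOK_DIR</code> name is illustrative):</p>

```python
import os

# Read the blog's root from the environment; fall back to the current
# directory (empty string) if $BLOG_DIR is not exported.
BLOG_DIR = os.environ.get('BLOG_DIR', '')

# Converted markdown and any support files land under a 'notebooks' folder.
NOTEBOOK_DIR = os.path.join(BLOG_DIR, 'notebooks')
```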

<p>Finally, you can now run your conversion <em>on a single file</em> with the command:
<code class="language-plaintext highlighter-rouge">ipython nbconvert --config jekyll.py &lt;FILENAME&gt;</code>.</p>

<p>I did this whole <code class="language-plaintext highlighter-rouge">$BLOG_DIR</code> and <code class="language-plaintext highlighter-rouge">notebooks</code> mess because Jekyll was pooping out
whenever a markdown file appeared in the notebooks folder I was using. I also
wanted the notebooks folder so nbconvert would know where to place any support
files, and so Jekyll would blindly copy them into the generated site. Plus, it
gives me a place to put the notebook files themselves so they can be <a href="/notebooks/Blogging with IPython and Jekyll.ipynb">downloaded
directly</a>! Nice, yeah?</p>

<p>Here’s a shell function I wrote to convert a notebook file and then move any
markdown files created into the <code class="language-plaintext highlighter-rouge">_drafts</code> folder.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export BLOG_DIR="/Users/cscorley/git/cscorley.github.io"
nbconvert(){
    ipython nbconvert --config jekyll.py $@;
    find ${BLOG_DIR}/notebooks/ -name '*.md' -exec mv {} ${BLOG_DIR}/_drafts/ \;
    cp $@ ${BLOG_DIR}/notebooks/
}
</code></pre></div></div>

<p>That’s all. I just do <code class="language-plaintext highlighter-rouge">nbconvert FILE</code> now and it just works. Jekyll doesn’t
kill itself over it. When I’m done checking that the post is ready to go live, I
move it into the <code class="language-plaintext highlighter-rouge">_posts</code> folder. No big deal, right?</p>

<p>Below is some example code!</p>

<p><strong>In [1]:</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Pizza</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">toppings</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">toppings</span> <span class="o">=</span> <span class="n">toppings</span>
        
    <span class="k">def</span> <span class="nf">is_yummy</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>

<span class="n">p</span> <span class="o">=</span> <span class="n">Pizza</span><span class="p">([</span><span class="s">'pineapple'</span><span class="p">,</span> <span class="s">'cheese'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">is_yummy</span><span class="p">())</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>

<p><strong>In [2]:</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Populating the interactive namespace from numpy and matplotlib
</code></pre></div></div>

<p>Some code copied from <a href="https://en.wikipedia.org/wiki/Matplotlib">Wikipedia</a>:</p>

<p><strong>In [3]:</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">a</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>

<p><img src="/notebooks/blogging-with-ipython-and-jekyll_files/blogging-with-ipython-and-jekyll_4_0.png" alt="png" /></p>]]></content><author><name></name></author><category term="python" /><category term="notebook" /><summary type="html"><![CDATA[Lately I’ve been using IPython to do most of my tinkering work. It’s pretty neat, to say the least.]]></summary></entry><entry><title type="html">Object Environment</title><link href="https://christop.club/2014/02/19/object-environment/" rel="alternate" type="text/html" title="Object Environment" /><published>2014-02-19T00:00:00+00:00</published><updated>2014-02-19T00:00:00+00:00</updated><id>https://christop.club/2014/02/19/object-environment</id><content type="html" xml:base="https://christop.club/2014/02/19/object-environment/"><![CDATA[<p>Students often have trouble grasping the difference between objects,
classes, and the variables which hold them. This article aims to explain
object oriented programming by example in Python.</p>

<h2 id="review">Review</h2>

<p>First, let us review a few things.</p>

<h3 id="variables">Variables</h3>

<p>To create a variable in Python, we simply need to assign it a value:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span></code></pre></figure>

<p>Let’s consider mapping these variables out as we go into something I’m
going to call an <em>environment</em>. Environments are simply tables that map
the known variables to their values. For example, the code above would
have the following environment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | int      | 10
        b           | str      | "Tacos"
</code></pre></div></div>

<p>That is, <code class="language-plaintext highlighter-rouge">a</code> is a variable that holds the integer 10. We can add new
variables to the environment at will.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">not_my_gpa</span> <span class="o">=</span> <span class="mf">4.0</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | int      | 10
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
</code></pre></div></div>

<p>That isn’t very interesting. Neither would be changing a variable.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
</code></pre></div></div>

<p>If we wanted to use a variable, then Python would have to look up its
value in the environment table.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="c1"># finds variable b and gives it to the 'print' function</span></code></pre></figure>

<p>Sometimes while debugging through a program, it is handy to keep an
environment table updated for each step of execution in the program.
This is known as <em>tracing a program</em>.</p>
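<p>In fact, Python keeps this table around as a real, inspectable dictionary: <code class="language-plaintext highlighter-rouge">globals()</code> returns the current environment’s name-to-value mapping, so you can watch it grow yourself.</p>

```python
a = [1, 2, 3]
b = "Tacos"
not_my_gpa = 4.0

env = globals()          # the environment table, as an actual dict
print(env['b'])          # the same lookup that print(b) performs: Tacos
print('not_my_gpa' in env)  # True
```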

<h3 id="functions">Functions</h3>

<p>Functions are little snippets of code that complete tasks for us. Say we
wanted to write a function that calculates the square of a number. It
might look like this:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">square</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">val</span> <span class="o">*</span> <span class="n">val</span></code></pre></figure>

<p>Now, some cool stuff happens here when we create <code class="language-plaintext highlighter-rouge">square</code>. First,
it is added to the environment table. Yep, <code class="language-plaintext highlighter-rouge">square</code> is pretty much just
a variable name.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
</code></pre></div></div>

<p>I’ve left the value empty because functions are special. Something <em>is</em>
there and it’s the body of the function.</p>

<p>Let’s call square and see what happens to our environment table.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">c</span> <span class="o">=</span> <span class="n">square</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span></code></pre></figure>

<p>There are several steps that happen here. First, we can see that the
result is going to be stored into a variable <code class="language-plaintext highlighter-rouge">c</code>, but we don’t actually
know <em>what</em> value yet. So, Python will evaluate the function call for
us. Whenever Python sees a variable name followed by some parentheses,
possibly with arguments such as <code class="language-plaintext highlighter-rouge">10</code>, it knows it’s got to do some stuff
for us.</p>

<p>Python will first retrieve the value of the variable <code class="language-plaintext highlighter-rouge">square</code> in our
environment. Then, it will execute the code associated with it (the value),
given the arguments. Something special happens with those
arguments: when the function is evaluated, they are set up in
<em>yet another environment table</em>, specifically for this <em>single</em> call to
<code class="language-plaintext highlighter-rouge">square</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |

Function call-&gt; square(10):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 10
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">square</code>
finishes up, it will return the value <code class="language-plaintext highlighter-rouge">100</code>, which we can then assign to
a new variable <code class="language-plaintext highlighter-rouge">c</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        c           | int      | 100
</code></pre></div></div>

<p>Note that the <code class="language-plaintext highlighter-rouge">square(10)</code> environment is destroyed because it is no longer
needed! If we called <code class="language-plaintext highlighter-rouge">square</code> again, a new environment would be created
specifically for that call and whatever argument we gave it.</p>
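<p>You can see the destruction for yourself: once <code class="language-plaintext highlighter-rouge">square</code> returns, its <code class="language-plaintext highlighter-rouge">val</code> variable no longer exists anywhere.</p>

```python
def square(val):
    return val * val

c = square(10)
print(c)        # 100

# val only lived inside the call's environment, which is now gone
try:
    val
except NameError:
    print("val is not defined out here")
```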

<p>Let’s look at another example:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">power_of_c</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
        <span class="n">z</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="n">c</span>
    <span class="k">return</span> <span class="n">z</span></code></pre></figure>

<p>Oh geez, this function is <em>drunk</em>. It uses something that is given as an
argument, creates its own variables, and even uses some outside of it.
How is that possible? It is possible through something known as
<em>scoping</em>. If we call <code class="language-plaintext highlighter-rouge">power_of_c</code>, an environment is created
specifically for it, just like when <code class="language-plaintext highlighter-rouge">square</code> was called.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">d</span> <span class="o">=</span> <span class="n">power_of_c</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100

Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
</code></pre></div></div>

<p>Now the function begins to execute. The first thing that happens is that
it creates a new variable, <code class="language-plaintext highlighter-rouge">z</code>, and gives it the value 1.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100

Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
                z           | int      | 1
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">z</code> is created <em>within</em> the <code class="language-plaintext highlighter-rouge">power_of_c(3)</code> environment. Next,
we begin our loop and start updating <code class="language-plaintext highlighter-rouge">z</code> with <code class="language-plaintext highlighter-rouge">z * c</code>. First loop
through <code class="language-plaintext highlighter-rouge">z</code> will become 100, since <code class="language-plaintext highlighter-rouge">c</code> is 100 and <code class="language-plaintext highlighter-rouge">1 * 100 == 100</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
                z           | int      | 100
</code></pre></div></div>

<p>A second time,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
                z           | int      | 10000
</code></pre></div></div>

<p>And I think we can see how this ends: with <code class="language-plaintext highlighter-rouge">z</code> holding the integer 1000000.
Finally, <code class="language-plaintext highlighter-rouge">power_of_c(3)</code> returns the value held within <code class="language-plaintext highlighter-rouge">z</code>, the
environment is destroyed, and our new variable is created.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100
        d           | int      | 1000000
</code></pre></div></div>

<p>But how did <code class="language-plaintext highlighter-rouge">power_of_c</code> know where to find <code class="language-plaintext highlighter-rouge">c</code> if it wasn’t in its
environment? It knows because the environments are <em>nested</em>, in a sense.
That is, if a variable does not exist within the innermost environment,
Python will try to look it up in the next environment up: the
environment that was in <em>scope</em> when our new environment was created,
which in our case is the main environment we started with. Let’s go
ahead and give that environment a name; how about <code class="language-plaintext highlighter-rouge">global</code>? Sounds good
to me.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100
        d           | int      | 1000000
</code></pre></div></div>

<p>This environment table is special to our program: it’s basically where
everything is going to be defined.</p>
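<p>One wrinkle worth noting while we have these tables in view: lookups fall through to the outer environment, but plain assignments never do. Assigning to a name inside a function creates a fresh variable in that call’s environment rather than updating the global one.</p>

```python
b = "Tacos"

def shadow():
    b = "Pizza"   # creates a new b in this call's environment
    return b

print(shadow())   # Pizza
print(b)          # Tacos, the global b was never touched
```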

<h2 id="classes-and-objects">Classes and Objects</h2>

<p>Alright, now that we’re good with how environments work, let’s finally
create some classes. Let’s start with a fresh, empty environment.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span>

<span class="k">class</span> <span class="nc">Fraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">=</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span> <span class="o">=</span> <span class="n">d</span></code></pre></figure>

<p>This class will represent a fraction. A fraction has two parts:
a numerator and a denominator. Now our global environment looks
something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
</code></pre></div></div>

<p>Again, I’ve left the value of the <code class="language-plaintext highlighter-rouge">Fraction</code> variable empty. Why?
Because it’s going to operate just like a function did in a sense. Let’s
make some stuff and see what happens!</p>

<p>To use a class, we call it just like we would a function:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">half</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span></code></pre></figure>

<p>Python knows what’s up when we do this, and handles “calling” the class
specially. First, we create a new <code class="language-plaintext highlighter-rouge">Fraction</code> with values 1 and 2. What
happens is that Python realizes we are trying to do a call on a class,
hands off everything to the constructor, known in Python as  <code class="language-plaintext highlighter-rouge">__init__</code>,
and calls it instead.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Create object--&gt; Fraction(1,2):
                Variable    | Type     | Value
                ------------------------------
                self        | object   | *
                n           | int      | 1
                d           | int      | 2
</code></pre></div></div>

<p>Or, more specifically:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Method call---&gt; Fraction.__init__(*, 1,2):
                Variable    | Type     | Value
                ------------------------------
                self        | object   | *
                n           | int      | 1
                d           | int      | 2
</code></pre></div></div>

<p>So, if you were like me back when I was first learning this stuff, you
are asking yourself, <em>“what the hell is <code class="language-plaintext highlighter-rouge">self</code> and why does <code class="language-plaintext highlighter-rouge">__init__</code>
get called with three parameters when I only gave Fraction two
arguments?</em>” It’s because the <code class="language-plaintext highlighter-rouge">self</code> parameter is going to be the object
we just created. Python is giving us a chance to <em>init</em>ialize some
values for this new object before it returns it and assigns it to the
variable <code class="language-plaintext highlighter-rouge">half</code>. (Real answer: mostly because Python is stupid.)</p>

<h3 id="what-the-hell-is-an-object">What the hell is an object?!</h3>

<p>Aye. Now we’re at the meat of the subject. An object is simply a thing.
Alright, cya next time!</p>

<p>&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;</p>

<p>Just kidding.</p>

<p>A handy thing to do is to think of objects as their own <em>environments</em>.
So, when <code class="language-plaintext highlighter-rouge">__init__</code> is called, it is given 1 and 2, and some object
we’ve named <code class="language-plaintext highlighter-rouge">self</code>. This <code class="language-plaintext highlighter-rouge">self</code> variable is just a reference to a new
environment table!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Method call---&gt; Fraction.__init__(*, 1,2):
            Variable    | Type     | Value
            ------------------------------
            self        | object   | *------\
            n           | int      | 1      |
            d           | int      | 2      |
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>Right now it’s empty, but that’s because <code class="language-plaintext highlighter-rouge">__init__</code> has just started to
execute. What does it do?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Fraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">=</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span> <span class="o">=</span> <span class="n">d</span></code></pre></figure>

<p>Hm. It uses some sort of dot notation to assign the arguments to
variables. Where are these variables created? Within <code class="language-plaintext highlighter-rouge">self</code>! Think of
that dot as “we must go deeper in the environments.”</p>

<!---- office space gif, way way down ----->

<p>First it creates a new variable <em>within</em> <code class="language-plaintext highlighter-rouge">self</code> named <code class="language-plaintext highlighter-rouge">numerator</code>, and
assigns it the value of <code class="language-plaintext highlighter-rouge">n</code>. Then the same for the <code class="language-plaintext highlighter-rouge">denominator</code> and
<code class="language-plaintext highlighter-rouge">d</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Method call---&gt; Fraction.__init__(*, 1,2):
            Variable    | Type     | Value
            ------------------------------
            self        | object   | *------\
            n           | int      | 1      |
            d           | int      | 2      |
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 1
        denominator | int      | 2
</code></pre></div></div>

<p>Welp, that about wraps that up. <code class="language-plaintext highlighter-rouge">__init__</code> finishes (strictly
speaking it returns <code class="language-plaintext highlighter-rouge">None</code>; the class call machinery is what hands back
the new object), and its environment is destroyed. We are now left with
something that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 1
        denominator | int      | 2
</code></pre></div></div>

<p>Note how the value of <code class="language-plaintext highlighter-rouge">half</code> points to that environment representing the
new object. These are known as <em>pointers</em> in other languages, such as C.
(Yep, we’re real creative with names in computer science.) Also, its
<em>type</em> is a <code class="language-plaintext highlighter-rouge">Fraction</code>.</p>

<p>So, let’s do something with our new fraction. What is its value
represented as a float (decimal)?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">d</span> <span class="o">=</span> <span class="n">half</span><span class="p">.</span><span class="n">numerator</span> <span class="o">/</span> <span class="n">half</span><span class="p">.</span><span class="n">denominator</span></code></pre></figure>

<p>Again, notice the dot notation and how it allows us to access the
environment within <code class="language-plaintext highlighter-rouge">half</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 1
        denominator | int      | 2
</code></pre></div></div>

<p>Let’s create a few more fractions and have some fun.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">third</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">almost_pi</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span></code></pre></figure>

<p>Now our set of environments looks like this (I’ve left out the calls to
<code class="language-plaintext highlighter-rouge">__init__</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
                                            |   |   |
                                            |   |   |
    /---------------------------------------/   |   |
    |                                           |   |
    V                                           |   |
&lt;Fraction&gt; object #1:                           |   |
        Variable    | Type     | Value          |   |
        ------------------------------          |   |
        numerator   | int      | 1              |   |
        denominator | int      | 2              |   |
                                                |   |
    /-------------------------------------------/   |
    |                                               |
    V                                               |
&lt;Fraction&gt; object #2:                               |
        Variable    | Type     | Value              |
        ------------------------------              |
        numerator   | int      | 1                  |
        denominator | int      | 3                  |
                                                    |
    /-----------------------------------------------/
    |
    V
&lt;Fraction&gt; object #3:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 22
        denominator | int      | 7
</code></pre></div></div>

<p>Converting our fraction to a float might be useful enough to put in its
own function.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="n">f</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">f</span><span class="p">.</span><span class="n">numerator</span> <span class="o">/</span> <span class="n">f</span><span class="p">.</span><span class="n">denominator</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
        to_float    | function |            |   |   |
                                            |   |   |
                                           ... ... ...
</code></pre></div></div>

<p>To use <code class="language-plaintext highlighter-rouge">to_float</code>, we give it an entire Fraction object. Yup. The whole
thing.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">many_three</span> <span class="o">=</span> <span class="n">to_float</span><span class="p">(</span><span class="n">third</span><span class="p">)</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
        to_float    | function |            |   |   |
                                            |   |   |
                                           ...  |  ...
                                                |
                                                |
    /-------------------------------------------+---\
    |                                               |
    V                                               |
&lt;Fraction&gt; object #2:                               |
        Variable    | Type     | Value              |
        ------------------------------              |
        numerator   | int      | 1                  |
        denominator | int      | 3                  |
                                                    |
Function call-&gt; to_float(third):                    |
                Variable    | Type     | Value      |
                ------------------------------      |
                f           | Fraction | *----------/
</code></pre></div></div>

<p>Notice when <code class="language-plaintext highlighter-rouge">to_float(third)</code>’s environment is created, its parameter
<code class="language-plaintext highlighter-rouge">f</code> points to the same fraction as the argument <code class="language-plaintext highlighter-rouge">third</code>. When
<code class="language-plaintext highlighter-rouge">to_float</code> begins execution, it will use the dot notation to access
values <em>within</em> <code class="language-plaintext highlighter-rouge">f</code>, or as it is here, <code class="language-plaintext highlighter-rouge">third</code>.</p>
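<p>You can check this aliasing yourself. A quick sketch (the <code class="language-plaintext highlighter-rouge">is</code> operator isn’t covered in this post — it tests whether two names point at the very same object — and the helper names below are made up for illustration):</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

third = Fraction(1, 3)

def is_same_object(f):
    # `is` checks identity: True only when both names
    # point at the very same object.
    return f is third

def set_numerator(f, n):
    # Assigning through `f` changes the one shared object...
    f.numerator = n

print(is_same_object(third))   # True
set_numerator(third, 2)
print(third.numerator)         # ...so `third` sees the change: 2
```

Because the parameter is a pointer to the same object, anything the function does through <code class="language-plaintext highlighter-rouge">f</code> happens to the object the caller passed in.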

<p>We can apply <code class="language-plaintext highlighter-rouge">to_float</code> a few times to different <code class="language-plaintext highlighter-rouge">Fraction</code>s and the
same thing will happen every time.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">zero_five</span> <span class="o">=</span> <span class="n">to_float</span><span class="p">(</span><span class="n">half</span><span class="p">)</span>
<span class="n">pi_ish</span>    <span class="o">=</span> <span class="n">to_float</span><span class="p">(</span><span class="n">almost_pi</span><span class="p">)</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
        to_float    | function |            |   |   |
        many_three  | float    | 0.333...   |   |   |
        zero_five   | float    | 0.5        |   |   |
        pi_ish      | float    | 3.14...    |   |   |
                                            |   |   |
                                           ... ... ...
</code></pre></div></div>

<p>Neat-o.</p>

<h3 id="methods">Methods</h3>

<p>Alright. Time to introduce something new. A method, as defined in the
Oxford English Dictionary, is:</p>

<p>method, <em>n.</em></p>

<p>A procedure for attaining an object.</p>

<ol>
  <li>A recommended or prescribed medical treatment for a specific disease.</li>
  <li>More generally: a way of doing anything, esp. according to
 a defined and regular plan; a mode of procedure in any activity,
 business, etc.</li>
</ol>

<p>Actually, this is close enough I can stop here, because if you have
learned anything in computer science yet, you know that we name things
in a <em>sort-of-but-not-really</em> fashion. Here’s our definition of method:</p>

<p>method, <em>n.</em></p>

<p>A procedure related to an object.</p>

<ol>
  <li>See definition for <em>function</em>.</li>
</ol>

<p>What I’m trying to get at is that there is no practical difference
between functions and methods other than <em>methods are defined within
a class and become part of the environment for objects created from that
class</em>.</p>

<p>Let’s suppose our Fraction class had the <code class="language-plaintext highlighter-rouge">to_float</code> function built right
in. Starting with a fresh global environment:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span>

<span class="k">class</span> <span class="nc">Fraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">=</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span> <span class="o">=</span> <span class="n">d</span>

    <span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span>

<span class="n">half</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">third</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">almost_pi</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span></code></pre></figure>

<p>Now all our environments are structured like <em>this</em>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
                                            |   |   |
                                            |   |   |
    /---------------------------------------/   |   |
    |                                           |   |
    V                                           |   |
&lt;Fraction&gt; object #1:                           |   |
        Variable    | Type     | Value          |   |
        ------------------------------          |   |
        numerator   | int      | 1              |   |
        denominator | int      | 2              |   |
        to_float    | function |                |   |
                                                |   |
    /-------------------------------------------/   |
    |                                               |
    V                                               |
&lt;Fraction&gt; object #2:                               |
        Variable    | Type     | Value              |
        ------------------------------              |
        numerator   | int      | 1                  |
        denominator | int      | 3                  |
        to_float    | function |                    |
                                                    |
    /-----------------------------------------------/
    |
    V
&lt;Fraction&gt; object #3:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 22
        denominator | int      | 7
        to_float    | function |
</code></pre></div></div>

<p>P rad, yeah? Now each <code class="language-plaintext highlighter-rouge">Fraction</code> object has its <em>own</em> <code class="language-plaintext highlighter-rouge">to_float</code>, much
like how it has its own numerator and denominator. So, how can we use
it?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">zero_five</span> <span class="o">=</span> <span class="n">half</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
<span class="n">many_three</span> <span class="o">=</span> <span class="n">third</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
<span class="n">pi_ish</span> <span class="o">=</span> <span class="n">almost_pi</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span></code></pre></figure>

<!-- mind blown gif  -->

<p>Yep, we use the <em>same</em> dot notation as before, only this time we attach
a <code class="language-plaintext highlighter-rouge">()</code> to the end so Python knows we’re calling a <s>function</s>
method.</p>
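<p>One way to convince yourself that a method really is just a function defined in a class: you can call it <em>through the class</em> and hand it the object as <code class="language-plaintext highlighter-rouge">self</code> explicitly. (A sketch — calling through the class like this isn’t something you’d normally write; it’s only to show the two forms are equivalent.)</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

    def to_float(self):
        return self.numerator / self.denominator

half = Fraction(1, 2)

# The usual way: Python fills in `self` for us.
a = half.to_float()

# The long way: look the function up on the class and pass
# the object in ourselves.
b = Fraction.to_float(half)

print(a, b)  # 0.5 0.5
```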

<p>A call to <code class="language-plaintext highlighter-rouge">third.to_float()</code> creates environments just like before, only
now <code class="language-plaintext highlighter-rouge">self</code> is the pointer to <code class="language-plaintext highlighter-rouge">third</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
                                            |   |   |
                                           ...  |  ...
                                                |
    /---------------------------------------+---/
    |                                       |
    V                                       |
&lt;Fraction&gt; object #2:                       |
        Variable    | Type     | Value      |
        ------------------------------      |
        numerator   | int      | 1          |
        denominator | int      | 3          |
        to_float    | function |            |
                                            |
Method call-&gt; third.to_float():             |
            Variable    | Type     | Value  |
            ------------------------------  |
            self        | Fraction | *------/
</code></pre></div></div>

<p><strong>*busts an air guitar solo*</strong></p>

<h3 id="most-things-are-object-like">Most things are object-like</h3>

<p>In Python, you can treat just about everything like an object, even
strings.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span>
<span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>       <span class="c1"># prints "Tacos" to screen
</span><span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">upper</span><span class="p">()</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">swapcase</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>       <span class="c1"># prints "TACOS" to screen
</span><span class="k">print</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>       <span class="c1"># prints "tACOS" to screen</span></code></pre></figure>

<p>Neat, yeah? So that means… <em>dun dun dunnnnnnnnnnnnnn</em>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | *------------------------------\
        b           | str      | *------------------------------)---\
        Fraction    | class    |                                |   |
        half        | Fraction | *----------\                   |   |
        third       | Fraction | *----------)---\               |   |
        almost_pi   | Fraction | *----------)---)---\           |   |
        c           | str      | *----------)---)---)---\       |   |
        d           | str      | *----------)---)---)---)---\   |   |
                                            |   |   |   |   |   |   |
                                           ... ... ... ... ... ...  |
                                                                    |
    /---------------------------------------------------------------/
    |
    V
&lt;str&gt; object #1: "Tacos"
        Variable    | Type     | Value
        ------------------------------
        upper       | function |
        swapcase    | function |
        ...         | ...      |
</code></pre></div></div>

<!-- mother of god -->

<p>Yeah, I left a lot out. I am getting lazy and all this taco-talk is
making me hungry, but I think you get the idea: the environment really
just holds <em>pointers</em> from each variable to its object.</p>
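<p>You can even watch those pointers in action with the <code class="language-plaintext highlighter-rouge">is</code> operator (not covered in this post; it tests whether two variables point at the same object). A minimal sketch:</p>

```python
b = "Tacos"
c = b          # no new string is made; c points at the same object
d = b.upper()  # upper() builds a brand-new string object

print(b is c)  # True  -- two variables, one object
print(b is d)  # False -- two different objects
print(d)       # TACOS
```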

<h3 id="classes-holding-objects-that-are-classes-holding-objects-that-are">Classes holding objects that are classes holding objects that are…</h3>

<p>Alright, let’s get real crazy here before I go eat. In addition to our
Fraction class, we’ll add ourselves a MixedFraction. MixedFractions are
whole numbers (ints) and Fraction objects combined together like peanut
butter and jelly. It’s beautiful.</p>

<p>And while we’re at it, let’s go on and create a <code class="language-plaintext highlighter-rouge">to_float</code> method that
will convert the mixed fraction into a floating point number.</p>

<p>Here goes:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">MixedFraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">whole_num</span><span class="p">,</span> <span class="n">fraction_obj</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">whole_num</span> <span class="o">=</span> <span class="n">whole_num</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fraction_obj</span> <span class="o">=</span> <span class="n">fraction_obj</span>

    <span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">val</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">whole_num</span><span class="p">)</span>
                <span class="c1"># float() is a built-in function that
</span>                <span class="c1"># can convert integers to floats.
</span>
        <span class="n">val</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fraction_obj</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
                <span class="c1"># ask the fraction for its floating point value!
</span>
        <span class="k">return</span> <span class="n">ret</span></code></pre></figure>

<p>That’s pretty straightforward, yeah? This is known as an <em>aggregation</em>
relationship, as MixedFraction is composed of a Fraction but isn’t
responsible for it (i.e., the Fraction was created outside of the class).</p>

<p>Let’s make some MixedFractions and look at the environment.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">half</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">one_and_a_half</span> <span class="o">=</span> <span class="n">MixedFraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">half</span><span class="p">)</span></code></pre></figure>

<p>Now, our environment holds this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        half            | Fraction | *----------\
        one_and_a_half  | MixedF...| *----------)---\
                                                |   |
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      |       |
        numerator   | int      | 1          |       |
        denominator | int      | 2          |       |
        to_float    | function |            |       |
                                            |       |
    /---------------------------------------)-------/
    |                                       |
    V                                       |
&lt;MixedFraction&gt; object #1:                  |
        Variable    | Type     | Value      |
        ------------------------------      |
        whole_num   | int      | 1          |
        fraction_obj| Fraction | *----------/
        to_float    | function |
</code></pre></div></div>
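<p>Before we call anything, it’s worth seeing what that shared pointer means in practice. A self-contained sketch (using the classes from above, with <code class="language-plaintext highlighter-rouge">to_float</code> returning <code class="language-plaintext highlighter-rouge">val</code>): since <code class="language-plaintext highlighter-rouge">half</code> and <code class="language-plaintext highlighter-rouge">one_and_a_half.fraction_obj</code> point at the same Fraction, a change through one name is visible through the other.</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

    def to_float(self):
        return self.numerator / self.denominator

class MixedFraction:
    def __init__(self, whole_num, fraction_obj):
        self.whole_num = whole_num
        self.fraction_obj = fraction_obj

    def to_float(self):
        val = float(self.whole_num)
        val += self.fraction_obj.to_float()
        return val

half = Fraction(1, 2)
one_and_a_half = MixedFraction(1, half)

# Both names point at the one Fraction object, so modifying it
# through `half` changes what `one_and_a_half` computes.
half.denominator = 4
print(one_and_a_half.to_float())  # 1.25
```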

<p>If we were to, for example, call <code class="language-plaintext highlighter-rouge">to_float</code> on <code class="language-plaintext highlighter-rouge">one_and_a_half</code>, what would
happen?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">z</span> <span class="o">=</span> <span class="n">one_and_a_half</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span></code></pre></figure>

<p>I’ll work this one step by step. I just ordered Jimmy John’s for
delivery so we got time.</p>

<p>First, we <em>ask</em> <code class="language-plaintext highlighter-rouge">one_and_a_half</code> to execute the <code class="language-plaintext highlighter-rouge">to_float</code> method. A new
temporary environment is created for it to work in, but isn’t very
interesting since <code class="language-plaintext highlighter-rouge">MixedFraction.to_float</code> takes no parameters besides <code class="language-plaintext highlighter-rouge">self</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        half            | Fraction | *----------\
        one_and_a_half  | MixedF...| *----------)---\
                                                |   |
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      |       |
        numerator   | int      | 1          |       |
        denominator | int      | 2          |       |
        to_float    | function |            |       |
                                            |       |
    /---------------------------------------)---+---/
    |                                       |   |
    V                                       |   |
&lt;MixedFraction&gt; object #1:                  |   |
        Variable    | Type     | Value      |   |
        ------------------------------      |   |
        whole_num   | int      | 1          |   |
        fraction_obj| Fraction | *----------/   |
        to_float    | function |                |
                                                |
                                                |
Method call-&gt; one_and_a_half.to_float():        |
            Variable    | Type     | Value      |
            ------------------------------      |
            self        | MixedF...| *----------/
</code></pre></div></div>

<p>This should look familiar, because it is the same thing as when we did
<code class="language-plaintext highlighter-rouge">third.to_float()</code> before. However, the <code class="language-plaintext highlighter-rouge">MixedFractions</code> version of
<code class="language-plaintext highlighter-rouge">to_float</code> is a whole lot different when it executes.</p>

<p>Here’s <code class="language-plaintext highlighter-rouge">MixedFraction</code>’s <code class="language-plaintext highlighter-rouge">to_float</code> for reference:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">val</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">whole_num</span><span class="p">)</span>
            <span class="c1"># float() is a built-in function that
</span>            <span class="c1"># can convert integers to floats.
</span>
    <span class="n">val</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fraction_obj</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
            <span class="c1"># ask the fraction for its floating point value!
</span>
    <span class="k">return</span> <span class="n">ret</span></code></pre></figure>

<p>First, on line 2, it gets the floating point of the whole number part
and stores it to a variable cleverly named <code class="language-plaintext highlighter-rouge">val</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                           ...     ...
                                            |       |
    /---------------------------------------)---+---/
    |                                       |   |
    V                                       |   |
&lt;MixedFraction&gt; object #1:                  |   |
        Variable    | Type     | Value      |   |
        ------------------------------      |   |
        whole_num   | int      | 1          |   |
        fraction_obj| Fraction | *----------/   |
        to_float    | function |                |
                                                |
                                                |
Method call-&gt; one_and_a_half.to_float():        |
            Variable    | Type     | Value      |
            ------------------------------      |
            self        | MixedF...| *----------/
            val         | float    | 1.0
</code></pre></div></div>

<p>Then, on line 6, it does something we haven’t seen before: double dots!
But by now, you should be able to smell what The Rock is cookin’.</p>

<ol>
  <li>The first dot resolves <code class="language-plaintext highlighter-rouge">self</code> to the <code class="language-plaintext highlighter-rouge">MixedFraction</code> object.</li>
  <li>The second dot resolves <code class="language-plaintext highlighter-rouge">fraction_obj</code> to the <code class="language-plaintext highlighter-rouge">Fraction</code> object.</li>
  <li>Then, we ask <em>that</em> <code class="language-plaintext highlighter-rouge">Fraction</code> to execute <em>its</em> <code class="language-plaintext highlighter-rouge">to_float</code> method.</li>
</ol>
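<p>Here’s a self-contained sketch that unrolls those double dots into two single-dot steps using a temporary variable (the name <code class="language-plaintext highlighter-rouge">frac</code> is made up for illustration) — it does exactly what the chained version does:</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

    def to_float(self):
        return self.numerator / self.denominator

class MixedFraction:
    def __init__(self, whole_num, fraction_obj):
        self.whole_num = whole_num
        self.fraction_obj = fraction_obj

    def to_float(self):
        val = float(self.whole_num)

        # The chained dots, one step at a time:
        frac = self.fraction_obj  # dot #1: follow self to the Fraction
        val += frac.to_float()    # dot #2: ask *that* object for its float

        return val

one_and_a_half = MixedFraction(1, Fraction(1, 2))
print(one_and_a_half.to_float())  # 1.5
```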

<p>By the time we’ve done all of that, we’ve got this mess:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                               ... ...
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      +-------)---\
        numerator   | int      | 1          |       |   |
        denominator | int      | 2          |       |   |
        to_float    | function |            |       |   |
                                            |       |   |
                                            |       |   |
    /---------------------------------------)---+---/   |
    |                                       |   |       |
    V                                       |   |       |
&lt;MixedFraction&gt; object #1:                  |   |       |
        Variable    | Type     | Value      |   |       |
        ------------------------------      |   |       |
        whole_num   | int      | 1          |   |       |
        fraction_obj| Fraction | *----------/   |       |
        to_float    | function |                |       |
                                                |       |
                                                |       |
                                                |       |
Method call-&gt; one_and_a_half.to_float():        |       |
            Variable    | Type     | Value      |       |
            ------------------------------      |       |
            self        | MixedF...| *----------/       |
            val         | float    | 1.0                |
                                                        |
Method call-------&gt; self.fraction_obj.to_float()        |
                    Variable    | Type     | Value      |
                    ------------------------------      |
                    self        | Fraction | *----------/
</code></pre></div></div>

<p>UGH.</p>

<p>We are still talking about line 6. Note that the environment for this
call has its <em>own</em> <code class="language-plaintext highlighter-rouge">self</code> within. That <code class="language-plaintext highlighter-rouge">self</code> is the <code class="language-plaintext highlighter-rouge">Fraction</code>.
Thankfully this method doesn’t do a whole lot: it returns the
<code class="language-plaintext highlighter-rouge">Fraction</code> represented as a floating point value pretty much
immediately. So, that temporary environment is destroyed and we are left
with this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                           ...     ...
                                            |       |
    /---------------------------------------)---+---/
    |                                       |   |
    V                                       |   |
&lt;MixedFraction&gt; object #1:                  |   |
        Variable    | Type     | Value      |   |
        ------------------------------      |   |
        whole_num   | int      | 1          |   |
        fraction_obj| Fraction | *----------/   |
        to_float    | function |                |
                                                |
                                                |
Method call-&gt; one_and_a_half.to_float():        |
            Variable    | Type     | Value      |
            ------------------------------      |
            self        | MixedF...| *----------/
            val         | float    | 1.5
</code></pre></div></div>

<p>Finally, we have our <code class="language-plaintext highlighter-rouge">MixedFraction</code> as a float, and this method call
environment returns <code class="language-plaintext highlighter-rouge">val</code> and is destroyed. Now we can update our global
environment with <code class="language-plaintext highlighter-rouge">z</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        half            | Fraction | *----------\
        one_and_a_half  | MixedF...| *----------)---\
        z               | float    | 1.5        |   |
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      |       |
        numerator   | int      | 1          |       |
        denominator | int      | 2          |       |
        to_float    | function |            |       |
                                            |       |
    /---------------------------------------)-------/
    |                                       |
    V                                       |
&lt;MixedFraction&gt; object #1:                  |
        Variable    | Type     | Value      |
        ------------------------------      |
        whole_num   | int      | 1          |
        fraction_obj| Fraction | *----------/
        to_float    | function |
</code></pre></div></div>

<p>Awesome.</p>

<h1 id="inheritance">Inheritance</h1>

<p>What if we were drunk and decided to make <code class="language-plaintext highlighter-rouge">MixedFraction</code> inherit from
<code class="language-plaintext highlighter-rouge">Fraction</code>? That seems like a totally reasonable thing to do, right?
After all, isn’t a mixed fraction just a special representation of
a fraction?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">MixedFraction</span><span class="p">(</span><span class="n">Fraction</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">whole_num</span><span class="p">,</span> <span class="n">numerator</span><span class="p">,</span> <span class="n">denominator</span><span class="p">):</span>
        <span class="n">new_num</span> <span class="o">=</span> <span class="n">numerator</span> <span class="o">+</span> <span class="p">(</span><span class="n">whole_num</span> <span class="o">*</span> <span class="n">denominator</span><span class="p">)</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">new_num</span><span class="p">,</span> <span class="n">denominator</span><span class="p">)</span></code></pre></figure>

<p>And look at that, we are pretty much done! <code class="language-plaintext highlighter-rouge">MixedFraction</code> will inherit
the <code class="language-plaintext highlighter-rouge">Fraction</code> version of <code class="language-plaintext highlighter-rouge">to_float</code>, and because of how we wrote our
constructors, everything will just <em>work</em>. So what about this <code class="language-plaintext highlighter-rouge">super()</code>
business?</p>

<p>Let’s start with a clean environment and make ourselves a <code class="language-plaintext highlighter-rouge">MixedFraction</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">one_and_a_half</span> <span class="o">=</span> <span class="n">MixedFraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">taco</span> <span class="o">=</span> <span class="n">one_and_a_half</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |

Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *------\
            whole_num   | int      | 1      |
            numerator   | int      | 1      |
            denominator | int      | 2      |
                                            |
    /---------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>When its constructor begins executing, we calculate a <code class="language-plaintext highlighter-rouge">new_num</code> value
that represents the whole number added back into the fraction’s numerator.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *------\
            whole_num   | int      | 1      |
            numerator   | int      | 1      |
            denominator | int      | 2      |
            new_num     | int      | 3      |
                                            |
    /---------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>Alright, now things get cray cray. We make a call to <code class="language-plaintext highlighter-rouge">super()</code>, and then
use the dot notation on that? What the…?</p>

<p>Since it is just a function call, what does <code class="language-plaintext highlighter-rouge">super()</code> return?  Well,
that’s for another discussion, but it returns something we can call
the “super object”. The super object is an object that we can ask, just
as before, to execute methods for us, but using the versions <em>from the
superclass of the class we are in</em>. That lets us reach a method even
when both the subclass and the class it inherits from define one.</p>

<p>In this instance, <code class="language-plaintext highlighter-rouge">super()</code> basically operates as an alias for
<code class="language-plaintext highlighter-rouge">Fraction</code>: it is how we tell Python which version to run when a method,
like the constructor, is defined in both classes.</p>
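<p>Here is that whole dance as a runnable sketch. The <code class="language-plaintext highlighter-rouge">Fraction</code> definition is my reconstruction of the one from earlier in the post; the <code class="language-plaintext highlighter-rouge">MixedFraction</code> is the inheriting version from above:</p>

```python
class Fraction:
    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator

    def to_float(self):
        return self.numerator / self.denominator


class MixedFraction(Fraction):
    def __init__(self, whole_num, numerator, denominator):
        new_num = numerator + (whole_num * denominator)
        # super() hands back a proxy that runs Fraction.__init__
        # with self still bound to this MixedFraction instance
        super().__init__(new_num, denominator)


one_and_a_half = MixedFraction(1, 1, 2)
print(one_and_a_half.numerator, one_and_a_half.denominator)  # 3 2
print(one_and_a_half.to_float())  # 1.5
```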

<p>So, we make the call to the constructor of <code class="language-plaintext highlighter-rouge">Fraction</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *----------\
            whole_num   | int      | 1          |
            numerator   | int      | 1          |
            denominator | int      | 2          |
            new_num     | int      | 3          |
                                                |
Method call----&gt; Fraction.__init__(self, 3, 2)  |
                Variable    | Type     | Value  |
                ------------------------------  |
                self        | MixedF...| *------+
                numerator   | int      | 3      |
                denominator | int      | 2      |
                                                |
    /-------------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>Now we begin execution of the constructor of <code class="language-plaintext highlighter-rouge">Fraction</code>. Notice how
the <code class="language-plaintext highlighter-rouge">self</code> within its environment is the <code class="language-plaintext highlighter-rouge">MixedFraction</code>! Baller! It
completes and is destroyed, leaving us this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *----------\
            whole_num   | int      | 1          |
            numerator   | int      | 1          |
            denominator | int      | 2          |
            new_num     | int      | 3          |
                                                |
    /-------------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 3
        denominator | int      | 2
        to_float    | function |
</code></pre></div></div>

<p>Anywhozzles, once the constructor of <code class="language-plaintext highlighter-rouge">MixedFraction</code> completes, we are
left with an environment that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        one_and_a_half  | MixedF...| *------\
                                            |
    /---------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 3
        denominator | int      | 2
        to_float    | function |
</code></pre></div></div>
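<p>A quick sanity check on the claim that <code class="language-plaintext highlighter-rouge">MixedFraction</code> inherits the <code class="language-plaintext highlighter-rouge">Fraction</code> version of <code class="language-plaintext highlighter-rouge">to_float</code> (again, the <code class="language-plaintext highlighter-rouge">Fraction</code> body is my reconstruction):</p>

```python
class Fraction:
    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator

    def to_float(self):
        return self.numerator / self.denominator


class MixedFraction(Fraction):
    def __init__(self, whole_num, numerator, denominator):
        super().__init__(numerator + whole_num * denominator, denominator)


one_and_a_half = MixedFraction(1, 1, 2)
# The subclass really shares the very same to_float function:
print(MixedFraction.to_float is Fraction.to_float)  # True
# And a MixedFraction now counts as a Fraction, too:
print(isinstance(one_and_a_half, Fraction))         # True
print(one_and_a_half.to_float())                    # 1.5
```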

<p>Cool, right? Okay, my sandwich is here. Time to go. Until next time…</p>]]></content><author><name></name></author><category term="programming" /><category term="object oriented" /><category term="education" /><category term="python" /><category term="lecture notes" /><summary type="html"><![CDATA[Students often have trouble grasping the difference between objects, classes, and the variables which hold them. This article aims to explain object oriented programming by example in Python.]]></summary></entry></feed>