====== Writing Papers (Dr. Denton) ====== Discussion on what to put into a paper. Also some notes on deadlines for the next papers. ===== Alan's Paper Writing Recommendations ===== * Use present tense * Avoid short paragraphs * Punctuate equations ===== Paper Components ===== Abstract, Introduction, and Conclusion are fairly fixed. The others are required but may be arranged as appropriate I. Abstract * Summary of the entire paper. Not an introduction, so maybe one line of motivation here. II. Introduction * Motivation section... 1.Why do people need it? 2.How is it new? 3.Why/how is it successful? III. Background 1. Related Works 2. Background formalism (e.g., ARM definitions) IV. Algorithm * Should first concentrate on results, then define the algorithms with really new math (should be about 2 pages) [This is a good milestone] V. Results * Plots- Good place to start the writing process (after motivation) * Additional experimental description can wait. VI. Conclusion 2. How is it new? Can be answered with: * More accurate (be careful, this is tempting but hard to achieve) * Better scaling/complexity * Solves a new problem * A new perspective to an old problem 3. How is it successful? Can be shown by: * Results from 2 * Better numbers for accuracy and complexity * New approach is better than naive approach * Naive must be practically naive (for completely new approaches can be very much so) * New solutions may only be able to compare to the latest old solution PLOTS: Probably need about as many as half the number of pages. For ICDM the pages is 12, so need about 6 figures. * One to show Quality * One to show Efficiency * One or two for partial results (performance key concepts) * One or two drawings of key concepts (such as graphical version of data or search space) Be sure that figures are legible at the size they will be printed/read at. This means legends should be correct (readable) size, text is generally kept at the in-text size, and series lines should be distinguishable. As Michael Stonebraker has said we need Respectable Graphs and Equations (RGE) * Graphs are as above to show results * Equations are the new math in the algorithm section ===== ICDM Paper Deadlines ===== Those submitting to the next conference (ICDM July 5) must meet the following deadlines: Next Tuesday (June 6): Five page draft for a target paper of 10-12 pages. * A couple of plots * Related work (about 8 references) * A start on new math * Work on the three introduction points * (May do abstract if you want) This should provide motivation for the completion of your theory and experiments. Typically you can expect 2 more weeks of writing if you have this preliminary introduction with a good start on respectable results. Note that Dr. Denton may do final introductions for first time students (or students in general), but you still should do introduction work as motivation and direction. ===== LaTex and Writing ===== We have a general standard of using LaTex, but it is not required. You may use Word if you are more comfortable (most writing is just text anyways). Keep in mind Dr. Denton would tend to contribute some formulae for respectable equations. In this case, she will still use hand-written or LaTex -- you must convert that into your format. Our current setup is to use a free version of MikTex along with any editor with syntax highlighting and macro ability (to compile the latex). Examples are TextPad with the appropriate style file (you will need to setup the compile macros), or WinTex on a cd in the lab, or a plug-in for the Eclipse editor. ===== SSE and ARM on Continuous Multi-dimenstional data (Matt) ===== An update on Matt's work. A big point is that typical statistics is concerned with the entire data set. We will focus on finding the most useful subset of the data using measures of sum of squared errors (SSE) and Person's Correlation (phi). Typical ARM finds itemsets that have support >= to the min. support. We are looking for data subsets that have a SSE <= max. SSE. More specific, the highest SSE below the threshold. The search has two aspects. 1) Intersection of transactions by growing the number of attributes considered. For example, attribute A and B have <= transactions as either A or B by themselves. Similar, thresholds of adjacent values for a specific attribute have a union relation to transactions. For example, A with range 1 or range 1 to 2 (i.e., A(1) or A(1,2)). The single attribute setup goes like:\\ '' 1\\ 2\\ 3\\ 12\\ 23\\ 123\\ '' Multiple attributes go like:\\ ''A(1)\\ A(12)\\ B(1)\\ C(1)\\ A(1)B(1)\\ :\\ :\\ '' So, SSE across attribute sets is upward closed while SSE within a single attribute is downward closed. Since we want the max. SSE below the threshold we could start with the largest set for each attribute and combine different ranges for each. Once we drop below the threshold for an attribute combination, we do not need to look at smaller single attribute set level (since the smaller ones will always have a smaller SSE). For example, if {A(23), B(12)} <= max SSE then we do not need to consider changing A to A(2) or A(3).