Tips and impressions from CUSP hackathon 2016-2017 series

At the NYU Center for Urban Science and Progress (CUSP) we hold regular hackdays for our students. As the 2016-2017 hackathon season ends, here are some of my impressions from the events we had, and some tips on what you, students, should work on and improve.

Feedback from 2017 CUSP hackathon program

The most noticeable flaws, according not only to me but to all the spectators and judges I spoke with afterward, were in the presentations. You must step up your game on that! That is your main gateway to a job - and no matter how good your work is, if you cannot present it well nobody will know! Small things like appropriate axis labels, an appropriate font size in your plots and tables, or just running a spell check on your slides are super important! You will not be understood or taken seriously if you do not take care of these things. Every time I see CUSP students’ presentations someone (sometimes everyone) gets called out on that! We do not call you out because we want to bug you, or because it is a pet peeve of ours: it is truly very, very important. It’s a shame that Justin did not hold his public speaking seminar “Even a Geek can Speak” (due to low interest among students…).

In my opinion, here are the four golden rules of giving a presentation:

  • Know your audience: pitch the content and style appropriately (e.g. if it is not a technical crowd you are speaking to, don’t be too technical). If it is a mixed audience that can be tricky, but it is never justified to give a talk that ½ (⅓…⅙) of the crowd does not understand or, more importantly, does not care about.

  • Engage your audience: think of your talk as a story. Most people tend to relate what they did in the order in which they did it, but that is not usually the right structure for a presentation. If I had been giving a talk to agency clients and CUSP colleagues yesterday, I would have framed it as “An agency came to us with the following question. They recommended the following data to answer the question. This is the dataset, these are the reasons why it is adequate, and this is where it falls short (and I integrated it with these data). These are the solutions we designed, and this is what we obtained and where we fell short. This work should be continued as follows to achieve a better result. In summary: we had this question, we answered it as follows, we could improve the answer on these aspects.”

  • Be clear: know everything that you put in your slides. Make sure the audience can see and understand everything you put in your slides, ideally by just looking, definitely by looking and listening to your explanation. If it is in your slides, it should be discussed. Figures, and especially maps, are not the analysis and generally are not the result! Use them as a prop to help the audience understand what you are describing, but they are not a substitute for your words (that is why I insist on captions in all PUI figures). Just like when designing a caption, when you show a figure in your talk you need to tell the audience WHY they are shown that figure, WHAT THEY SHOULD NOTICE in that figure, and WHAT IT MEANS. And you must describe your data! What information is available? Without knowing that, the audience cannot understand (or evaluate) what you did at all.

  • Be concrete: give examples and analogies so that concepts become intuitive. The more technical the tone of your talk, the more this is needed to make sure everyone is with you. If I tell you the closest star is 1.3 parsecs away, it is not as compelling as if I say that traveling at the speed of light it would take you 4 years and 3 months to get there, and that it is 265,000 times as far as the Sun. If you say crime is bad somewhere, compare it to crime in an NYC neighborhood, where your audience has a sense of what it is. If you say it is hard to get to polling places, say how long it takes and how much the trip costs. Here https://medium.com/towards-data-science/enchanted-random-forest-b08d418cb411 is a blog post with analogies to understand random forests.

A lot of the groups were working with large datasets, sometimes really big data, and I caught nearly all of you developing on the full dataset. You must reduce your data to prototype the solution, especially with limited time! You have 7 hours: you cannot stare at your screen waiting 20 minutes for a PCA to run without knowing if it is the right thing to do, or 15 minutes for a file to be parsed by pandas! Extract 1/10th of the data and prototype on that: copy the top 10% of the file to a new file (opening it in a text editor to make sure the file structure is preserved when you cut it down) and develop the code that extracts info from the file on the reduced file. Prototype your solution, then apply it to the entire dataset.
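
As a rough illustration of the "copy the top 10%" step, here is a minimal sketch using only the Python standard library. The function name `write_head_sample` and the file names in the usage comment are made up for this example, and it assumes a line-oriented text file (such as a simple CSV) with exactly one header line. Keep in mind that the top of a file is not necessarily representative of the whole (for example if the rows are sorted by date), so treat the reduced file as a prototyping aid only.

```python
import itertools
from pathlib import Path


def write_head_sample(source, target, fraction=0.1):
    """Copy the header line plus the first `fraction` of the data lines.

    Assumes one record per line and a single header line.
    Returns the number of data lines written.
    """
    source, target = Path(source), Path(target)

    # First pass: count lines without loading the file into memory.
    with source.open("r", newline="") as f:
        total_lines = sum(1 for _ in f)

    # Keep at least one data line so the sample is never header-only.
    n_data = max(1, int((total_lines - 1) * fraction))

    # Second pass: stream the header and the first n_data lines to the target.
    with source.open("r", newline="") as src, target.open("w", newline="") as dst:
        dst.writelines(itertools.islice(src, n_data + 1))

    return n_data


# Hypothetical usage (file names are placeholders):
# write_head_sample("full_dataset.csv", "sample_dataset.csv", fraction=0.1)
```

If you are already in pandas, `pandas.read_csv(path, nrows=...)` reads only the first rows without creating a new file, which may be enough for prototyping.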

Uncertainties!! I am a physicist and physicists have a mild obsession with uncertainties, but really, a quantity shown without its uncertainty is meaningless! If the mean age is 50 with a standard deviation of 40, your story and your conclusion had better be different than if the mean age is 50 +/- 2! (As of today, I will make it a rule for myself that when I show a map I always also show the map of its uncertainty.)
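
To make the point concrete, here is a small sketch with two made-up samples of ages that share the same mean but have very different spreads. The numbers and the helper name `summarize` are invented for illustration; the only point is that reporting the mean alone would make the two samples look identical.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    sem = std / math.sqrt(len(values))
    return mean, std, sem


# Two made-up samples: same mean age, very different spread.
narrow = [48, 49, 50, 51, 52]
wide = [5, 20, 50, 80, 95]

for name, ages in [("narrow", narrow), ("wide", wide)]:
    mean, std, sem = summarize(ages)
    print(f"{name}: mean age = {mean:.1f} +/- {std:.1f} (std), standard error = {sem:.1f}")
```

Which uncertainty to quote (the spread of the population or the standard error of the mean) depends on the claim you are making, so say explicitly which one you are showing.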

PCA must be the most commonly misused analysis technique; many papers and documents have actually been written about it! PCA is powerful, but it serves very specific purposes (most commonly removing noise to avoid overfitting). It can be used as a dimensionality reduction technique, BUT THE RESULTING DIMENSIONS (features) ARE NOT GENERALLY A SUBSET OF THE ORIGINAL FEATURES, so in general it does not answer questions like “which is the most important feature in my model?” (decision trees and simple regression can answer that question). In plain words: if you are trying to fit the outcome of something depending on race, income, gender, education, age, and several other variables, PCA does not answer the question “is race the main driver?” because the principal components (the new features) are NOT a subset of the original ones, but (linear) combinations thereof!
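
Here is a minimal sketch of that last point, using NumPy only and randomly generated data (the feature names and the way the data are generated are assumptions made up for this example). PCA is computed through an SVD of the mean-centered data, and each principal component is printed as a weighted sum of the original features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 samples and 3 hypothetical features.
feature_names = ["income", "education", "age"]
income = rng.normal(size=200)
education = 0.8 * income + 0.2 * rng.normal(size=200)  # correlated with income
age = rng.normal(size=200)
X = np.column_stack([income, education, age])

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                      # each row is one principal component
explained = S**2 / np.sum(S**2)      # fraction of variance along each component

# Each component is a linear combination of ALL the original features.
for i, (weights, frac) in enumerate(zip(components, explained), start=1):
    terms = " ".join(f"{w:+.2f}*{name}" for w, name in zip(weights, feature_names))
    print(f"PC{i} ({frac:.0%} of variance) = {terms}")
```

In this toy setup the first component should mix income and education rather than pick one of them, which is exactly why "the first principal component" is not the same thing as "the most important feature". In scikit-learn the same weights are exposed as the `components_` attribute of a fitted `sklearn.decomposition.PCA`.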

In general: you are learning a lot of skills and familiarizing yourselves with a lot of tools, but you need to pay attention to which tool is appropriate to answer a question. That is generally a question that can be asked early, before all the data is clean and ready, so in a hackathon someone in the team should immediately start thinking about what the appropriate analysis tools are and figure out how to run the packages! I also recommend, among other things, that you read more scientific papers to better understand the workflow of a scientist at work: if you want to be a scientist, you have to study the product of the work of scientists, which is not class material but mostly articles.