Diagnosing livers with machine learning
We will examine an unlabeled dataset of liver measurements that are believed to be indicative of liver disorders, along with the number of drinks each person has per day, and use it to build a model. We will use K-means clustering to find interesting clusters within the dataset, and cross-validation and ensemble learning to fine-tune the model.
Exploring the data
We will examine the traits of the liver dataset so we can understand its shape and the relationships within it.
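To make the clustering step concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The six two-feature points are made up for illustration and are not taken from the liver dataset, and the helper names (`kmeans`, `squared_distance`, `mean_point`) are hypothetical. Seeding the centroids with the first k points is an assumption made to keep the run deterministic; real implementations usually pick random or smarter seeds.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def mean_point(points):
    """Component-wise mean of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans(points, k, max_iterations=100):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels)."""
    # Assumption: seed with the first k points so the result is deterministic.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(mean_point(members))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, labels


# Made-up (feature_1, feature_2) pairs forming two obvious groups.
toy_points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(toy_points, k=2)
print(labels)  # [0, 1, 0, 0, 1, 1]
```

On real data the features would need scaling first, since K-means relies on raw distances.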
Explaining Gradient Descent
A classic statistics problem
In machine learning, problems are often framed as optimization problems. For example, let us take one of the simplest applications of supervised learning: linear regression. It’s conceptually an easy problem - given a set of data points, create a function that best approximates these points. If you ever took statistics or have used Excel, this is the line of best fit. Note that this is a supervised learning problem - we need to have a set of observed data to “train” from. We use this data to create the line of best fit, which we can then use to make predictions for new inputs.
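As a rough sketch of what "framing it as an optimization problem" looks like, the snippet below fits a line by running gradient descent on the mean squared error. The function names, the learning rate, the step count, and the toy data (generated from y = 2x + 1) are all assumptions chosen for illustration, not values from the article.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line on the training data.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


def predict(slope, intercept, x):
    """Use the fitted line to predict y for a new input x."""
    return slope * x + intercept


# Toy training data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(slope, intercept, 10.0))
```

The learning rate matters here: too large and the updates diverge, too small and the fit needs far more steps. For plain linear regression a closed-form solution also exists, so gradient descent is mostly useful as a stand-in for models where no such formula is available.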
Debugging C with GDB and Valgrind
Debugging C can be a chore, but being able to pinpoint your memory leaks with Valgrind and monitor the flow of your program with GDB (or LLDB) can speed up development considerably. It’s a significant improvement over sticking a bunch of printf statements in your code and taking them out before production (which should by no means be your sole tool for debugging).
Setup
Installing gdb is pretty easy on both OS X and Linux. Use your favorite package manager to install gcc and gdb. You can also install the LLVM toolchain, which uses lldb as its debugging tool rather than gdb.
A Primer on Pointers
A lot of people have trouble grasping manual memory management when they first encounter it. The syntax can be a little confusing and debugging can be an extremely painful process - whether you have a segfault or Valgrind is complaining that you still have memory leaks.
Using pointers in C, while a little intimidating at first, is not as difficult as most people expect.
C gives you a lot of access to memory - you can allocate bytes for use in your program, you can deallocate them, you can directly access memory addresses in your program, and pass them around wherever you want. The possibilities are endless!
About
About I’m a quantitative developer in NYC. I studied Computer Science at Dartmouth College. I’m very into cycling, playing guitar, and photography.
I enjoy learning CS theory, in particular graphics and rendering. My research was on Monte Carlo sampling theory, and I hope to continue contributing to the field. Outside of graphics, I find compilers and type theory rather interesting, and I like programming languages such as Rust, Haskell, and Idris.
Projects
Projects diffsitter A semantic difftool that parses a file’s AST to compute more meaningful diffs.
Setting up Neovim
Note: You can find an updated version of this article here
Vim is an excellent text editor. It’s fast, has a light footprint, and tends to be installed on most Unix systems you’ll come across. You can use it over SSH, and once you get the hang of the keybindings, you’ll find that editing with it is very quick.
Neovim tries to strip away some of Vim’s cruft. It’s completely asynchronous (though Vim also introduced async in Vim 8), so plugins shouldn’t block normal operations. It also includes nice features such as true 24-bit color support and, by default, everything that’s found in the vim-sensible plugin.