PHP/KML Polyline Simplification with Douglas-Peucker
by Clay vanSchalkwijk on April 20, 2009
Quality GIS data often comes with far more precision than Google Maps (or other mapping software) can use. The problem lies in the number of points representing each polygon that you want to overlay. The county boundaries for a single state might include 100,000 points, which is not usable without some form of reduction. Luckily, there is an algorithm that solves this problem: Douglas-Peucker.
The algorithm simplifies a polyline by removing vertices that do not contribute (sufficiently) to the overall shape. It is a recursive process that finds the most important vertices for any given reduction. First, the most basic reduction is assumed: a single segment connecting the beginning and end of the original polyline. This is where the recursion starts. The most significant vertex (the one most distant from this segment) is found and, if the distance from this vertex to the segment exceeds the reduction tolerance, the segment is split into two sub-segments, each inheriting a subset of the original vertex list. Each segment continues to subdivide until none of the vertices in its local list is farther away than the tolerance value.
There is a PHP class that does just this: Douglas-Peucker Polyline Simplification in PHP by Anthony Cartmell. Depending on the original quality of the data and the tolerance level, I was able to achieve a 90-93% reduction in size. This reduction allows me to present significantly more data to clients at a reasonable performance level. Keep in mind that this reduction removes data from the coordinate array, so the quality of your representation goes down as the tolerance and the reduction increase. I highly suggest that you play around with the tolerance until you find a good balance between data size and image quality.
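To make the recursion concrete, here is a minimal sketch of the idea in Python. It is not the PHP class mentioned above, and the function names are my own. It treats coordinates as flat x/y pairs, which is only an approximation for lat/lng data, and a very long polyline could hit Python's recursion limit.

```python
import math

def distance_to_segment(point, start, end):
    """Shortest planar distance from point to the segment start-end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    if dx == 0 and dy == 0:
        # Degenerate segment: start and end are the same point.
        return math.hypot(px - sx, py - sy)
    # Project the point onto the segment and clamp to its endpoints.
    t = ((px - sx) * dx + (py - sy) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (sx + t * dx), py - (sy + t * dy))

def simplify(points, tolerance):
    """Douglas-Peucker: keep only vertices farther than tolerance from the reduction."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    # Find the most significant (most distant) vertex for this segment.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = distance_to_segment(points[i], start, end)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist > tolerance:
        # Split at that vertex; each half inherits its share of the vertices.
        left = simplify(points[:index + 1], tolerance)
        right = simplify(points[index:], tolerance)
        return left[:-1] + right  # drop the duplicated split vertex
    # Nothing is far enough away to matter: keep just the endpoints.
    return [start, end]
```

Calling `simplify(points, tolerance)` with a larger tolerance drops more vertices, which is the trade-off between data size and image quality described above.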
PHP GIS Functions
by Clay vanSchalkwijk on April 14, 2009
I have been doing a lot of PHP and GIS consulting for CitySquares and the History Engine. I found that searching for everything I needed for basic processing and Google integration was tedious and painful. So here is a collection of common functions that helped me massage the data and get it ready for integration.
- pnPoly - Determines whether a coordinate falls inside a polygon.
- Centroid - Finds the center of a polygon.
- Area - Calculates the area of a polygon.
- googleGeoCoder - Looks up GIS information for an address using Google Maps.
- PolylineEncoder - Takes a set of coordinates and encodes it for Google Maps.
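For readers who want to see what the first three helpers do, here is a rough Python sketch of the standard techniques behind them: a ray-casting test for point-in-polygon and the shoelace formula for area and centroid. These are not the linked PHP functions. The names are my own, and the polygon is assumed to be a list of flat (x, y) vertices that does not cross itself.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many edges a ray from (x, y) crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does this edge straddle the horizontal line through y,
        # and does it cross that line to the right of x?
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def signed_area(polygon):
    """Shoelace formula; positive when vertices run counter-clockwise."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0

def area(polygon):
    return abs(signed_area(polygon))

def centroid(polygon):
    """Area-weighted center; raises ZeroDivisionError for a zero-area polygon."""
    a = signed_area(polygon)
    cx = cy = 0.0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (6.0 * a), cy / (6.0 * a)
```

On raw lat/lng values the area comes out in square degrees, so project the coordinates first if you need real-world units.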
If you ran into the same problem I did, where a lot of the data arrives as shp/dbf files and needs to be parsed into something friendlier such as KML or CSV, there are a couple of solutions. If your source coordinates are already in lat/lng, you can parse out the data with shp2text. If you have a different coordinate system and use ArcGIS, you can try the plugin Export to KML 2.5.3 to export the data from the ESRI suite of products.
Once your data is in SQL, the following query is an example of distance sorting with SQL. You can grab a copy of the zip_codes database here and play around with it.
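-- Approximate distance in miles from the point (37.6, -77.6).
-- 69.1 is miles per degree of latitude; 53.0 is a rough figure for
-- miles per degree of longitude at mid-latitudes (around 40 degrees).
SELECT *,
  SQRT(POW(69.1 * (37.6 - latitude), 2) +
       POW(53.0 * (-77.6 - longitude), 2)) AS distance
FROM `zip_codes`
-- HAVING (rather than WHERE) lets MySQL filter on the computed alias.
HAVING distance < 10
ORDER BY distance ASC;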
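If you want to sanity-check the numbers the query returns, the same arithmetic is easy to reproduce outside the database. The Python sketch below uses the query's constants for the flat approximation and adds a haversine great-circle distance for comparison. The function names and the 3958.8 mile Earth radius are my own choices, not something taken from the zip_codes database.

```python
import math

def approx_miles(lat1, lon1, lat2, lon2):
    """Flat approximation using the same constants as the SQL query."""
    return math.hypot(69.1 * (lat2 - lat1), 53.0 * (lon2 - lon1))

def haversine_miles(lat1, lon1, lat2, lon2, radius_miles=3958.8):
    """Great-circle distance on a sphere, for comparison."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_miles * math.asin(math.sqrt(a))
```

The flat version is good enough for sorting nearby zip codes, but the fixed 53.0 factor drifts as you move away from the latitude it was tuned for.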
Bayesian filter training with N-gram
by Clay vanSchalkwijk on March 31, 2009
Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event. A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in one category or another. The most common application of the filter is identifying words that appear in spam versus legitimate emails. A word by itself is often useless without the context in which it was used.
There is a whole suite of tools that can break down content to help improve the filter by supplementing it not only with a database mapping words to categories, but also with sets of N-grams derived from the text. There are several scripts out there that will help with this extraction, and it adds a few more layers of depth to Bayesian filtering. One such tool is the Ngram Statistics Package (NSP), which is easy to install and run.
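As a rough illustration of what N-gram extraction produces, here is a small Python sketch. It is not NSP, and the tokenizer is a deliberately simple assumption (lowercase words, digits, and apostrophes).

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a simplistic tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n):
    """Count how often each n-gram occurs in the text."""
    return Counter(ngrams(tokenize(text), n))
```

Feeding bigram or trigram counts to the filter alongside single words is one way to give it some of the context that a lone word lacks.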
SEO: Taking control of search
by Clay vanSchalkwijk on March 30, 2009
In my experience, the majority of web agencies and developers still do not take search seriously enough. Most businesses have very simple requests: "How do I show up for a keyword for people in the area?", "How do I show up higher than my competitor in searches?", and "How do people find my site?". The web is an economy, and driving consumers to businesses on the internet is a highly desired skill set. Consistently controlling Google's results is impossible, but there is always room for improvement on every site.
Every developer will grow their own set of tools, but the core components are available for free. Google offers Analytics to track your traffic performance, sources, and patterns. There is also the AdWords Keyword Tool, which will help you research search phrases, volume, and competition. Based on these factors and a list of similar keywords, you will be able to identify good opportunities to compete for relevant traffic. There are also the Webmaster Guidelines published by Google, which give you general best practices for search engines.
This process requires a lot of patience. It takes time for changes to take shape and for results to show up. When you make changes to an existing site, or even design a new site with SEO built in, user traffic is not going to arrive right away. Seeing the results come in will trigger a compulsion to check Analytics, keep making improvements, and identify new markets and opportunities. The vast majority of web sites are there for user consumption. SEO became big business when a lot of people figured out all at once that users translate to consumers.
Google is the search leader and therefore offers the highest return. They control the flow of traffic on the internet. Luckily, they have also published a search engine optimization starter guide in PDF format. This is the 101 of SEO, and it is pointless to chase down every obscure reference and tip on the countless SEO sites out there when the components of Google's content analysis are available all in one place. The document is a general overview but offers some very important best practice rules that are easy to implement:
Title Tags
- Choose a title that effectively communicates the topic of the page's content.
- Create unique title tags for each page
- Use brief, but descriptive titles (limit of 66 characters or 12 keywords)
Description Tags
- Accurately summarize the page's content
- Use unique descriptions for each page
- Avoid filling the description with only keywords
- Avoid copy and pasting the entire content of the document into the description meta tag
URL structure
- Use words in URLs
- Create a simple directory structure
- Provide one version of a URL to reach a document
- Use lower-case URLs (many users expect lower-case URLs and remember them better)
Site Navigation
- Create a naturally flowing hierarchy
- Use mostly text for navigation
- Use "breadcrumb" navigation
- Put an HTML sitemap page on your site, and use an XML Sitemap file
- Consider what happens when a user removes part of your URL
- Have a useful 404 page
Anchor Text (Links)
- Choose descriptive text
- Write concise text
- Format links so they're easy to spot
Heading Text
- There are six sizes of heading tags, beginning with <h1>, the most important, and ending with <h6>, the least important.
- Imagine you're writing an outline
- Use headings sparingly across the page
- Avoid using heading tags only for styling text and not presenting structure
- Avoid excessively using heading tags throughout the page
Other Confirmed Ranking Factors
- Keyword in URL
- Keyword in Domain name
- Freshness of Pages
- Freshness - Amount of Content Change
- Freshness of Links
- Site Age
- Anchor text of inbound link
- Hilltop Algorithm
- Domain Registration Time
There is a lot of helpful content in the document, but it does not go deep into the inner mechanics the way other sites attempt to do. Several sites try to go beyond what has been published and into the details of generating traffic; you just need to search Google for "Google Ranking Factors". A lot of that information comes from Google's US Patent Application #20050071741.
Use the above as a baseline for getting your site more traffic. This is a topic that is constantly being updated as search improves, and it requires a lot of time and research to do efficiently. Overhauling existing projects to meet the standards of today's crawlers is tedious, boring, and offers no immediate results. It is something I avoided in the past, but for a web site to stay competitive and, more importantly, to be seen, it has to be found. I find that having some good rules in place for dealing with SEO makes new projects much easier to handle going forward.
Bayesian Filtering & Financial Applications
by Clay vanSchalkwijk on March 27, 2009
A friend of mine and I recently started a new project. After kicking around several ideas, we finally reached a consensus on applying software prediction to financial data. This has been pursued pretty heavily, but from a homebrew standpoint we wanted to build competitive, functioning software by mashing up existing data and technology available on the internet.
We intend to predict the movement of stocks based on real-time content analysis. This requires a good deal of machine learning and historical data, but even good content analysis is not enough. Using Bayesian filtering with noise word reduction, we plan on processing historical data and assigning the content to one of three categories: moveup, movedown, nomove. To train the filters, past press releases will be fed into the filter along with the matching stock data to track how the markets reacted to the content. Over time, the software will be able to recognize keywords that trigger positive versus negative emotion in the market and drive the price one way or the other. A score can be applied much like spam scores are applied, and this number can be used as part of a greater overall algorithm to determine an action.
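To give a feel for the training step, here is a toy naive Bayes sketch in Python. The three category names come from the plan above. Everything else is an assumption for illustration only: the noise word list, the class and method names, and the add-one smoothing. It is not the software we are building.

```python
import math
import re
from collections import Counter

CATEGORIES = ("moveup", "movedown", "nomove")
# Hypothetical noise word list; a real one would be much longer.
NOISE_WORDS = {"the", "a", "an", "of", "to", "and", "in", "its", "for", "on"}

def tokenize(text):
    """Lowercase the text and drop noise words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in NOISE_WORDS]

class NaiveBayes:
    def __init__(self):
        self.word_counts = {c: Counter() for c in CATEGORIES}
        self.doc_counts = Counter()

    def train(self, text, category):
        """Record one press release under the category the stock moved in."""
        self.doc_counts[category] += 1
        self.word_counts[category].update(tokenize(text))

    def classify(self, text):
        """Return the most likely trained category, or None if untrained."""
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category in CATEGORIES:
            if self.doc_counts[category] == 0:
                continue  # nothing learned for this category yet
            # Log prior plus add-one smoothed log likelihood of each word.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in tokenize(text):
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            if score > best_score:
                best, best_score = category, score
        return best
```

In practice each press release would be labeled by what the stock did afterwards, and `classify` would be run on new releases as they come in.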
Just to bring a few readers up to speed on exactly how this will be applied, take the following formula:
$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$
Rather than training it to recognize the probability of spam we train it to recognize the probability that the word will trigger positive stock movement:
- p is the probability that the content will result in positive movement.
- p1 is the probability p(S | W1) that it is positive knowing it contains a first word (for example "capital");
- p2 is the probability p(S | W2) that it is positive knowing it contains a second word (for example "boosted");
- etc...
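As a small worked example, here is how per-word probabilities could be combined into one score in Python. It assumes the formula in question is the standard one used by Bayesian spam filters, p = p1 p2 ... pN / (p1 p2 ... pN + (1 - p1)(1 - p2) ... (1 - pN)). The function name and the 0.5 fallback are my own choices.

```python
import math

def combine(probabilities):
    """Combine per-word probabilities p1..pN into one overall probability p."""
    positive = math.prod(probabilities)                 # p1 * p2 * ... * pN
    negative = math.prod(1 - p for p in probabilities)  # (1 - p1) * ... * (1 - pN)
    if positive + negative == 0:
        # Contradictory certainties (a 0 and a 1): fall back to "no opinion".
        return 0.5
    return positive / (positive + negative)
```

For instance, two words with individual probabilities 0.9 and 0.8 combine to 0.72 / (0.72 + 0.02), or roughly 0.97, while a 0.9 and a 0.1 cancel out to about 0.5. With many words the products can underflow, so a real implementation would likely work with logarithms.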
The entire body of the content will be processed against a known database of words and the market reaction to the presence of those words. The basic Bayesian filtering will need to be extended to deal with phrase recognition, but overall it is a solid, proven machine learning technology to build from.
This information by itself is nothing revolutionary, but combined with strong pattern analysis like candlestick pattern recognition and other market indicators, it can be used to create an accurate trading platform for marginal gains, which over time can offer pretty high returns. There is certainly a lot of potential here if it works, but it depends heavily on working accurately, and there will be a lot of trial and error in the process.
For more reading on the concepts and components behind this idea, check out:
- Naive Bayes Classifier
- Candlestick Patterns
- Candlestick Charting
- Better Bayesian Filtering
- TD Ameritrade API
The nice part is that all the historical data is out there on the internet, which makes back-testing and scoring very easy to do. There will need to be a lot of testing.