
OLDER ARTICLES

I have been doing a lot of PHP and GIS consulting for CitySquares and the History Engine. Searching for everything I needed for basic processing and Google integration was tedious and painful, so here is a collection of common functions that helped me massage the data and get it ready for integration.

    pnPoly - Used to determine if a coordinate falls inside a polygon.
    Centroid - Find the center of a polygon.
    Area - Calculate the area of a polygon.
    googleGeoCoder - Looks up GIS information for an address through Google Maps.
    PolylineEncoder - Takes a set of coordinates and encodes it for Google Maps.
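
As a rough illustration of what the first three helpers do, here is a minimal Python sketch. It is not the PHP code listed above; the function names are my own, and it assumes a simple (non-self-intersecting) polygon given as a list of (x, y) pairs treated as flat coordinates.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), for example (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count the edges crossed by a horizontal ray from the point."""
    x, y = point
    inside = False
    j = len(polygon) - 1  # index of the previous vertex
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # The edge straddles the ray's y value and the crossing lies to the right of the point.
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def signed_area(polygon: List[Point]) -> float:
    """Shoelace formula; positive when the vertices run counter-clockwise."""
    total = 0.0
    for i in range(len(polygon)):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % len(polygon)]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def centroid(polygon: List[Point]) -> Point:
    """Area-weighted centroid of a simple polygon with non-zero area."""
    area = signed_area(polygon)
    cx = cy = 0.0
    for i in range(len(polygon)):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % len(polygon)]
        cross = x0 * y1 - x1 * y0
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (6.0 * area), cy / (6.0 * area))


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon((2.0, 2.0), square))  # a point inside the square
print(point_in_polygon((5.0, 2.0), square))  # a point outside the square
print(abs(signed_area(square)), centroid(square))
```

Treating lat/lng as flat x/y is fine for small areas like a neighborhood, but the area comes out in squared degrees, so real measurements need a projected coordinate system.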

If you ran into the same problem I did, where much of the data arrives as shp/dbf files and needs to be converted to something friendlier such as KML or CSV, there are a couple of solutions. If your source coordinates are already in lat/lng, you can parse the data with shp2text. If you have a different coordinate system and use ArcGIS, you can try the Export to KML 2.5.3 plugin to export the data from within the ESRI suite of products.

Once your data is in SQL, the following query is an example of distance sorting with SQL. You can grab a copy of the zip_codes database here and play around with it.

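-- Approximate distance in miles from the point (37.6, -77.6), nearest first.
-- 69.1 is roughly the miles per degree of latitude; 53.0 is roughly the
-- miles per degree of longitude near 40 degrees north.
SELECT *,
  SQRT(POW(69.1 * (37.6 - latitude), 2) +
       POW(53.0 * (-77.6 - longitude), 2)) AS distance
FROM `zip_codes`
-- MySQL allows HAVING to filter on the computed alias without GROUP BY.
HAVING distance < 10
ORDER BY distance ASC;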
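
For anyone who wants to sanity check the numbers outside the database, here is a minimal Python sketch of the same flat-earth approximation. The one change is an assumption of mine: instead of the hard-coded 53.0, it scales longitude by the cosine of the latitude. The sample rows and the "zip" key are made up for illustration.

```python
import math

MILES_PER_DEGREE_LAT = 69.1  # same constant the query uses


def approx_distance_miles(lat1, lon1, lat2, lon2):
    """Flat-earth approximation, reasonable only over short distances."""
    # A degree of longitude shrinks with cos(latitude); the query hard-codes 53.0 instead.
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    miles_per_degree_lon = MILES_PER_DEGREE_LAT * math.cos(mean_lat)
    dy = MILES_PER_DEGREE_LAT * (lat2 - lat1)
    dx = miles_per_degree_lon * (lon2 - lon1)
    return math.sqrt(dx * dx + dy * dy)


def nearby(rows, lat, lon, radius_miles=10.0):
    """Mirror the query: keep rows within the radius, nearest first."""
    scored = []
    for row in rows:
        d = approx_distance_miles(lat, lon, row["latitude"], row["longitude"])
        if d < radius_miles:
            scored.append((d, row))
    scored.sort(key=lambda pair: pair[0])
    return [dict(row, distance=d) for d, row in scored]


# Made-up rows: the latitude/longitude keys follow the query, the values are illustrative only.
rows = [
    {"zip": "far", "latitude": 38.60, "longitude": -77.60},
    {"zip": "near", "latitude": 37.65, "longitude": -77.60},
    {"zip": "here", "latitude": 37.60, "longitude": -77.60},
]
print([r["zip"] for r in nearby(rows, 37.6, -77.6)])
```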

Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event. A probability value is assigned to each word or token, based on how often that word occurs in one category or another. The most common application of the filter is identifying words that appear in spam versus legitimate emails. A word by itself is often useless without the context it was used in.
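
To make the per-word probability concrete, here is a minimal Python sketch, assuming two categories (spam and legitimate), equal priors, and add-one smoothing. The function names and the sample messages are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def token_probabilities(spam_docs, ham_docs):
    """Estimate P(spam | word) for every word seen in training."""
    spam_counts = Counter()
    ham_counts = Counter()
    for doc in spam_docs:
        spam_counts.update(set(tokenize(doc)))  # count each word once per document
    for doc in ham_docs:
        ham_counts.update(set(tokenize(doc)))

    probabilities = {}
    for token in set(spam_counts) | set(ham_counts):
        # Add-one smoothing keeps unseen words away from exactly 0 or 1.
        p_word_given_spam = (spam_counts[token] + 1) / (len(spam_docs) + 2)
        p_word_given_ham = (ham_counts[token] + 1) / (len(ham_docs) + 2)
        # Bayes' rule with equal priors for the two categories.
        probabilities[token] = p_word_given_spam / (p_word_given_spam + p_word_given_ham)
    return probabilities


spam = ["cheap pills now", "cheap offer"]
ham = ["meeting notes attached", "lunch offer today"]
probs = token_probabilities(spam, ham)
for word in ("cheap", "offer", "meeting"):
    print(word, round(probs[word], 3))
```

A word that shows up equally often in both categories ends up at 0.5, which is exactly the "useless by itself" case.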

There is a whole suite of tools that can break down content to help improve the filter, supplementing it not only with a database mapping words to categories but also with sets of N-grams derived from the text. There are several scripts out there that will help with this extraction, and it adds a few more layers of depth to Bayesian filtering. One such tool is the Ngram Statistics Package (NSP), which is easy to install and run.
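
For a feel of what N-gram extraction produces, here is a minimal Python sketch. It does not use NSP or imitate its interface; the function names are my own and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Count all 1-grams through max_n-grams in one piece of text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


counts = ngram_counts("not a great deal, not a bad deal", max_n=2)
print(counts[("not", "a")], counts[("great", "deal")])
```

The tuples can be stored alongside single words, so a phrase like ("not", "a") gets its own probability instead of being judged one word at a time.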


In my experience, the majority of web agencies and developers still do not take search seriously enough. Most businesses have very simple questions: "How do I show up for a keyword for people in my area?", "How do I show up higher than my competitor in searches?", and "How do people find my site?" The web is an economy, and driving consumers to a business on the internet is a highly desired skill set. Consistently controlling Google's results is impossible, but there is always room for improvement on every site.

Every developer will grow their own set of tools, but the core components are available for free. Google offers Analytics to help you understand your traffic performance, sources, and patterns. There is also the AdWords Keyword Tool, which helps you research search phrases, volume, and competition. Based on these factors and a list of similar keywords, you will be able to identify good opportunities to compete for relevant traffic. Google also publishes its Webmaster Guidelines, which give general best practices for search engines.

This process requires a lot of patience. It takes time for changes to take shape and for results to show. Whether you are making changes to an existing site or designing a new one with SEO built in, user traffic is not going to arrive right away. Seeing the results come in will trigger a compulsion to check Analytics, keep making improvements, and identify new markets and opportunities. The vast majority of web sites are there for user consumption. SEO became big business when a lot of people figured out all at once that users translate to consumers.

Google is the search leader, so it offers the highest return. It controls the flow of traffic on the internet. Luckily, Google also published a search engine optimization starter guide in PDF format! This is the 101 of SEO, and it is pointless to chase down every obscure reference and tip on the countless SEO sites out there when the components of Google's content analysis are available all in one place. The document is a general overview, but it offers some very important best practice rules that are easy to implement:

Title Tags

- Choose a title that effectively communicates the topic of the page's content.
- Create unique title tags for each page
- Use brief, but descriptive titles (limit of 66 characters or 12 keywords)
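
These title rules are easy to check automatically. Here is a minimal Python sketch, assuming you already have a mapping of URL to title text; the function name and sample pages are made up, and "12 keywords" is treated as 12 words.

```python
MAX_TITLE_CHARS = 66  # limit quoted in the list above
MAX_TITLE_WORDS = 12  # "12 keywords", treated here as 12 words (an assumption)


def title_problems(titles_by_url):
    """Return {url: [problem, ...]} for titles that are empty, too long, or duplicated."""
    seen = {}
    problems = {}
    for url, title in titles_by_url.items():
        found = []
        cleaned = title.strip()
        if not cleaned:
            found.append("empty title")
        if len(cleaned) > MAX_TITLE_CHARS:
            found.append("longer than %d characters" % MAX_TITLE_CHARS)
        if len(cleaned.split()) > MAX_TITLE_WORDS:
            found.append("more than %d words" % MAX_TITLE_WORDS)
        key = cleaned.lower()
        if key and key in seen:
            found.append("duplicates the title of " + seen[key])
        elif key:
            seen[key] = url
        if found:
            problems[url] = found
    return problems


# Hypothetical pages for illustration.
pages = {
    "/": "Example Widgets: Handmade Widgets in Richmond",
    "/about": "Example Widgets: Handmade Widgets in Richmond",
    "/blog": "Blog",
}
print(title_problems(pages))
```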

Description Tags

- Accurately summarize the page's content
- Use unique descriptions for each page
- Avoid filling the description with only keywords
- Avoid copy and pasting the entire content of the document into the description meta tag

URL structure

- Use words in URLs
- Create a simple directory structure
- Provide one version of a URL to reach a document
- Use lower-case URLs (many users expect them and remember them better)
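
A minimal Python sketch of the URL rules, assuming a site where paths are not case-sensitive so lower-casing them is safe. Both function names are my own, and example.com is a placeholder.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def slugify(title):
    """Turn a page title into lower-case words separated by hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def canonical_url(url):
    """Pick one spelling of a URL: lower-case, no trailing slash, no fragment."""
    parts = urlsplit(url)
    path = parts.path.lower().rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


print(slugify("PHP & GIS: Common Functions!"))
print(canonical_url("HTTP://Example.com/Blog/My-Post/"))
```

In practice every non-canonical spelling would be redirected to the canonical one, so there is only one version of a URL that reaches a document.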

Site Navigation

- Create a naturally flowing hierarchy
- Use mostly text for navigation
- Use "breadcrumb" navigation
- Put an HTML sitemap page on your site, and use an XML Sitemap file
- Consider what happens when a user removes part of your URL
- Have a useful 404 page

Anchor Text (Links)

- Choose descriptive text
- Write concise text
- Format links so they're easy to spot

Heading Text

- There are six sizes of heading tags, beginning with <h1>, the most important, and ending with <h6>, the least important.
- Imagine you're writing an outline
- Use headings sparingly across the page
- Avoid using heading tags only for styling text and not presenting structure
- Avoid excessively using heading tags throughout the page
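
One way to check the "writing an outline" rule is to pull the heading levels out of a page and flag any jump that skips a level. This is a minimal Python sketch using the standard library HTML parser. Treating a skipped level as a problem is my own reading of the guideline, not something the guide spells out.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the level (1-6) of every heading tag in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def outline_problems(html):
    """Flag headings that jump more than one level deeper than the previous heading."""
    collector = HeadingCollector()
    collector.feed(html)
    problems = []
    previous = 0  # pretend the page starts above <h1>
    for level in collector.levels:
        if level > previous + 1:
            problems.append("h%d follows h%d" % (level, previous))
        previous = level
    return problems


print(outline_problems("<h1>Guide</h1><h3>Details</h3>"))
print(outline_problems("<h1>Guide</h1><h2>Setup</h2><h2>Usage</h2>"))
```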

Other Confirmed Ranking Factors

- Keyword in URL
- Keyword in Domain name
- Freshness of Pages
- Freshness - Amount of Content Change
- Freshness of Links
- Site Age
- Anchor text of inbound link
- Hilltop Algorithm
- Domain Registration Time

There is a lot of helpful content in the document, but it does not go deep into the inner mechanics the way other sites attempt to. Several sites try to go beyond what has been published and into the details of generating traffic; just search Google for "Google Ranking Factors". A lot of that information came out when Google released US Patent Application #20050071741.

Use the above as a baseline of steps to get your site more traffic. This is a topic that is constantly being updated as search improves, and it requires a lot of time and research to do efficiently. Overhauling existing projects to meet the standards of today's crawlers is tedious, boring, and offers no immediate results. It is something I avoided in the past, but for a web site to stay competitive and, more importantly, be seen, it has to be found. I find that having some good rules in place for SEO makes new projects much easier to deal with going forward.

A friend of mine and I recently started a new project. After kicking around several ideas, we finally reached a consensus on applying software prediction to financial data. This has been pursued pretty heavily, but from a home-brew standpoint we wanted to build competitive, functioning software by mashing up existing data and technology available on the internet.

We intend to predict the movement of stocks based on real-time content analysis. This requires a good deal of machine learning and historical data, and even good content analysis is not enough on its own. Using Bayesian filtering with noise word reduction, we plan on processing historical data and assigning the content to one of three categories: moveup, movedown, nomove. To train the filters, past press releases will be fed into the filter together with the stock data to track how the markets reacted to the content. Over time, the software should be able to recognize keywords that trigger positive versus negative emotion in the market and drive the price one way or the other. A score can be applied much like spam scores are applied, and this number can be used as part of a greater overall algorithm to determine an action.
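
To show the shape of the idea, here is a minimal Python sketch of a three-category filter with noise word removal. It is a plain multinomial naive Bayes with add-one smoothing, not our actual implementation; the class name, the tiny noise word list, and the training headlines are all made up for illustration.

```python
import math
import re
from collections import Counter

CATEGORIES = ("moveup", "movedown", "nomove")
# A tiny illustrative noise word list; a real one would be much longer.
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "its"}


def tokens(text):
    """Lowercase, split into words, and drop noise words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in NOISE_WORDS]


class ReleaseClassifier:
    def __init__(self):
        self.word_counts = {c: Counter() for c in CATEGORIES}
        self.doc_counts = Counter()

    def train(self, text, category):
        """Record the words of one press release under the observed market reaction."""
        self.word_counts[category].update(tokens(text))
        self.doc_counts[category] += 1

    def classify(self, text):
        """Return the category with the highest log-probability for the text."""
        vocabulary = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category in CATEGORIES:
            # Smoothed prior: how often this reaction was seen in training.
            score = math.log((self.doc_counts[category] + 1) / (total_docs + len(CATEGORIES)))
            total_words = sum(self.word_counts[category].values())
            for word in tokens(text):
                # Add-one smoothing, with one extra slot for words never seen in training.
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


clf = ReleaseClassifier()
clf.train("Company boosted capital and raised guidance", "moveup")
clf.train("Profit boosted by strong sales", "moveup")
clf.train("Company cuts forecast after weak sales", "movedown")
clf.train("Losses widen as demand falls", "movedown")
clf.train("Company announces annual meeting date", "nomove")
print(clf.classify("Guidance raised, capital boosted"))
```

The category label for each training release would come from the stock data, for example by comparing the price before and after the release.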

Just to bring a few readers up to speed on exactly how this will be applied, take the following formula:

Rather than training it to recognize the probability of spam we train it to recognize the probability that the word will trigger positive stock movement:

  • p is the probability that the content will result in positive movement.
  • p1 is the probability p(S | W1) that it is positive knowing it contains a first word (for example "capital");
  • p2 is the probability p(S | W2) that it is positive knowing it contains a second word (for example "boosted");
  • etc...
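
Here is a minimal Python sketch of combining the per-word values, assuming the formula meant above is the usual combination rule from naive Bayes spam filtering: p = p1 p2 ... pN / (p1 p2 ... pN + (1 - p1)(1 - p2) ... (1 - pN)). The function name is my own, and the sum is done in the log domain so long documents do not underflow.

```python
import math


def combine(probabilities, eps=1e-9):
    """Combine per-word probabilities p1..pN into one overall probability p."""
    # eta = sum(ln(1 - pi) - ln(pi)), and then p = 1 / (1 + e^eta),
    # which is algebraically the same as the product formula.
    eta = 0.0
    for p in probabilities:
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        eta += math.log(1.0 - p) - math.log(p)
    # Evaluate 1 / (1 + e^eta) without overflowing for large |eta|.
    if eta >= 0:
        z = math.exp(-eta)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(eta))


# Illustrative values for two words such as "capital" and "boosted".
print(round(combine([0.9, 0.8]), 4))
```

Words sitting at 0.5 leave the result unchanged, so only the words that lean one way or the other move the score.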

The entire body of the content will be processed against a known database of words and the market reaction to the presence of those words. The basic Bayesian filtering will need to be extended to deal with phrase recognition, but overall it is a solid, proven machine learning technology to build from.

This information by itself is nothing revolutionary, but combined with strong pattern analysis like candlestick pattern recognition and other market indicators, it could be used to build a trading platform that aims for marginal gains, which over time can add up to pretty high returns. There is certainly a lot of potential for this if it works, but it depends heavily on working accurately, and there will be a lot of trial and error in the process.

The nice part is that all the historical data is out there on the internet, which makes back-testing and scoring very easy to do, and there will need to be a lot of testing.

It has been about a month now since the rollout, and you can see the traffic trends rising since we started this process back in January. At the rate Google is crawling the data, the projection is that traffic will continue to rise well into the fall as everything is indexed.

With that said, we are about to surpass several sites in terms of traffic, including reddit.com, fark.com, mcdonalds.com, and ibm.com, to name a few. As a developer, seeing the metrics come back helps motivate and encourage the work that I've done. Even now we are still hitting speed bumps along the way. None of them are noticeable as far as traffic is concerned, but maintaining this scalability is certainly a huge task.

Using Drupal as a back end has presented several challenges for how we proceed going forward. We've decided to scrap MySQL Master/Master replication because of Drupal's sequences tables and the duplicate key problems they cause. That issue would be easily fixed if only auto increment were used... but alas, without rewriting a good chunk of the code base, we must adapt to Master/Slave replication with Read/Write splitting. It seems a week does not go by without encountering a scaling or replication pitfall. Drupal's general-compatibility attitude towards its framework makes it very difficult to leverage any particular technology like MySQL to its maximum, because the database layer is written with several database backends in mind. A word of caution for other developers planning a high traffic web site: there is a point where an up front investment in the infrastructure and backend will pay off hugely. I believe we're reaching that point.
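
For readers who have not run into Read/Write splitting before, here is a minimal Python sketch of the routing decision. It is not Drupal code and not our production setup; the class name and server names are hypothetical, and it only looks at the first word of each statement.

```python
import itertools

READ_VERBS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}


class ReadWriteRouter:
    """Pick a server per statement: reads rotate over replicas, writes go to the master."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas) if replicas else None

    def server_for(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        # SELECT ... FOR UPDATE takes locks, so it has to run on the master.
        locks_rows = sql.upper().rstrip("; \t\n").endswith("FOR UPDATE")
        if verb in READ_VERBS and not locks_rows and self._replicas is not None:
            return next(self._replicas)
        return self.master


# Hypothetical server names for illustration.
router = ReadWriteRouter("db-master", ["db-replica-1", "db-replica-2"])
print(router.server_for("SELECT nid, title FROM node WHERE nid = 1"))
print(router.server_for("UPDATE node SET title = 'x' WHERE nid = 1"))
```

The catch is replication lag: a read issued right after a write can land on a replica that has not caught up yet, so some reads still need to be pinned to the master.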

The hard question with rapid growth is whether the team is capable of adjusting at the same pace. While there is only so much that can be planned ahead, now more than ever it is important that issues are identified long before they become customer facing, because the stakes are so much higher. Despite a successful launch, there is still a lot more ahead. How much time do we invest into new features, maintenance, and re-writes? What takes a higher priority, growth or consumer experience? Do we have the resources to invest in research and development?

At the end of every milestone, I find it necessary that everyone pat themselves on the back, take a deep breath, and regroup as a team before the cycle begins all over again. The gap between the end of one project and the start of another is the most important time for management and development to be in step with each other, so everyone can move forward rowing in the same direction. Revisit company values and mission statements, and have meaningful follow-up discussions on what went well and what didn't. If the team allocates no time for dialogue, then despite accomplishing the task at hand, the same problems will occur over and over again. Not all problems in development are technical; process and communication are consistent issues that always seem to manifest one way or another when working in a collaborative environment, and it's important to determine what works well in the current situation. What may have worked in the past on a project, at a previous job, or for one person might not work now.

Congratulations on a job well done. Let's open the dialogue and relish the reflection time... that went well, what now?
