I've been working on some RSS/Atom blog aggregation software with my open source students. Recently we got everything working, and it let me do an analysis of the past 15 years of blogging by my students.
I wanted to answer the question, "What is a blog post?" That is, which HTML elements are used at all, and most often? Which are never used? My students have used every blogging platform you can think of over the years, from WordPress to Blogger to Medium, and many have rolled their own. Therefore, while not perfect, this is a pretty good view into what blogging software uses.
Analyzing many thousands of posts, and hundreds of thousands of elements, here's what I found. The top 5 elements account for 75% of all elements used. A blog post is mostly:
<br> (35%)
<p> (18%)
<a> (10%)
<div> (15%)
<li> (8%)
I'm really surprised at <br> being on top. The next 18% is made up of the following:
<td> (3%)
<strong> (3%)
<img> (3%)
<pre> (2%)
<code> (2%)
<b> (1.5%)
<em> (1.3%)
<ul> (1.1%)
<tr> (1%)
And the remainder are all used infrequently (< 1%):
<h3>
<figure>
<i>
<h4>
<blockquote>
<ol>
<hr>
<table>
<tbody>
<th>
<h5>
<iframe>
<strike>
<h6>
<thead>
<caption>
It's interesting to see that the order of the heading levels matches their frequency. I'm also interested in what isn't here. In all these posts, there's no <span>, ever.
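
For anyone who wants to try a similar tally on their own feeds, here is a minimal sketch in Python using only the standard library. It is not the code our aggregator uses; the names `TagCounter`, `count_elements`, `element_shares`, and `never_used` are made up for illustration, and it assumes each post's HTML is already available as a string.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every start tag, including void elements such as <br> and <br/>."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <BR> and <br> are tallied together.
        self.counts[tag] += 1


def count_elements(posts):
    """Return a Counter of element names across an iterable of HTML strings."""
    totals = Counter()
    for html in posts:
        parser = TagCounter()  # fresh parser per post, so a broken post can't leak into the next
        parser.feed(html)
        parser.close()
        totals.update(parser.counts)
    return totals


def element_shares(counts):
    """Return (tag, percent of all elements) pairs, most frequent first."""
    total = sum(counts.values())
    return [(tag, 100 * n / total) for tag, n in counts.most_common()]


def never_used(counts, candidates):
    """Return the candidate element names that never appear."""
    return [tag for tag in candidates if counts[tag] == 0]


if __name__ == "__main__":
    # Two tiny made-up posts standing in for real feed content.
    sample_posts = [
        "<p>Hello<br>world</p>",
        "<div><p>See <a href='#'>this</a></p><br/></div>",
    ]
    counts = count_elements(sample_posts)
    for tag, percent in element_shares(counts):
        print(f"<{tag}> ({percent:.1f}%)")
    print("never used:", never_used(counts, ["span", "p"]))
```

Counting only start tags means each element is counted once whether or not it has a closing tag, which matters for elements like <br> and <img>.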