Manually convert an HTML file to CSV


Fielding

Just put commas between the words.

Otherwise, if you have a lot of them, you'll need to use something like PHP.


Since the file contains a series of rather complex word definitions such as...

<p><b><h1>shrugs
</h1></b></p><div class="source-data">
        <div class="def-list">
                                        <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <span class="dbox-pg">verb (used with object)</span>, <span class="dbox-bold">shrugged, </span><span class="dbox-bold" data-syllable="shrug·ging.">shrugging.</span>                    </header>

                                            
<div class="def-set">
    <span class="def-number">1.</span>
    <div class="def-content">
        to raise and contract (the shoulders), expressing indifference, disdain, etc.    </div>
</div>
                                    </section>
                            <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <span class="dbox-pg">verb (used without object)</span>, <span class="dbox-bold">shrugged, </span><span class="dbox-bold" data-syllable="shrug·ging.">shrugging.</span>                    </header>

                                            
<div class="def-set">
    <span class="def-number">2.</span>
    <div class="def-content">
        to raise and contract the shoulders.    </div>
</div>
                                    </section>
                            <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <span class="dbox-pg">noun</span>                    </header>

                                            
<div class="def-set">
    <span class="def-number">3.</span>
    <div class="def-content">
        the movement of raising and contracting the shoulders.    </div>
</div>
                                            
<div class="def-set">
    <span class="def-number">4.</span>
    <div class="def-content">
        a short sweater or jacket that ends above or at the waistline.    </div>
</div>
                                    </section>
                            <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <span class="dbox-pg">Verb phrases</span>                    </header>

                                            
<div class="def-set">
    <span class="def-number">5.</span>
    <div class="def-content">
        <span class="dbox-bold">shrug off, </span>            <ol class="def-sub-list">
                                    <li>
                        to disregard; minimize:                             <div class="def-block def-inline-example"><span class="dbox-example">to shrug off an insult.</span></div>
                                            </li>
                                    <li>
                        to rid oneself of:                             <div class="def-block def-inline-example"><span class="dbox-example">to shrug off the effects of a drug.</span></div>
                                            </li>
                            </ol>
                        </div>
</div>
                                    </section>
                    </div>

        <div class="tail-wrapper">

    

...are you saying you would like the above to be converted into...

shrugs,verb (used with object),shrugged,shrugging.,1.,to raise and contract (the shoulders),expressing indifference, disdain, etc.,verb (used without object),shrugged,shrugging.,2.,to raise and contract the shoulders.,3.,the movement of raising and contracting the shoulders.,4.,a short sweater or jacket that ends above or at the waistline.,,Verb phrases,5.,shrug off, to disregard; minimize:,to shrug off an insult.,to rid oneself of:,to shrug off the effects of a drug.

...this introduces several problems. For one thing, there are embedded commas in the text. Also, each entry varies in length.
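
To make the embedded-comma problem concrete, here is a minimal PHP sketch. The two sample strings are adapted from the first definition above, and the empty escape argument assumes PHP 7.4 or later. Splitting on every comma breaks the definition into pieces, while a CSV parser keeps a double-quoted field together:

```php
<?php
// One record with two fields; the second field contains commas.
$unquoted = '1.,to raise and contract (the shoulders), expressing indifference, disdain, etc.';
$quoted   = '1.,"to raise and contract (the shoulders), expressing indifference, disdain, etc."';

// Splitting on every comma cuts the definition apart.
$naive = explode(',', $unquoted);

// A CSV parser treats the double-quoted text as a single field.
$parsed = str_getcsv($quoted, ',', '"', '');

echo count($naive), ' fields vs ', count($parsed), " fields\n";
// prints "5 fields vs 2 fields"
```

Quoting solves the comma problem only. The varying number of fields per word is a separate layout question.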


In HTML, an h1 cannot be nested inside a p (paragraph) element, or vice versa.

If you are using a PHP server-side script, you could wrap each string of text in a specific element, such as the rarely used '<b>...</b>', and then replace those tags with double quotes. Commas inside the quoted text are then treated as part of the field rather than as delimiters (any double quotes inside the text would need to be doubled). After that, use strip_tags to remove all remaining HTML tags so that only the quoted text is left, and separate the quoted strings with a comma delimiter.
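
A minimal sketch of that idea, with one substitution: instead of swapping the <b> tags for double quotes by hand, it collects the text inside each <b>...</b> pair and lets fputcsv add the quotes, which also doubles any embedded double quotes. The function name bold_fragments_to_csv and the sample markup are made up for illustration, and the empty escape argument assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not from the thread): collect the text of every
// <b>...</b> wrapper and return it as one CSV line.
function bold_fragments_to_csv(string $html): string
{
    // Capture everything between each <b> and the next </b>.
    preg_match_all('~<b>(.*?)</b>~is', $html, $matches);

    $fields = [];
    foreach ($matches[1] as $fragment) {
        $text = strip_tags($fragment);                    // drop nested tags such as <span>
        $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $fields[] = trim(preg_replace('/\s+/', ' ', $text)); // collapse whitespace
    }

    // Let fputcsv decide which fields need quoting.
    $out = fopen('php://memory', 'w+');
    fputcsv($out, $fields, ',', '"', '');
    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);

    return rtrim($csv, "\n");
}

echo bold_fragments_to_csv('<h1><b>shrugs</b></h1><div><b>to raise, etc.</b></div>'), "\n";
// prints: shrugs,"to raise, etc."
```

The non-greedy pattern assumes the <b> wrappers are never nested and carry no attributes. Markup that breaks either assumption would need a real HTML parser.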


Example page:

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <meta name="viewport" id="viewport" content="target-densitydpi=high-dpi,initial-scale=1.0" />
        <title>Document Title</title>
    </head>
    <body>
        <h1><b>shrugsxxxxx</b>
        </h1>
        <div class="source-data">
            <div class="def-list">
                <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <b> <span class="dbox-pg">verb (used with object)</span>, <span class="dbox-bold">shrugged, </span><span class="dbox-bold" data-syllable="shrug·ging.">shrugging.</span> </b>                   </header>


                    <div class="def-set">
                        <b><span class="def-number">1.</span></b>
                        <div class="def-content">
                            <b>to raise and contract (the shoulders), expressing indifference, disdain, etc.    </b></div>
                    </div>
                </section>
                <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <b><span class="dbox-pg">verb (used without object)</span>, <span class="dbox-bold">shrugged, </span><span class="dbox-bold" data-syllable="shrug·ging.">shrugging.</span> </b>                   </header>


                    <div class="def-set">
                        <b><span class="def-number">2.</span></b>
                        <div class="def-content">
                            <b>to raise and contract the shoulders.</b>    </div>
                    </div>
                </section>
                <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <b><span class="dbox-pg">noun</span> </b>                   </header>


                    <div class="def-set">
                        <b><span class="def-number">3.</span></b>
                        <div class="def-content">
                            <b>the movement of raising and contracting the shoulders.</b>    </div>
                    </div>

                    <div class="def-set">
                        <b><span class="def-number">4.</span></b>
                        <div class="def-content">
                            <b>a short sweater or jacket that ends above or at the waistline.</b>    </div>
                    </div>
                </section>
                <section class="def-pbk ce-spot" data-collapse-expand='{"target": ".def-set", "type": "def"}'>
                    <header class="luna-data-header">
                        <b><span class="dbox-pg">Verb phrases</span> </b>                   </header>


                    <div class="def-set">
                        <b><span class="def-number">5.</span></b>
                        <div class="def-content">
                            <b><span class="dbox-bold">shrug off, </span></b>            <ol class="def-sub-list">
                                <li>
                                    <b>to disregard; minimize:</b>                            <div class="def-block def-inline-example">
                                        <b><span class="dbox-example">to shrug off an insult.</span></b></div>
                                </li>
                                <li>
                                    <b>to rid oneself of:</b>                             <div class="def-block def-inline-example"><b><span class="dbox-example">to shrug off the effects of a drug.</span></b></div>
                                </li>
                            </ol>
                        </div>
                    </div>
                </section>
            </div>

            <div class="tail-wrapper">
            </div>
        </div>
    </body>
</html>

Code to remove the tags, replace each comma with its HTML entity (&#44;), drop each opening bold tag, and replace each closing bold tag with a comma delimiter:

<?php

header("Content-Type: text/plain");
$c = curl_init('http://localhost/web_testing/example.php');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)

$html = curl_exec($c);

if (curl_error($c)) {
    die(curl_error($c));
} else {

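    // Drop the <title> element together with its text.
    $html = strip_tags_content($html, '<title>', TRUE);
    // Encode commas inside the text so they are not mistaken for delimiters.
    $html = preg_replace('/,+/', '&#44;', $html);
    // Remove each opening <b>; turn each closing </b> into the field delimiter.
    $html = preg_replace('/<b>/', '', $html);
    $html = preg_replace('/<\/b>/', ',', $html);
    // Remove all remaining tags except <br>, then collapse runs of whitespace.
    $html = strip_tags($html, '<br>');
    $html = preg_replace('/\s+/', ' ', $html);
    // Trim trailing whitespace and the final delimiter.
    $html = rtrim($html);
    $html = rtrim($html, ',');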

    //echo $html;

    $list[] = $html;

    $file = fopen("contacts.csv", "w");

    foreach ($list as $line) {
        fputcsv($file, explode(',', $line));
    }

    fclose($file);
}
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);

curl_close($c);

function strip_tags_content($text, $tags = '', $invert = FALSE) {

    preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
    $tags = array_unique($tags[1]);

    if (is_array($tags) AND count($tags) > 0) {
        if ($invert == FALSE) {
            return preg_replace('@<(?!(?:' . implode('|', $tags) . ')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
        } else {
            return preg_replace('@<(' . implode('|', $tags) . ')\b.*?>.*?</\1>@si', '', $text);
        }
    } elseif ($invert == FALSE) {
        return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    return $text;
}

contacts.csv result:

" shrugsxxxxx"," verb (used with object)&#44; shrugged&#44; shrugging. "," 1."," to raise and contract (the shoulders)&#44; expressing indifference&#44; disdain&#44; etc. "," verb (used without object)&#44; shrugged&#44; shrugging. "," 2."," to raise and contract the shoulders."," noun "," 3."," the movement of raising and contracting the shoulders."," 4."," a short sweater or jacket that ends above or at the waistline."," Verb phrases "," 5."," shrug off&#44; "," to disregard; minimize:"," to shrug off an insult."," to rid oneself of:"," to shrug off the effects of a drug."

Code for reading the CSV file:

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <meta name="viewport" id="viewport" content="target-densitydpi=high-dpi,initial-scale=1.0,user-scalable=no" />
        <title>Document Title</title>
    </head>
    <body>
        <?php
        $file = fopen("contacts.csv", "r");
        foreach (fgetcsv($file) as $f) {
            echo $f . '<br>';
        }

        fclose($file);
        ?>
    </body>
</html>
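
The reader above only fetches the first row and leaves the &#44; placeholders for the browser to render. When the data is used outside of HTML, a sketch along the following lines could read every row and turn the placeholders back into real commas. The helper name read_csv_rows is hypothetical, and the empty escape argument assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper: read every row of a CSV file and decode
// HTML entities such as &#44; back into plain characters.
function read_csv_rows(string $path): array
{
    $rows = [];
    if (!is_readable($path)) {
        return $rows; // file missing or unreadable
    }
    $file = fopen($path, 'r');
    // fgetcsv returns false at end of file; length 0 means no line-length limit.
    while (($row = fgetcsv($file, 0, ',', '"', '')) !== false) {
        if ($row === [null]) {
            continue; // blank line
        }
        $rows[] = array_map(
            fn($field) => trim(html_entity_decode($field, ENT_QUOTES | ENT_HTML5, 'UTF-8')),
            $row
        );
    }
    fclose($file);
    return $rows;
}

foreach (read_csv_rows('contacts.csv') as $row) {
    echo implode(' | ', $row), "\n";
}
```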

It may need adjusting to handle other elements, but it works: I opened the result in Excel.
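
One possible adjustment, sketched here as an assumption rather than as part of the solution above: since each word has a different number of definitions, emit one row per definition (part of speech, number, text) instead of one long row per word. The class names def-pbk, dbox-pg, def-set, def-number and def-content come from the sample markup. The function name definitions_to_rows is made up, and the code assumes PHP 7.4 or later with the DOM extension enabled.

```php
<?php
// Hypothetical helper: one row per numbered definition
// (part of speech, number, definition text).
function definitions_to_rows(string $html): array
{
    libxml_use_internal_errors(true); // <section>/<header> may trigger parser warnings
    $doc = new DOMDocument();
    // The XML prolog is a common trick to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Collapse whitespace and trim; returns '' when the node is missing.
    $text = fn(?DOMNode $n): string =>
        $n === null ? '' : trim(preg_replace('/\s+/', ' ', $n->textContent));

    $rows = [];
    foreach ($xpath->query('//section[contains(@class, "def-pbk")]') as $section) {
        $pos = $xpath->query('.//span[contains(@class, "dbox-pg")]', $section)->item(0);
        foreach ($xpath->query('.//div[contains(@class, "def-set")]', $section) as $set) {
            $num = $xpath->query('.//span[contains(@class, "def-number")]', $set)->item(0);
            $def = $xpath->query('.//div[contains(@class, "def-content")]', $set)->item(0);
            $rows[] = [$text($pos), $text($num), $text($def)];
        }
    }
    return $rows;
}

// Sample fragment adapted from the third definition in the page above.
$sample = '<section class="def-pbk ce-spot"><header><span class="dbox-pg">noun</span></header>'
        . '<div class="def-set"><span class="def-number">3.</span>'
        . '<div class="def-content"> the movement of raising and contracting the shoulders. </div>'
        . '</div></section>';

$out = fopen('php://output', 'w');
foreach (definitions_to_rows($sample) as $row) {
    fputcsv($out, $row, ',', '"', '');
}
fclose($out);
```

Every row then has the same three columns, so the file opens as a regular table, and the <b> wrappers are no longer needed.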
